
Basics of web scraping

Web scraping basics with Python, including requests, BeautifulSoup, parsing HTML, extracting data, and saving results.

Introduction

In this post, we will cover the basic steps of web scraping. As an example, we will use the Wikipedia website to collect some simple information from a web page.

Web scraping is the process of extracting information from websites. It typically involves the following steps:

  1. Identify the webpage URL. Make sure you have the correct and complete URL of the target website. Before scraping, always check whether scraping is allowed by reviewing the site’s robots.txt file (e.g., https://example.com/robots.txt) and the Terms of Service.

  2. Analyze the website structure. Use your browser’s Developer Tools (right-click → Inspect) to explore the structure of the HTML document. This helps you locate where the information of interest is stored and what HTML tags, classes, or IDs are used.

  3. Identify the target data. Determine which elements contain the data you need. Note their HTML patterns — for example, data might be inside <div> tags with specific classes, or within <table> or <a> elements.

  4. Extract the data. Use Python libraries such as Beautiful Soup, Requests, or Scrapy to retrieve and parse the webpage. You can extract text, links, or attributes (like URLs or image sources) from the HTML elements you identified.

  5. Save the data. Store the collected data in a structured format that can be used for further analysis — for example, as a CSV, JSON, or Pandas DataFrame. This allows easy processing and integration with data analysis workflows.


Important considerations

  • Always respect the website’s Terms of Service and the rules specified in robots.txt.
  • Be polite: include a User-Agent header that identifies your script, avoid sending too many requests too quickly, and do not overload servers.
  • Never collect personal or sensitive data, and never attempt to bypass paywalls or security measures.
  • Handle errors gracefully: implement retries, timeouts, and error handling for blocked or failed requests (see the sketch after this list).
  • If the website provides an API, use it instead of scraping whenever possible — APIs are more stable and efficient.
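
The sketch below puts several of these points together: an identifying User-Agent, a timeout, a simple retry loop, and a pause between requests. It is a minimal illustration rather than a production crawler; the URL list, delays, and retry count are arbitrary choices for this example.

import time
import requests

headers = {"User-Agent": "Mozilla/5.0 (educational scraper)"}
urls = ["https://en.wikipedia.org/wiki/Main_Page"]  # example list of target pages

for url in urls:
    for attempt in range(3):  # retry up to 3 times per page
        try:
            r = requests.get(url, headers=headers, timeout=15)
            r.raise_for_status()
            break  # success, stop retrying
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            time.sleep(5)  # back off before retrying
    time.sleep(2)  # pause between pages so we do not overload the server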

About robots.txt

Before scraping any website, it is good practice to check its robots.txt file. This file tells automated agents (such as web scrapers and search engine crawlers) which parts of the site they are allowed or not allowed to access.

A detailed description of the standard can be found here: https://en.wikipedia.org/wiki/Robots.txt

Every website may define its own rules. For example, Wikipedia’s robots.txt file is available at: https://en.wikipedia.org/robots.txt

What robots.txt is and is not

robots.txt contains rules such as:

User-agent: *
Disallow: /some/path/
Allow: /another/path/

A few important points:

  • The robots.txt file is a guideline, not a legally binding document — but it should be respected as part of good scraping practice.
  • It tells you what the website owner prefers automated agents to avoid.
  • It helps you understand which areas of the site may be heavy to crawl or reserved for human users.
  • A Disallow rule does not protect data from access — it only asks automated tools not to fetch those paths.
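
If you want to check these rules from Python instead of reading the file by hand, the standard library includes urllib.robotparser. A minimal sketch (the page tested here is just an example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()  # download and parse the robots.txt file

# True if a generic crawler ("*") is allowed to fetch this page
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Main_Page"))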

For Wikipedia specifically

  • Wikipedia is generally friendly to scraping for educational and research purposes.
  • The robots.txt file is relatively permissive and allows crawling of most public content.
  • Sections that are disallowed typically relate to internal tools, dynamic pages, or pages that would cause unnecessary load.

Libraries

Useful Python libraries for web scraping:

  • requests – sends HTTP requests and retrieves webpage content. It allows you to set headers, handle cookies, manage sessions, and work with status codes.

  • BeautifulSoup – parses HTML/XML documents and makes it easy to navigate, search, and extract data from the page’s structure.

  • pandas – organizes extracted data into DataFrames and provides tools for cleaning, transforming, and saving results (e.g., CSV, JSON).

  • user_agent – generates realistic User-Agent strings to include in request headers, helping your scraper mimic a browser.

  • lxml – fast HTML/XML parser used by BeautifulSoup to improve performance and handle imperfect markup.

Installing Required Libraries

Before starting with web scraping, install the essential Python libraries. These packages enable you to send HTTP requests, parse HTML, manipulate data, and generate user-agent strings for polite scraping.

!pip install requests beautifulsoup4 lxml pandas user_agent

Then import the libraries:

import requests
from bs4 import BeautifulSoup
from user_agent import generate_user_agent
import pandas as pd

Define the URL to scrape

url = "https://en.wikipedia.org/wiki/Main_Page"

Define request headers

Define the headers either manually or using the user_agent library.

Option 1: Define headers manually

headers = {
    "User-Agent": "Mozilla/5.0 (educational scraper)"
}
print(headers)

Option 2: Generate a realistic User-Agent automatically

ua = generate_user_agent()
headers = {
    "User-Agent": ua
}
print(headers)

Send the request

Let’s send a GET request to the defined URL, using the User-Agent header. We will also set timeout=15, which means the request will wait up to 15 seconds for a server response before giving up.

After sending the request, we call r.raise_for_status(). This method will:

  • Raise an error if the server returns an unsuccessful status code (e.g., 404, 403, 500).
  • Stop the script early instead of continuing with invalid or empty content.
  • Make debugging easier because you immediately see what went wrong.

r = requests.get(url, headers=headers, timeout=15)
r.raise_for_status()

Status Code

When we send a request to a webpage, the server responds with an HTTP status code. This code tells us whether the request was successful or if something went wrong.

We can check the status code with print(r.status_code).

Typical status codes you will encounter:

  • 200 OK – The request was successful, and the page content is available.
  • 301 / 302 Redirect – The page has moved to another URL.
  • 403 Forbidden – Access is denied (often happens when scraping without headers).
  • 404 Not Found – The page does not exist.
  • 429 Too Many Requests – You are sending requests too quickly.
  • 500+ Server Errors – Problems on the server side.

A good response for scraping is usually status code 200.

Because we already used r.raise_for_status(), any error code (400 or 500 range) will automatically raise an exception, stopping the script and helping us diagnose the issue quickly.

print(r.status_code)
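
If you would rather handle the error yourself than let the exception stop the script, you can wrap the request in a try/except block. A small sketch reusing the url and headers defined above:

try:
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")         # e.g., 403, 404, 429 or a 5xx server error
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")     # e.g., timeout or connection problem
else:
    print("Success:", r.status_code)  # 200 means the content is ready to parse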

Parsing the page

Now we can parse the downloaded HTML using BeautifulSoup.
BeautifulSoup needs a parser – the most common choices are:

  • "lxml" – fast, robust, and good with messy HTML (recommended if lxml is installed).
  • "html.parser" – pure Python parser included in the standard library (no extra install, but usually slower and less tolerant).

We will use the response object r created earlier.

A few notes:

  • We use r.content (bytes) instead of r.text, so the parser can detect encoding more reliably.
  • The resulting soup object is a parsed representation of the page, and we can now search it for tags, classes, IDs, links, etc.

# Option 1 (recommended): use lxml
soup = BeautifulSoup(r.content, "lxml")

# Option 2: use the built-in html.parser
# soup = BeautifulSoup(r.content, "html.parser")

Extract text and attributes of interest

Before extracting data, it is useful to understand a few common HTML elements you will encounter:

  • <title> – contains the title of the webpage (what appears in the browser tab).
  • <h1>, <h2>, <h3>, … – heading elements used to structure content; <h1> is usually the main page title.
  • <a> – anchor tag used for hyperlinks; it often contains both visible text and an href attribute with the actual URL.

To locate these elements in the parsed document, BeautifulSoup offers two main methods:

  • find() – returns the first occurrence of a tag.
  • find_all() – returns all matching tags as a list.

You can search by tag name alone or include additional filters such as class, id, or other attributes.

These methods allow you to explore the structure of the page and identify where the information you want is located.

# Find title
title = soup.find('title')
print(title)

# Find the first <h1> tag
h1_tag = soup.find("h1")
print(h1_tag)

# Find all <a> tags
all_links = soup.find_all("a")
print(all_links)
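
The calls above search by tag name only. Both find() and find_all() also accept attribute filters, such as class_ (with a trailing underscore, because class is a reserved word in Python) or id. A short sketch reusing the id and class that appear later in this post:

# Find the element with a specific id
sister_list = soup.find(id="sister-projects-list")
print(sister_list is not None)

# Find all <a> tags with a specific class
extiw_links = soup.find_all("a", class_="extiw")
print(len(extiw_links))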

If we want to extract only the text content of the <title> element (without the HTML tags), we can use the .text attribute or .get_text() method. This will return the plain text inside the <title> tag.

Using .text or .get_text(strip=True) removes the surrounding HTML and gives you a clean string you can use in your analysis.

print(soup.find('title').text)
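
The same result with leading and trailing whitespace removed:

print(soup.find('title').get_text(strip=True))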

Now let’s see how to extract different kinds of information from a single tag—for example, an <h1> element.

Explanation:

  • tag_h1 – stores the first <h1> tag found on the page.
  • tag_h1.text – returns only the visible text inside the <h1> tag.
  • tag_h1["id"] – returns the value of the tag’s id attribute. (You can access any other attribute the same way, e.g., ["class"], ["title"].)

This pattern applies to any HTML tag: you can extract the whole tag, its text, or any of its attributes.

tag_h1 = soup.find("h1")

print("Full tag: ", tag_h1)
print("Just text: ", tag_h1.text)
print("Id of this tag: ", tag_h1["id"])
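
If you are not sure an attribute exists, the safer option is the .get() method, which returns None instead of raising a KeyError:

print("Id of this tag: ", tag_h1.get("id"))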

When we use find_all() to extract all <a> tags from the page, the result can be very large because most webpages contain many links.
To control this, we can use the limit parameter to return only the first n matches.

The call below, with limit=5, returns a list with only the first five <a> tags found on the page.

Using limit is helpful when:

  • you want to preview the structure of the elements before extracting all of them,
  • you want to avoid printing hundreds of results,
  • or you want to quickly test your code without processing the entire page.

print(soup.find_all("a", limit=5))

We can also use the select() method, which works similarly to find_all() but uses CSS selectors instead of tag names and attribute dictionaries.
This makes it more flexible and often more concise, especially when selecting elements by class, ID, or hierarchy.

Explanation:

  • #sister-projects-list selects the element with id="sister-projects-list".
  • li selects all list items inside that element.
  • a.extiw selects <a> tags with class extiw within each <li>.

select() returns a list of matching elements (just like find_all()), but uses the familiar CSS selector syntax you may know from web development.

In many cases, select() allows for more precise and readable queries than find() or find_all().

print(soup.select('#sister-projects-list li a.extiw'))

The way we search for elements usually gives us the full tag, but in practice we rarely need the entire HTML element.
More often we want:

  • the text inside the tag → use .get_text() or .text
  • the value of an attribute (like href) → use .get("attribute_name")

This allows us to extract clean, structured information instead of raw HTML.

Below is an example of extracting all links to Wikipedia’s sister projects.
We will store the results in a list of dictionaries and then convert it into a Pandas DataFrame.

sister_projects = []

for a in soup.select('#sister-projects-list li a.extiw'):
    title = a.get_text(strip=True)
    href = a.get("href")
    sister_projects.append({"title": title, "url": href})

# Show results
for sp in sister_projects:
    print(f"{sp['title']}: {sp['url']}")

We can put the extracted data into a Pandas DataFrame:

df = pd.DataFrame(sister_projects)
df

This gives us a clean tabular structure with one row per sister project and columns for the title and the URL.

We can also save it to a CSV file:

df.to_csv("sister_projects.csv", index=False)
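
pandas can also write other formats mentioned earlier, such as JSON (the file name is arbitrary; the indent parameter requires a reasonably recent pandas version):

# orient="records" produces a list of {"title": ..., "url": ...} objects
df.to_json("sister_projects.json", orient="records", indent=2)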