Welcome to Chapter 16: Project: Simple Web Scraper!
Hello, coding adventurers! Are you ready to dive into a super practical and incredibly fun project? In this chapter, we’re going to build our very first web scraper! This means we’ll write a Python program that can visit a website, read its content (just like you do), and then extract specific pieces of information we’re interested in.
This skill is incredibly powerful. Imagine needing to collect data from many web pages, track prices, or monitor news headlines – web scraping allows your Python programs to do this automatically! We’ll be using two fantastic Python libraries: requests to fetch the web page content and Beautiful Soup to elegantly parse and navigate the HTML.
Before we begin, make sure you’re comfortable with basic Python concepts like variables, lists, loops, and functions. A little understanding of how websites work (like knowing what HTML is and what a URL represents) will also be helpful, but don’t worry, we’ll explain everything as we go! Let’s get scraping!
What is Web Scraping? Your Digital Assistant for the Web
Imagine you need to gather all the quotes from a specific website. You could sit there, manually copy-pasting each one. Or, you could teach your computer to do it for you in seconds! That’s essentially what web scraping is: using a program to automatically extract data from websites.
Think of your web browser as a highly sophisticated web scraper. When you type a URL, it sends a request to a server, gets back a bunch of HTML, CSS, and JavaScript, and then renders it into the beautiful page you see. Our Python web scraper will do the first two parts: send a request and get the raw HTML. Then, instead of rendering it, we’ll use special tools to pick out the data we want.
Why does this matter?
- Data Collection: Gathering information for research, analysis, or building datasets.
- Automation: Automating repetitive tasks like checking stock prices or news updates.
- Monitoring: Keeping an eye on changes on a website.
Ethical Considerations: Be a Good Web Citizen! Before we dive into the code, it’s crucial to talk about ethics. Web scraping is powerful, and with great power comes great responsibility!
- Check `robots.txt`: Most websites have a file called `robots.txt` (e.g., https://www.example.com/robots.txt). This file tells web crawlers (like our scraper) which parts of the site they are allowed or forbidden to access. Always respect these rules.
- Terms of Service: Many websites have terms of service that explicitly prohibit scraping. Always check if possible.
- Don’t Overload Servers: Make requests slowly. Sending too many requests too quickly can put a heavy load on a website’s server, potentially disrupting service for others. We can introduce delays in our code.
- Don’t Abuse Data: Only scrape publicly available data. Never scrape private or sensitive information.
- Identify Yourself: Sometimes, setting a user-agent header in your request can be a polite way to identify your scraper.
For this tutorial, we’ll use http://quotes.toscrape.com/, which is a website specifically designed for learning web scraping, so we can scrape it freely!
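In fact, you can automate the `robots.txt` check with Python's built-in `urllib.robotparser` module. Here's a small sketch using a made-up `robots.txt` so it runs without a network connection (in real use you'd call `parser.set_url(...)` and `parser.read()` instead of `parser.parse(...)`):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, just for illustration
robots_txt = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) tells us whether a given URL may be crawled
print(parser.can_fetch("MyScraper", "http://example.com/private/secret.html"))  # False
print(parser.can_fetch("MyScraper", "http://example.com/public/page.html"))     # True
```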
Introducing requests: Your Python Mailman for the Internet
When your browser wants to visit a website, it sends a message (an HTTP request) to the website’s server. The server then sends back a reply (an HTTP response) which contains the web page’s content. The requests library in Python makes sending these requests incredibly simple.
What requests does: It allows your Python program to send all sorts of HTTP requests (GET, POST, PUT, DELETE, etc.) and handle the responses. For web scraping, we’ll mostly be using GET requests to retrieve web pages.
Why is it important? It’s the very first step! You can’t parse a web page if you haven’t fetched it first.
How it works (simplified):
You give requests a URL, and it goes out to the internet, fetches the content at that URL, and brings it back to your Python program.
```python
import requests  # We'll import this

response = requests.get("https://www.example.com")  # Send a GET request
print(response.status_code)  # Check if it worked (200 is good!)
print(response.text)  # The actual content of the page
```
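One caveat worth knowing early: `requests.get()` can hang indefinitely if a server never responds, and an error page still comes back as a normal response. A slightly more defensive sketch (the `fetch_page` helper name is our own, not part of the library):

```python
import requests

def fetch_page(url, timeout=5):
    """Fetch a URL, returning the HTML text, or None if anything goes wrong."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises an exception for 4xx/5xx status codes
        return response.text
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None

# Example usage (requires network access):
# html = fetch_page("https://www.example.com")
```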
Introducing Beautiful Soup: Your Python HTML Interpreter
Once requests brings back the raw HTML text of a web page, it’s just a long string of characters. Trying to find specific pieces of information in that string would be like finding a needle in a haystack! This is where Beautiful Soup comes in.
What Beautiful Soup does: It takes that messy HTML string and transforms it into a Python object that you can easily navigate and search. It understands the structure of HTML (tags, attributes, nested elements) and provides simple ways to find what you’re looking for.
Why is it important? It turns raw, unstructured text into structured, searchable data. Without it, finding specific data within a web page would be a nightmare.
How it works (simplified):
You feed Beautiful Soup the raw HTML content, and it creates a “soup” object. This object represents the HTML document as a tree structure, allowing you to move from element to element, search by tag name, class, ID, or other attributes.
```python
from bs4 import BeautifulSoup  # We'll import this

html_doc = "<html><head><title>My Page</title></head><body>Hello!</body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')  # Create the soup object
print(soup.title.text)  # Easily get the title's text
```
We’ll use 'html.parser' as the second argument, which tells Beautiful Soup to use Python’s built-in HTML parser. It’s generally fast and reliable for most cases.
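To get a feel for how searching works before we tackle a real page, here's a tiny self-contained example (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up snippet, just to practice searching
html_doc = """
<html><body>
  <p class="greeting" id="hello">Hello!</p>
  <p class="greeting">Hi again!</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find('p', class_='greeting').text)      # first match: Hello!
print(len(soup.find_all('p', class_='greeting')))  # 2
print(soup.find(id='hello').text)                  # search by id: Hello!
```

We'll meet `find()`, `find_all()`, and the `class_` keyword again in the project itself.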
Step-by-Step Implementation: Building Our Scraper
Alright, let’s get our hands dirty and build our scraper piece by piece.
1. Setup Your Environment
First things first, let’s set up our project. It’s always good practice to use a virtual environment to keep your project’s dependencies separate from your system’s Python packages.
Create a Project Directory: Open your terminal or command prompt and create a new folder for our project:

```bash
mkdir simple_scraper
cd simple_scraper
```

Create a Virtual Environment: Now, let's create a virtual environment. As of December 2025, the latest stable Python version is Python 3.14.1, and we recommend using it:

```bash
python3.14 -m venv .venv
```

(If `python3.14` doesn't work, try `python -m venv .venv` and ensure your default `python` is 3.14.1 or higher. You can check with `python --version`.)

Activate the Virtual Environment:

- On macOS/Linux: `source .venv/bin/activate`
- On Windows (Command Prompt): `.venv\Scripts\activate`
- On Windows (PowerShell): `.venv\Scripts\Activate.ps1`

You should see `(.venv)` at the beginning of your terminal prompt, indicating the virtual environment is active.

Install Libraries: Now that our environment is active, let's install `requests` and `beautifulsoup4` (the package name for Beautiful Soup):

```bash
pip install requests beautifulsoup4
```

You can verify the installations by running `pip list`; you should see `requests` and `beautifulsoup4` listed.

- `requests` official documentation: https://requests.readthedocs.io/en/latest/
- Beautiful Soup official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
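If you prefer to verify from Python itself, a quick sanity check confirms both libraries are importable:

```python
# A quick sanity check that both libraries are installed and importable
import requests
import bs4

print("requests version:", requests.__version__)
print("beautifulsoup4 version:", bs4.__version__)
```

If either import fails, double-check that your virtual environment is active before running `pip install`.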
2. Create Your Scraper File
Inside your simple_scraper directory, create a new Python file named scraper.py.
Step 1: Fetching a Web Page with requests
Let’s start by getting the content of our target website: http://quotes.toscrape.com/.
Open scraper.py and add the following lines:
```python
# scraper.py
import requests

# The URL of the website we want to scrape
URL = "http://quotes.toscrape.com/"

# Send a GET request to the URL
print(f"Attempting to fetch {URL}...")
response = requests.get(URL)

# Check if the request was successful (status code 200 means OK)
if response.status_code == 200:
    print("Successfully fetched the page!")
    # Print the first 500 characters of the raw HTML content
    print("\n--- Raw HTML Snippet ---")
    print(response.text[:500])
    print("------------------------")
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
```
Explanation:
- `import requests`: This line brings the `requests` library into our script, making its functions available.
- `URL = "http://quotes.toscrape.com/"`: We define a variable `URL` to store the address of the website. This makes our code cleaner and easier to modify later.
- `response = requests.get(URL)`: This is the magic line! We call the `get()` function from the `requests` library, passing our `URL`. `requests` then goes to that address, fetches the content, and stores the entire response (including content, status code, headers, etc.) in the `response` variable.
- `if response.status_code == 200:`: The `response` object has a `status_code` attribute. A `200` means "OK" – the request was successful and the server returned the page. Other common codes are `404` (Not Found) or `500` (Internal Server Error).
- `print(response.text[:500])`: If successful, `response.text` contains the entire HTML content of the page as a single string. We print only the first 500 characters to get a peek without overwhelming our console.
Run this code:
Save scraper.py and run it from your terminal (with your virtual environment active):
python scraper.py
You should see output similar to this (the raw HTML will vary slightly):
Attempting to fetch http://quotes.toscrape.com/...
Successfully fetched the page!
--- Raw HTML Snippet ---
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
<h1
------------------------
Fantastic! You’ve just successfully fetched your first web page with Python!
Step 2: Parsing HTML with Beautiful Soup
Now that we have the raw HTML, let’s use Beautiful Soup to make it navigable.
Modify your scraper.py file. We’ll add the BeautifulSoup import and create our soup object inside the if response.status_code == 200: block.
```python
# scraper.py
import requests
from bs4 import BeautifulSoup  # Add this line

URL = "http://quotes.toscrape.com/"

print(f"Attempting to fetch {URL}...")
response = requests.get(URL)

if response.status_code == 200:
    print("Successfully fetched the page!")
    # No need to print the raw HTML snippet again; we know it works

    # Create a Beautiful Soup object from the HTML content
    print("Parsing HTML with Beautiful Soup...")
    soup = BeautifulSoup(response.text, 'html.parser')

    # Print the prettified HTML to see its structure clearly
    print("\n--- Prettified HTML Snippet (first 500 chars) ---")
    print(soup.prettify()[:500])
    print("-------------------------------------------------")
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
```
Explanation:
- `from bs4 import BeautifulSoup`: This imports the `BeautifulSoup` class from the `bs4` library.
- `soup = BeautifulSoup(response.text, 'html.parser')`: This is where the magic happens! We create an instance of `BeautifulSoup`.
  - The first argument, `response.text`, is the raw HTML string we got from `requests`.
  - The second argument, `'html.parser'`, specifies which parser Beautiful Soup should use to understand the HTML structure. Python's built-in `html.parser` is generally good for most cases.
- `print(soup.prettify()[:500])`: The `prettify()` method indents the HTML, making it much easier to read and understand its structure. We're still only showing the first 500 characters for brevity.
Run this code:
python scraper.py
You’ll now see a nicely indented HTML structure, which is much more readable than the raw text. This structured representation is what Beautiful Soup uses internally, making it easy for us to find elements.
Step 3: Finding Elements – The Page Title
Let’s try to extract something simple first: the page title. If you look at the HTML snippet (or inspect the page in your browser’s developer tools), you’ll see the title is inside a <title> tag.
Add these lines after creating the soup object:
```python
# scraper.py
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/"

print(f"Attempting to fetch {URL}...")
response = requests.get(URL)

if response.status_code == 200:
    print("Successfully fetched the page!")
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the <title> tag
    print("\n--- Extracting Page Title ---")
    title_tag = soup.find('title')  # Use .find() to get the first matching tag

    if title_tag:  # Check if the title tag was found
        page_title = title_tag.text  # Get the text content of the tag
        print(f"Page Title: {page_title}")
    else:
        print("Title tag not found.")
    print("-----------------------------")
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
```
Explanation:
- `title_tag = soup.find('title')`: The `find()` method is one of Beautiful Soup's most powerful tools. You pass it the name of the HTML tag you're looking for (e.g., `'title'`, `'p'`, `'a'`, `'div'`). It returns the first element that matches, or `None` if no such element is found.
- `if title_tag:`: It's good practice to check if `find()` actually returned an element before trying to access its properties. If `find()` returns `None`, trying `None.text` would cause an error.
- `page_title = title_tag.text`: Once we have an element (like `title_tag`), we can access its text content using the `.text` attribute.
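As an aside, Beautiful Soup also offers a shortcut for simple tag lookups: `soup.title` returns the same tag as `soup.find('title')`. A quick self-contained check (the HTML string is made up):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>My Page</title></head><body></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# The attribute shortcut and find() return the same first <title> tag
print(soup.find('title').text)  # My Page
print(soup.title.text)          # My Page
```

The explicit `find()` form is still preferable when you need to check for `None`, since a missing tag makes `soup.title.text` raise an `AttributeError`.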
Run this code:
python scraper.py
You should now see:
--- Extracting Page Title ---
Page Title: Quotes to Scrape
-----------------------------
Excellent! You’ve extracted specific data from a web page!
Step 4: Finding Multiple Elements – All Quotes
Now for the main event: let’s extract all the quotes and their authors from the page. This will involve using find_all() and then iterating through the results.
To figure out how to find the quotes, we need to inspect the HTML structure of http://quotes.toscrape.com/. If you open the page in your browser and use your browser’s developer tools (usually F12), you can click on a quote and see its HTML. You’ll notice that each quote is typically within a div tag that has a specific class, like class="quote". Inside that div, the quote text is often in a span with class="text", and the author in a small tag with class="author".
Let’s modify scraper.py to find all these quotes:
```python
# scraper.py
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/"

print(f"Attempting to fetch {URL}...")
response = requests.get(URL)

if response.status_code == 200:
    print("Successfully fetched the page!")
    soup = BeautifulSoup(response.text, 'html.parser')

    # (Optional: remove previous print statements for cleaner output if you wish)
    # print("\n--- Extracting Page Title ---")
    # title_tag = soup.find('title')
    # if title_tag:
    #     print(f"Page Title: {title_tag.text}")
    # else:
    #     print("Title tag not found.")
    # print("-----------------------------")

    print("\n--- Extracting Quotes and Authors ---")
    # Find all div elements with the class 'quote'
    quotes = soup.find_all('div', class_='quote')  # Note the underscore after class!

    # Iterate through each found quote
    for quote in quotes:
        # Inside each quote div, find the span with class 'text'
        quote_text = quote.find('span', class_='text').text
        # Inside each quote div, find the small tag with class 'author'
        author = quote.find('small', class_='author').text

        print(f"Quote: {quote_text}")
        print(f"Author: {author}")
        print("-" * 30)  # Separator for readability
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
```
Explanation:
- `quotes = soup.find_all('div', class_='quote')`: This is the key line. `find_all()` is similar to `find()`, but instead of returning just the first match, it returns a list of all matching elements.
  - We're looking for `div` tags.
  - `class_='quote'` is how we specify an HTML class attribute. Notice `class_` with an underscore: `class` is a reserved keyword in Python, so Beautiful Soup uses `class_` to avoid conflicts.
- `for quote in quotes:`: We loop through each `quote` element in the `quotes` list. Each `quote` is itself a Beautiful Soup object, representing one `<div class="quote">...</div>` block.
- `quote_text = quote.find('span', class_='text').text`: Inside each `quote` element, we use `find()` again to locate the `span` tag with `class="text"`. This demonstrates how you can chain `find()` calls to go deeper into the HTML structure. We then get its text content.
- `author = quote.find('small', class_='author').text`: Similarly, we find the `small` tag with `class="author"` within the current `quote` element and extract its text.
Run this code:
python scraper.py
You should now see a list of quotes and their authors, neatly extracted from the web page!
--- Extracting Quotes and Authors ---
Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
------------------------------
Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
------------------------------
# ... and so on for all quotes on the page ...
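In practice you'll often want to collect the results into a data structure rather than just print them. Here's a sketch of that pattern; the HTML snippet below is a made-up miniature mirroring the structure of quotes.toscrape.com, so the example runs without a network call:

```python
from bs4 import BeautifulSoup

# A small local snippet mimicking the site's structure (for illustration)
html = """
<div class="quote">
  <span class="text">“Quote one.”</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">“Quote two.”</span>
  <small class="author">Author Two</small>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Build a list of dictionaries, one per quote
results = []
for quote in soup.find_all('div', class_='quote'):
    results.append({
        'text': quote.find('span', class_='text').text,
        'author': quote.find('small', class_='author').text,
    })

print(len(results))           # 2
print(results[0]['author'])   # Author One
```

A list of dictionaries like this is easy to write out later as CSV or JSON, which we'll touch on in future chapters.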
Congratulations! You’ve built a functional web scraper! This is a fundamental skill for anyone looking to work with data from the web.
Mini-Challenge: Extracting Tags
You’ve done an amazing job extracting quotes and authors. Now, for a small challenge to solidify your understanding:
Challenge: For each quote you extract, also extract all the tags associated with it. Each quote on quotes.toscrape.com has a section of tags at the bottom.
Hint:
- Inspect the HTML for a single quote again using your browser's developer tools.
- Look for a `div` element with `class="tags"` within each `quote` block.
- Inside that `div`, you'll find multiple `a` (anchor) tags, each with `class="tag"`.
- You'll need `find_all()` again to get all the tag `a` elements, and then loop through those to get their text.
What to observe/learn: This challenge will reinforce using find_all() and nested loops to extract data from different levels of the HTML structure.
Take your time, try to solve it on your own first! If you get stuck, don’t worry, the solution is below.
Click for Hint/Solution (if you're stuck!)
```python
# ... (previous code remains the same up to the loop over quotes) ...

print("\n--- Extracting Quotes, Authors, and Tags ---")
quotes = soup.find_all('div', class_='quote')

for quote in quotes:
    quote_text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text

    # Find the div containing tags within the current quote
    tags_div = quote.find('div', class_='tags')

    # Initialize a list to hold the tags for this quote
    tags_list = []
    if tags_div:  # Ensure the tags div exists
        # Find all <a> tags with class 'tag' within the tags_div
        tags = tags_div.find_all('a', class_='tag')
        for tag in tags:
            tags_list.append(tag.text)

    print(f"Quote: {quote_text}")
    print(f"Author: {author}")
    print(f"Tags: {', '.join(tags_list)}")  # Join tags with a comma for nice output
    print("-" * 30)

# ... (rest of the code) ...
```
Common Pitfalls & Troubleshooting
Web scraping can sometimes be tricky. Here are a few common issues you might encounter and how to troubleshoot them:
`requests.exceptions.ConnectionError` (or similar network errors):

- What it means: Your program couldn't reach the website.
- Causes:
  - Incorrect URL (typo, missing `http://` or `https://`).
  - No internet connection.
  - The website is down or blocking your IP address.
  - Firewall issues.
- Troubleshooting:
  - Double-check the URL.
  - Try opening the URL in your browser to confirm it's accessible.
  - Check your internet connection.
  - If repeatedly blocked, consider adding `time.sleep(1)` (`import time`) between requests to slow down your scraping, or use a VPN/proxy (advanced).
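One way to build that delay in is a small retry helper. This is a sketch, not a battle-tested recipe: `polite_get` is our own name, and the fetching function is passed in as a parameter so the example runs (and is testable) without touching the network:

```python
import time

def polite_get(fetch, url, retries=3, delay=1.0):
    """Try fetch(url) up to `retries` times, sleeping `delay` seconds
    after each failure. Returns the result, or None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            time.sleep(delay)
    return None

# Demonstration with a fake fetcher that fails twice, then succeeds:
attempts = {"count": 0}

def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network hiccup")
    return f"contents of {url}"

print(polite_get(flaky_fetch, "http://example.com", delay=0.1))
# contents of http://example.com
```

In a real scraper you would call something like `polite_get(requests.get, url)`, keeping `delay` at a second or more to stay polite.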
`AttributeError: 'NoneType' object has no attribute 'text'` (or `'find'`, etc.):

- What it means: Beautiful Soup's `find()` method returned `None` because it couldn't find the element you asked for. When you then try to access `.text` (or `.find()`, etc.) on that `None` object, Python throws an `AttributeError`.
- Causes:
  - The HTML structure of the website changed.
  - Your selector (tag name, class, ID) is incorrect.
  - The element simply doesn't exist on the page you're scraping.
- Troubleshooting:
  - Crucial Step: Open the website in your browser and use the developer tools (usually F12, then "Inspect Element") to carefully examine the HTML structure.
  - Verify the tag names, class names, and IDs exactly. Remember `class_` for class attributes in Beautiful Soup.
  - Check if the content you're looking for is loaded dynamically by JavaScript (Beautiful Soup only sees the initial HTML). This is an advanced topic, usually requiring tools like Selenium.
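A defensive pattern that avoids this error entirely is to check for `None` before touching `.text`. A minimal sketch on a made-up snippet that deliberately lacks an author tag:

```python
from bs4 import BeautifulSoup

html = '<div class="quote"><span class="text">Hi</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Safe pattern: check for None before accessing .text
author_tag = soup.find('small', class_='author')  # not present in this snippet
author = author_tag.text if author_tag else "Unknown"
print(author)  # Unknown
```

This way a missing element yields a fallback value instead of crashing your whole scraping run.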
Getting Blocked by the Website:

- What it means: The website detected your scraper and is preventing you from accessing its content.
- Causes:
  - Sending too many requests too quickly (rate limiting).
  - Not respecting `robots.txt`.
  - The website has anti-scraping measures in place.
- Troubleshooting:
  - Be polite! Add `time.sleep(1)` (or more) between your requests.
  - Check `robots.txt` for the domain.
  - Sometimes, adding a `User-Agent` header to your request can help. This makes your scraper look more like a regular browser:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get(URL, headers=headers)
```

(Note: the user-agent string should be updated to a current one if you use this in production. You can find current user agents by searching "what is my user agent" in your browser.)
Remember, patience and careful inspection of the target website’s HTML are your best friends when troubleshooting web scraping issues!
Summary
Phew! You’ve just completed a significant project! Let’s recap what you’ve learned in this chapter:
- What Web Scraping Is: Using Python to automatically extract data from websites.
- Ethical Considerations: The importance of respecting `robots.txt`, terms of service, and not overloading servers.
- `requests` library:
  - Used to send HTTP `GET` requests to fetch web page content.
  - `requests.get(URL)` fetches the page.
  - `response.status_code` checks if the request was successful (200 is good!).
  - `response.text` gives you the raw HTML content as a string.
- `Beautiful Soup` library:
  - Used to parse raw HTML into a navigable Python object.
  - `BeautifulSoup(html_content, 'html.parser')` creates the "soup" object.
  - `soup.find('tag_name')` finds the first matching HTML tag.
  - `soup.find_all('tag_name', class_='class_name')` finds all matching HTML tags with a specific class. Remember `class_`!
  - The `.text` attribute extracts the text content from an element.
  - Chaining `find()` and `find_all()` lets you navigate nested HTML structures.
- Practical Application: You built a scraper to extract quotes, authors, and tags from a live website.
- Troubleshooting: Learned about common errors like `ConnectionError` and `AttributeError` and how to debug them.
This project lays a solid foundation for more advanced data extraction tasks. You now have a powerful tool in your Python toolkit!
What’s Next? In future chapters, we might explore:
- Storing your scraped data into structured formats like CSV files or databases.
- Handling pagination (scraping data from multiple pages).
- Dealing with more complex HTML structures or JavaScript-rendered content.
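To whet your appetite for pagination, here's a hedged sketch of a helper that finds the "Next" link on a page. It assumes the pager markup quotes.toscrape.com uses (`<li class="next"><a href="/page/2/">...</a></li>`), and the helper name `next_page_url` is our own. The demonstration parses a local snippet, so it runs without a network call:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(soup, base_url):
    """Return the absolute URL of the 'Next' pager link, or None if absent.
    Assumes markup like: <li class="next"><a href="/page/2/">Next</a></li>"""
    next_li = soup.find('li', class_='next')
    if next_li and next_li.a:
        return urljoin(base_url, next_li.a['href'])
    return None

# Local demonstration using a made-up pager snippet:
html = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
print(next_page_url(soup, "http://quotes.toscrape.com/"))
# http://quotes.toscrape.com/page/2/
```

In a full scraper you would fetch a page, extract its quotes, then call a helper like this in a loop until it returns `None` (remembering a polite `time.sleep` between requests).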
For now, give yourself a pat on the back! You’ve officially entered the world of practical web automation. Keep experimenting, keep building, and happy coding!