Python 30‑by‑30 Course
This is it! In our final module, we bring everything together to build practical, real-world tools. You'll learn how to make Python your personal assistant, teach it to browse and extract information from the web, and cap it all off by refactoring a real-world project.
In Module 5, we learned the practices of a professional software developer:
- Organizing code with `class`, bundling data (attributes) and actions (methods) into objects.
- Using the `unittest` module to create a safety net for our code, allowing us to make changes with confidence.
- Debugging with `pdb` and recording our program's activities with the `logging` module.
In this final module, you'll put those habits to work, starting with `cron` or `schedule` to run your scripts automatically and the `pathlib` and `datetime` modules for file system tasks.
One of the most satisfying uses of Python is automating the boring stuff. Think about tasks you do over and over: renaming files, organizing downloads, creating backups. A simple Python script can do these things for you in seconds. Key to this are the `pathlib` and `datetime` modules. `pathlib` provides a clean, modern way to work with files and folders, while `datetime` lets you easily handle dates and times, like checking how old a file is.
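For instance, here's a minimal sketch of how `pathlib` and `datetime` can work together to find files that haven't been touched in a month; the Downloads folder and the 30-day cutoff are just placeholders for this illustration.
# old_files.py -- minimal sketch; folder and cutoff are example values
from pathlib import Path
from datetime import datetime, timedelta

downloads = Path.home() / "Downloads"          # example folder to inspect
cutoff = datetime.now() - timedelta(days=30)   # "old" means older than 30 days here

for item in downloads.iterdir():
    if item.is_file():
        # st_mtime is the file's last-modified time as a Unix timestamp
        modified = datetime.fromtimestamp(item.stat().st_mtime)
        if modified < cutoff:
            print(f"{item.name} was last modified on {modified:%Y-%m-%d}")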
The real magic happens when you schedule these scripts to run automatically. On Mac and Linux, a built-in tool called `cron` is the standard way to run a command at a specific time (e.g., every day at midnight). For a pure Python solution that's cross-platform, you can install a simple third-party library like `schedule`. This allows you to write, in plain English, rules like `schedule.every().monday.at("08:00").do(my_backup_job)`. With scheduling, you build a true "set it and forget it" solution.
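Here's a minimal sketch of what a `schedule`-based runner might look like; `my_backup_job` is a stand-in for your own function, and you'd install the library first with `pip install schedule`.
# run_schedule.py -- minimal sketch; my_backup_job is a placeholder for your own function
import time
import schedule  # third-party: pip install schedule

def my_backup_job():
    print("Running the backup job...")

# Run the job every Monday at 08:00
schedule.every().monday.at("08:00").do(my_backup_job)

# Keep the script alive and check once a minute for pending jobs
while True:
    schedule.run_pending()
    time.sleep(60)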
Write a script that creates a backup of a specific file. It should read the contents of a source file (e.g., `notes.txt`) and write them to a new file with a timestamp in the name, like `notes_2025-08-29.txt`.
# simple_backup.py
import datetime

source_file = 'notes.txt'

# Create a dummy source file
with open(source_file, 'w') as f:
    f.write("This is an important note.")

# Get today's date to use in the filename
today_str = datetime.date.today().isoformat()  # Format: YYYY-MM-DD
backup_filename = f"{source_file.split('.')[0]}_{today_str}.txt"

try:
    with open(source_file, 'r') as f_in:
        content = f_in.read()
    with open(backup_filename, 'w') as f_out:
        f_out.write(content)
    print(f"Backup successful! Created {backup_filename}")
except FileNotFoundError:
    print(f"Error: Source file '{source_file}' not found.")
What if the data you need isn't in a nice file, but on a website? Web scraping is the process of writing a program to automatically extract that information. It's a two-step process. First, you need to download the page's raw HTML code. The `requests` library is the gold standard for this; it makes fetching a web page as simple as `requests.get(url)`.
Second, you need to parse that messy HTML to find the specific pieces of information you want. The `BeautifulSoup` library is a fantastic tool for this. It turns the HTML into a structured object that you can easily navigate. You can ask it to find all the `<h2>` tags, or find the element with a specific CSS class, making it simple to pull out things like article titles, prices, or table data.
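As a quick illustration of that two-step flow (the URL is a placeholder, and the `<h2>`/CSS-class lookups are assumptions about the page's structure):
# two_step_scrape.py -- minimal sketch; the URL and the 'price' class are placeholders
import requests
from bs4 import BeautifulSoup

# Step 1: download the raw HTML
response = requests.get("https://example.com")
response.raise_for_status()

# Step 2: parse it and pull out the pieces you want
soup = BeautifulSoup(response.text, 'html.parser')

# All <h2> tags on the page
for heading in soup.find_all('h2'):
    print(heading.text.strip())

# The first element with a specific CSS class (assumed to exist on the page)
price = soup.find(class_='price')
if price:
    print(price.text.strip())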
A quick note on ethics: always be a polite scraper! Check a website's `robots.txt` file to see what they allow, don't hammer their server with too many requests too quickly, and never scrape personal data without permission.
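The standard library can even help with that robots.txt check; here's a minimal sketch using `urllib.robotparser` (the quotes.toscrape.com URLs are just examples):
# robots_check.py -- minimal sketch using the standard library; URLs are examples
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://quotes.toscrape.com/robots.txt")
rp.read()

# can_fetch() tells you whether the given user agent may crawl the path
if rp.can_fetch("*", "http://quotes.toscrape.com/page/1/"):
    print("OK to scrape this page.")
else:
    print("robots.txt disallows this page -- skip it.")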
Find a simple, static news website or blog that you like. Write a script that uses `requests` and `BeautifulSoup` to download the homepage and print out a list of all the article headlines you can find (they are often in `<h2>` or `<h3>` tags).
# scrape_headlines.py
import requests
from bs4 import BeautifulSoup

# This URL is a special page designed for scraping practice
URL = "http://quotes.toscrape.com/"

try:
    response = requests.get(URL)
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    soup = BeautifulSoup(response.text, 'html.parser')

    # On this site, the quotes are in a span with class="text"
    quotes = soup.find_all('span', class_='text')

    print("--- Found Quotes ---")
    for quote in quotes:
        print(f"- {quote.text}")
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
Have you ever visited a website where the content appears a second or two *after* the page loads? This is because of JavaScript. The `requests` library can't handle this; it only downloads the initial HTML. It can't run the JavaScript, so it can't see the final content. For these modern, dynamic websites, you need a more powerful tool: Selenium.
Selenium doesn't just download a page; it automates an actual web browser (like Chrome or Firefox). Your Python script can tell the browser to go to a URL, click buttons, fill out forms, and scroll down. Because it's a real browser, all the JavaScript runs just like it would for a human user. This allows your script to see the final, fully-rendered page.
The key to using Selenium effectively is learning to make your script wait. You can tell it to wait up to 10 seconds for a specific button to become clickable or for a piece of data to appear on the page. This makes your scripts robust and able to handle pages that take a few seconds to load their content.
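In Selenium, that kind of waiting is done with `WebDriverWait` and expected conditions rather than fixed `time.sleep()` calls. Here's a minimal sketch; the URL and element ID are placeholders.
# wait_example.py -- minimal sketch; the URL and element ID are placeholders
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("https://example.com")
    # Wait up to 10 seconds for the element to become clickable
    button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, "submit-button"))
    )
    button.click()
finally:
    driver.quit()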
Go to a simple e-commerce website with a search bar. Write a Selenium script that navigates to the homepage, finds the search input field, types "python book" into it, and clicks the search button.
# selenium_search.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# NOTE: You need to have geckodriver (for Firefox) or chromedriver (for Chrome) installed.
driver = webdriver.Firefox()

try:
    # books.toscrape.com (the usual practice site) has no search bar,
    # so we demonstrate the search on Wikipedia instead.
    driver.get("https://en.wikipedia.org/wiki/Main_Page")

    # Find the search input element by its ID
    search_box = driver.find_element(By.ID, "searchInput")

    # Type text into the search box and press Enter
    search_box.send_keys("Python programming")
    search_box.send_keys(Keys.RETURN)

    print("Search submitted! Waiting for results...")

    # Give the page a moment to load
    time.sleep(5)

    print(f"Current page title is: {driver.title}")
finally:
    # Always close the browser window
    driver.quit()
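Selenium can also run the browser without opening a window, which is handy once a script runs on a schedule (and is exactly what a `--headless` flag would toggle in the capstone). A minimal sketch for Firefox; Chrome has an equivalent ChromeOptions.
# headless_firefox.py -- minimal sketch of running Firefox without a visible window
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")  # don't open a browser window

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://en.wikipedia.org/wiki/Main_Page")
    print(f"Page title: {driver.title}")
finally:
    driver.quit()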
It's time to graduate from simple scripts to polished, professional-feeling Command-Line Interface (CLI) tools. A good CLI tool is more than just a script; it's a program that's easy for others to use, with clear instructions and flexible options. We'll use the `argparse` module we learned about earlier to define a robust interface for a new tool.
The goal is to combine skills from across the course. Your tool might take a file path as an argument (Day 16), parse it as a CSV (Day 17), and then use that data to scrape a website for more information (Day 27). `argparse` lets you define required inputs, optional flags (like `--verbose`), and even sub-commands (like `git pull` or `git commit`), which allow your tool to perform different actions.
Focus on the user experience. A good tool provides a helpful `--help` message, gives clear feedback as it runs, and formats its output neatly. Building a solid CLI is a fantastic way to package your automation or scraping scripts into a reusable and shareable tool.
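Sub-commands are set up with argparse's `add_subparsers()`. Here's a minimal sketch of a hypothetical tool with `fetch` and `clean` commands; the command names and their arguments are just illustrations.
# subcommands_example.py -- minimal sketch; the 'fetch' and 'clean' commands are hypothetical
import argparse

parser = argparse.ArgumentParser(description="A tool with git-style sub-commands.")
parser.add_argument("--verbose", action="store_true", help="Print extra detail.")
subparsers = parser.add_subparsers(dest="command", required=True)

# 'fetch' sub-command: takes a URL
fetch_parser = subparsers.add_parser("fetch", help="Download data from a URL.")
fetch_parser.add_argument("url", help="The URL to fetch.")

# 'clean' sub-command: takes a file path
clean_parser = subparsers.add_parser("clean", help="Tidy up a downloaded file.")
clean_parser.add_argument("path", help="Path of the file to clean.")

args = parser.parse_args()

if args.command == "fetch":
    print(f"Fetching {args.url}...")
elif args.command == "clean":
    print(f"Cleaning {args.path}...")
You would run it as, for example, `python subcommands_example.py fetch https://example.com`, and each sub-command gets its own `--help` page automatically.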
Build a simple CLI tool that scrapes the main headline from a news website. It should take one required argument: the name of the site (e.g., 'bbc', 'reuters'). Use a dictionary in your code to map the name to the actual URL and the correct CSS selector for the headline. Your tool should then print the headline.
# headline_scraper_cli.py
import argparse
import requests
from bs4 import BeautifulSoup

# A dictionary to hold the configuration for each site
SITE_CONFIG = {
    'toscrape': {
        'url': 'http://quotes.toscrape.com/',
        'selector': 'span.text'
    }
    # In a real tool, you would add more sites here
    # 'bbc': { 'url': 'https://www.bbc.com/news', 'selector': 'h3' }
}

parser = argparse.ArgumentParser(description="Scrape the main headline/quote from a website.")
parser.add_argument("site", choices=SITE_CONFIG.keys(), help="The short name of the site to scrape.")
args = parser.parse_args()

config = SITE_CONFIG[args.site]
url = config['url']
selector = config['selector']

try:
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the FIRST matching element
    headline = soup.select_one(selector)

    if headline:
        print(headline.text.strip())
    else:
        print(f"Could not find the headline using selector: {selector}")
except requests.RequestException as e:
    print(f"Error fetching URL: {e}")
Congratulations, you've made it to the final day! For your capstone project, you won't be writing a script from scratch. Instead, you'll be doing something far more common in the life of a developer: taking an existing, working script and making it better. You'll be working with the real script I wrote for Whatchan to scrape and display football listings.
Your task is to refactor this script. It already works, but it's rigid. It has hard-coded values and doesn't give the user much control or feedback. Your job is to improve it by:
- Adding `argparse` to control its behavior (e.g., run headlessly, change the output directory).
- Adding `logging` to provide clear feedback about what the script is doing and to report errors gracefully.
This project will test your ability to read someone else's code, understand its logic, and carefully modify it without breaking it. This is the ultimate test of your new skills and will leave you with a powerful, practical tool that you've made your own. Good luck!
Find the `whatchan_amended.py` script from the course materials. First, just read through it and try to understand how it works. Run it and see the output. Then, start your refactoring. Add a `--headless` flag, an `--output-dir` argument, and a `--no-images` flag. Sprinkle `logging.info()` and `logging.error()` messages throughout the code to track its progress. You've got this!
# In your refactored whatchan.py, the top would look something like this:
import argparse
import logging
from pathlib import Path
# ... other imports

# 1. Set up Logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# 2. Set up Argparse
parser = argparse.ArgumentParser(description="Scrape football listings from Whatchan.")
parser.add_argument('--headless', action='store_true', help='Run browser in headless mode.')
parser.add_argument('--output-dir', type=Path, default=Path('output'), help='Directory to save output files.')
parser.add_argument('--no-images', action='store_true', help='Do not download channel images.')
args = parser.parse_args()

# 3. Use the arguments in your main function
def main():
    logging.info("Starting the scraper...")
    # Use args.headless to configure Selenium
    # Use args.output_dir to define save paths
    # Use "if not args.no_images:" to control image downloads
    # ...
    logging.info("Scraping complete.")

if __name__ == '__main__':
    main()