Exporting Web Scraped Data to CSV, JSON, and Databases

Python, Jan-09-20255 minutos de leitura

Web scraping has become an indispensable tool for gathering data from across the internet, empowering data analysts, tech enthusiasts, and businesses to make informed decisions. But extracting data is just the first step. To unlock its full potential, you need to export it efficiently into the right format—whether that's a CSV file for spreadsheets, JSON for APIs, or databases for large-scale storage and analysis.

This blog will take you through the essentials of exporting web-scraped data. You’ll learn step-by-step how to work with CSV and JSON files, integrate web-scraped data with databases, and make the most of your data management practices.

Pré-requisitos

Before diving into the script, let’s understand the dataset and workflow that we’ll use to demonstrate the data-saving process.

1. The Dataset

We’ll be scraping data from the website Books to Scrape, which provides a list of books along with their:

  • Title
  • Preço
  • Stock Availability

This website is designed for practice purposes, making it an ideal choice for showcasing web scraping techniques.

2. Workflow Overview

Here’s the process we’ll follow:

  • Scraping the Data: We’ll use Python’s requests e BeautifulSoup libraries to extract the book details from the website.
  • Organizing Data: The scraped data will be stored in a Pandas DataFrame. This tabular format makes it easy to manipulate and export the data.
  • Saving the Data: We’ll save the DataFrame in three formats:
    • CSV: (Comma-Separated Values)
    • JSON: (JavaScript Object Notation)
    • Database: (SQLite, a lightweight SQL database)

3. Installing Required Libraries

To run the script, you’ll need the following Python libraries:

  • requests: To fetch the webpage content.
  • beautifulsoup4: To parse the HTML structure of the webpage.
  • pandas: To organize the scraped data into a tabular format.

Install these libraries using pip. Run the following command in your terminal:

pip install requests beautifulsoup4 pandas

4. Python Script to Scrape and Save Data

Here’s the Python script to scrape the data from the website and store it in a Pandas DataFrame:


import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scrape data from the website
def scrape_books():
    url = "https://books.toscrape.com/"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Failed to load page")
    
    soup = BeautifulSoup(response.content, "html.parser")
    books = []
    
    # Extract book data
    for article in soup.find_all("article", class_="product_pod"):
        title = article.h3.a["title"]
        price = article.find("p", class_="price_color").text.strip()
        availability = article.find("p", class_="instock availability").text.strip()
        books.append({"Title": title, "Price": price, "Availability": availability})
    
    # Convert to DataFrame
    books_df = pd.DataFrame(books)
    return books_df

# Main execution
if __name__ == "__main__":
    print("Scraping data...")
    books_df = scrape_books()
    print("Data scraped successfully!")
    print(books_df)

The table we will use to demonstrate the data-saving process is structured as follows:

TitlePreçoDisponibilidade
A Light in the Attic£51.77 In stock
Tipping the Velvet£53.74In stock
Soumission£50.10In stock
Sharp Objects£47.82 In stock
Sapiens: A Brief History of Humankind£54.23NA
The Requiem Red £22.65In stock
..........

Exporting Web Scraped Data to CSV

Purpose of CSV

  • CSV (Comma-Separated Values) files are one of the simplest and most widely used formats for storing tabular data.
  • They are easy to read and write using Python and can be opened in spreadsheet tools like Microsoft Excel or Google Sheets.

Why Use CSV?

  • Ease of Use: CSV files are lightweight and portable.
  • Wide Compatibility: Most data analysis tools and platforms support CSV.
  • Simple Structure: No complex hierarchy, just rows and columns.

How to Save Data to CSV

Use the to_csv method from Pandas:

def save_to_csv(dataframe, filename="books.csv"):
    dataframe.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

Code Explanation:

  • filename: Specifies the name of the output file.
  • index=False: Ensures the index column is not included in the CSV file.

Exporting Web Scraped Data to JSON

Purpose of JSON

  • JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for both humans and machines to read and write.
  • It supports nested or hierarchical data structures, making it ideal for representing more complex datasets.

Why Use JSON?

  • APIs and Web Applications: JSON is the standard format for data exchange in APIs.
  • Human-Readable: Indented and well-structured JSON is easy to understand.
  • Supports Nested Data: Useful for more complex or relational data.

How to Save Data to JSON

Use the to_json method from Pandas:

def save_to_json(dataframe, filename="books.json"):
    dataframe.to_json(filename, orient="records", indent=4)
    print(f"Data saved to {filename}")

Code Explanation:

  • orient="records": : Each row in the DataFrame is converted into a JSON object.
  • indent=4: Formats the JSON for better readability.

Using Databases for Web Scraped Data

Purpose of a Database

  • Databases are designed to store large volumes of data efficiently and allow for complex queries.
  • SQLite, in particular, is a lightweight, file-based database that requires no server setup.

Why Use a Database?

  • Scalability: Handles larger datasets compared to CSV or JSON.
  • Querying Capabilities: Use SQL to filter, group, or sort data.
  • Data Integrity: Helps maintain structured and normalized data.

How to Save Data to a Database

Use the to_sql method from Pandas com SQLite:

import sqlite3

def save_to_database(dataframe, database_name="books.db"):
    conn = sqlite3.connect(database_name)
    dataframe.to_sql("books", conn, if_exists="replace", index=False)
    conn.close()
    print(f"Data saved to {database_name} database")

Code Explanation:

  • sqlite3.connect(database_name): Connects to the SQLite database (creates it if it doesn’t exist).
  • to_sql("books", conn, if_exists="replace", index=False):
    • Saves the DataFrame as a table named books
    • Replaces the table if it already exists.

Conclusão

While formats like CSV or JSON work well for smaller projects, databases offer superior performance, query optimization, and data integrity when handling larger datasets. The seamless integration of Pandas with SQLite makes it simple to store, retrieve, and manipulate data efficiently. Whether you're building a data pipeline or a complete application, understanding how to leverage databases will greatly enhance your ability to work with data effectively. Start using these tools today to streamline your data workflows and unlock new possibilities!