Web Scraping with PHP Programming Language

Guides, How to, Scraping, Dec-25-2024, 5 minutes read

Web scraping has become an essential tool for developers and data analysts who need to extract and analyze information from the web. Whether you're tracking product prices, collecting data for research, or building a customized dashboard, web scraping offers endless possibilities.

If you're a PHP enthusiast, Goutte is a fantastic library to consider for your web scraping needs. Goutte is lightweight, user-friendly, and powerful, combining Guzzle’s HTTP client capabilities with Symfony's DomCrawler for smooth and efficient web scraping.

This guide will take you through the basics of web scraping with PHP using Goutte—from installation and your first script to advanced techniques like form handling and pagination.

Why Choose Goutte for Web Scraping?

Goutte has gained popularity among developers for a number of reasons, making it one of the go-to scraping libraries for PHP:

  • Simple and Clean API: Goutte provides a clean and intuitive interface that is easy to learn, even for beginners.
  • Seamless Integration: It combines HTTP requests with HTML parsing, eliminating the need for separate libraries.
  • Powerful Features: Goutte supports advanced functionality like session handling, managing cookies, and submitting forms programmatically.
  • Beginner-Friendly, Yet Robust: From the simplest scraping tasks to more complex projects, it has everything you need to get started.

Whether you're new to PHP or a seasoned developer, Goutte strikes an ideal balance between simplicity and power.

Installing Goutte

Before jumping into coding, ensure the necessary prerequisites are in place:

  • PHP Installed: Make sure you have PHP 7.3 or higher installed in your development environment. You can download PHP from the official PHP website (php.net).
  • Composer Installed: Composer is required to manage dependencies and install Goutte.

To install Goutte, simply run the following command in your terminal:

composer require fabpot/goutte

Once installed, verify the library is accessible by requiring Composer’s autoloader in your project:

require 'vendor/autoload.php';

Now you’re ready to start scraping!

Your First Web Scraping Script with Goutte

Let's begin with a simple example. We'll scrape the title of a webpage using Goutte. Below is the basic script:

Fetching and Displaying the Page Title

<?php
require 'vendor/autoload.php';

use Goutte\Client;

// Initialize Goutte Client
$client = new Client();

// Send a GET request to the target URL
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Extract the title of the page
$title = $crawler->filter('title')->text();
echo "Page Title: $title\n";

// Extract the titles of the first 5 books
echo "First 5 Book Titles:\n";
$crawler->filter('.product_pod h3 a')->slice(0, 5)->each(function ($node) {
    echo "- " . $node->attr('title') . "\n";
});
?>

Output:

Page Title: All products | Books to Scrape - Sandbox
First 5 Book Titles:
- A Light in the Attic
- Tipping the Velvet
- Soumission
- Sharp Objects
- Sapiens: A Brief History of Humankind

It’s as easy as that! With just a few lines of code, you can fetch and display the title tag of any webpage.

Extracting Data from Webpages

Once you've learned how to fetch a webpage, the next step is extracting specific data such as links or content from specific HTML elements.

Extracting All Links (`<a>` Tags)

The following script extracts the href attributes of all <a> tags on a webpage:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Extract all <a> tags
$links = $crawler->filter('a')->each(function ($node) {
    return $node->attr('href');
});

// Print all extracted links
foreach ($links as $link) {
    echo $link . "\n";
}

This will return all the hyperlinks present on the page.
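Note that many of these hrefs are relative (for example, catalogue/page-2.html). If you prefer absolute URLs, DomCrawler can resolve each href against the page's own URL through the node's link() helper. Here's a small sketch of that approach:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Let DomCrawler resolve each href against the page URL and return absolute URIs
$absoluteLinks = $crawler->filter('a')->each(function ($node) {
    return $node->link()->getUri();
});

foreach ($absoluteLinks as $link) {
    echo $link . "\n";
}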

Extracting Content by Class or ID

Goutte makes it easy to extract or parse data from HTML using class or ID selectors. For this example, we’ll use the Books to Scrape website. Specifically, we’ll scrape information about each book, since every book card on the page shares the same class, product_pod.

Here’s an example of how you can achieve this using Goutte:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Extract elements with class 'product_pod'
$products = $crawler->filter('.product_pod')->each(function ($node) {
    return $node->text();
});

// Print all extracted product details
foreach ($products as $product) {
    echo $product . "\n";
}
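Calling text() on each product returns all of its text in one lump. If you want individual fields instead, you can filter further inside each product node. The sketch below pulls the title and price from each .product_pod; the h3 a selector and its title attribute were used earlier, while the .price_color selector is an assumption based on the Books to Scrape markup:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Extract the title and price from each product card
$books = $crawler->filter('.product_pod')->each(function ($node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => $node->filter('.price_color')->text(),
    ];
});

foreach ($books as $book) {
    echo $book['title'] . ' - ' . $book['price'] . "\n";
}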

Navigating Between Pages

Now, let’s explore how to navigate between pages. On the example page we’re using, there’s a "next" button that links to the following page of results. We’ll leverage this link to implement pagination.

First, we’ll locate this element using its class attribute, which has the value next. Within it, there’s an <a> tag containing the URL of the next page. By extracting this URL, we can send a new request and seamlessly move on to the next page. On Books to Scrape, the element is an <li> with the class next that wraps the link.

Here’s what the code that achieves this looks like:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Handle pagination using the 'next' button
while ($crawler->filter('li.next a')->count() > 0) {
    // link() resolves the relative href against the current page URL
    $link = $crawler->filter('li.next a')->link();
    $crawler = $client->click($link);

    // Extract and print the current page URL
    echo "Currently on: " . $crawler->getUri() . "\n";
}

With this approach, you can automate the navigation between pages and continue scraping data.
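In practice, you’ll usually want to collect data while paginating rather than just printing URLs. Here’s a sketch that combines the earlier examples, gathering every book title across all pages (it reuses the .product_pod h3 a selector and the li.next link):

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/');

$allTitles = [];

do {
    // Collect the book titles on the current page (same selector as before)
    $titles = $crawler->filter('.product_pod h3 a')->each(function ($node) {
        return $node->attr('title');
    });
    $allTitles = array_merge($allTitles, $titles);

    // Follow the "next" link if there is one; stop when it disappears
    $next = $crawler->filter('li.next a');
    if ($next->count() === 0) {
        break;
    }

    sleep(1); // be polite between requests
    $crawler = $client->click($next->link());
} while (true);

echo count($allTitles) . " titles collected in total\n";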

Handling Forms with Goutte

Goutte is also capable of handling forms. To demonstrate this functionality, we’ll use the Scrape This Site forms page (https://www.scrapethissite.com/pages/forms/), which has a single search input field.

Here’s what the code for submitting this form looks like:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.scrapethissite.com/pages/forms/');

// Submit the search form with a query
$form = $crawler->selectButton('Search')->form();
$form['q'] = 'Canada';

$crawler = $client->submit($form);

// Extract and print the results
$results = $crawler->filter('.team')->each(function ($node) {
    return $node->text();
});

foreach ($results as $result) {
    echo $result . "\n";
}

This script fills out a form field named q with the value Canada and submits it. From here, you can extract content from the search results page just like in the earlier examples.
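Each .team row on that results page contains several columns. If you only need one of them, such as the team name, you can filter further inside the rows. In the sketch below, the .name selector is an assumption based on that page's table markup:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.scrapethissite.com/pages/forms/');

// Submit the search form as before
$form = $crawler->selectButton('Search')->form();
$form['q'] = 'Canada';
$crawler = $client->submit($form);

// Pull just the team name from each result row
// (the .name cell is assumed from the page's table markup)
$teamNames = $crawler->filter('.team .name')->each(function ($node) {
    return trim($node->text());
});

foreach ($teamNames as $name) {
    echo $name . "\n";
}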

Error Handling and Best Practices

Handling Network Errors

Always add error handling to manage unexpected situations like a failed network connection or non-existent URLs.

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();

try {
    $crawler = $client->request('GET', 'https://invalid-url-example.com');
    echo "Page title: " . $crawler->filter('title')->text();
} catch (Exception $e) {
    echo "An error occurred: " . $e->getMessage();
}
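Keep in mind that, depending on the Goutte version and its underlying HTTP client, an HTTP error such as a 404 may not throw an exception at all. In that case you can inspect the status code yourself; here is a sketch, assuming a recent Goutte/Symfony BrowserKit where getResponse()->getStatusCode() is available:

<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com/this-page-does-not-exist.html');

// Check the HTTP status code before trying to parse anything
$status = $client->getResponse()->getStatusCode();
if ($status >= 400) {
    echo "Request failed with HTTP status $status\n";
} else {
    echo "Page title: " . $crawler->filter('title')->text() . "\n";
}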

Respecting Robots.txt

Web scraping should always be performed ethically and responsibly. The `robots.txt` file is a simple text file used by websites to communicate with web crawlers, outlining which parts of the site can or cannot be accessed. Before scraping, it's important to check the `robots.txt` file to ensure you're following the site's rules and respecting their terms. Ignoring these guidelines can lead to legal and ethical issues, so always make this step a priority in your scraping process.

Rate Limiting

Be courteous and avoid sending too many requests in a short period of time, as this can overwhelm the server and disrupt its performance for other users. It's a good practice to include a short delay between each request to minimize the load on the server and ensure it can handle traffic efficiently. Taking these steps not only helps maintain server stability but also demonstrates responsible and considerate usage of shared resources.

sleep(1); // Wait 1 second between requests
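If you want to spread requests out a little less mechanically, you can also randomize the pause using PHP's usleep and random_int. A small sketch inside a scraping loop ($urls here is a hypothetical list of pages to fetch, and $client is an already-initialized Goutte client):

foreach ($urls as $url) { // $urls: hypothetical list of page URLs
    $crawler = $client->request('GET', $url);
    // ... extract whatever you need from $crawler here ...

    // Pause between 1 and 2 seconds before the next request
    usleep(random_int(1000000, 2000000));
}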

Common Pitfalls

  • Many modern websites rely on JavaScript to load content, which means traditional scraping tools may not capture all the data you need. Tools like Puppeteer or Selenium can simulate user interactions and load content as a browser would.
  • Ensure the HTTPS endpoints you scrape present valid certificates to avoid errors. Invalid or expired certificates can cause your scraper to fail or raise security concerns. Always verify the certificate status before scraping, and consider configuring your HTTP client to handle these issues; a configuration sketch follows this list.
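One way to deal with certificate and timeout issues, assuming Goutte 4.x (which is built on Symfony's HttpBrowser and HttpClient), is to pass a configured HttpClient into the Goutte client. The option names below are Symfony HttpClient options; the CA bundle path is only a placeholder:

<?php
require 'vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

// Sketch: configure the underlying HTTP client (Goutte 4.x accepts one in its constructor)
$client = new Client(HttpClient::create([
    'timeout' => 10,                     // give up after 10 seconds
    'cafile'  => '/path/to/cacert.pem',  // placeholder path to a CA bundle
]));

$crawler = $client->request('GET', 'https://books.toscrape.com/');
echo $crawler->filter('title')->text() . "\n";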

Conclusion

Web scraping is a powerful tool for gathering data efficiently, but it requires a responsible and thoughtful approach to avoid common pitfalls and ensure ethical usage. By adhering to best practices such as respecting website terms of service, implementing appropriate delays between requests, and using tools capable of handling dynamic content, you can create a scraper that performs effectively while minimizing impact on servers. Additionally, verifying HTTPS certificates and staying mindful of security considerations will protect your scraper and any data it collects. With proper planning and execution, web scraping can become an invaluable resource for research, analysis, and innovation.