Scraping Websites for Email Extraction Using PHP

Introduction

In the previous blog, we covered the basics of extracting emails from text using PHP and MySQL. Now, we’ll take it a step further by learning how to scrape websites for email addresses. Web scraping is a powerful technique used to extract data from websites, and with the right approach, you can gather email addresses from web pages effectively.

In this blog, we will walk you through setting up a basic web scraper using PHP to extract emails from a given website and store them in a MySQL database.

1. What Is Web Scraping?

Web scraping involves using a program to automatically extract data from websites. It’s useful for tasks like gathering contact information, tracking prices, or collecting large datasets. Email extraction is one of the common use cases for web scraping.

2. Legal Considerations for Scraping

Before you begin, it’s important to understand the legal aspects of web scraping. Many websites have terms of service that restrict scraping. Be sure to:

Check the Website’s Terms: Make sure you have permission to scrape.
Respect robots.txt: This file, served from the root of the site, tells crawlers which paths they may and may not request.
Ethical Scraping: Avoid overwhelming a server with too many requests in a short time.
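
The robots.txt check can be automated before any page is fetched. The following is a minimal sketch, not a full parser. The function name isPathAllowed is hypothetical, it only reads Disallow rules that apply to "User-agent: *", and it ignores Allow rules, wildcards, and Crawl-delay.

```php
<?php
// Hypothetical helper: returns false if $path is covered by a Disallow rule
// that applies to "User-agent: *" in the given robots.txt text.
// Simplified on purpose: no Allow rules, no wildcards, no bot-specific groups.
function isPathAllowed(string $robotsTxt, string $path): bool
{
    $appliesToUs = false;
    $prevWasAgent = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules
            $appliesToUs = ($prevWasAgent && $appliesToUs) || $value === '*';
            $prevWasAgent = true;
            continue;
        }
        $prevWasAgent = false;

        // A rule matches when the path starts with the rule's value
        if ($field === 'disallow' && $appliesToUs && $value !== '') {
            if (strpos($path, $value) === 0) {
                return false;
            }
        }
    }
    return true;
}

$robots = "User-agent: *\nDisallow: /private/\n";
var_dump(isPathAllowed($robots, '/contact'));
var_dump(isPathAllowed($robots, '/private/team'));
```

In practice the robots.txt text would be fetched once per site and reused for every path on that site. With the sample rules above, /contact is allowed and /private/team is not.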

3. Tools Needed for Scraping with PHP

For scraping, you can use the following tools and libraries:

cURL: A PHP extension (built on libcurl) used for making HTTP requests.
PHP DOMDocument: A class from PHP’s DOM extension that allows you to parse HTML and XML documents.
MySQL: To store extracted emails.

4. Setting Up Your Environment

Make sure your environment is set up with PHP, cURL, and MySQL. You can use WAMP or XAMPP as your local server.

Install WAMP/XAMPP.
Verify that cURL is enabled in your PHP configuration (php.ini).
Create a MySQL database email_extractor with the emails table as we did in the previous blog.

CREATE TABLE emails (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email_address VARCHAR(255) NOT NULL,
    source VARCHAR(255)
);
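
Before writing the scraper, it can help to confirm from PHP itself that the required extensions are loaded, rather than only reading php.ini. This is a small sketch. The function name missingExtensions is made up for this example, and the list assumes the tools named in section 3 (cURL, DOMDocument from the dom extension, and mysqli for MySQL).

```php
<?php
// Hypothetical helper: returns the names of required extensions that are not loaded.
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        fn (string $ext): bool => !extension_loaded($ext)
    ));
}

$missing = missingExtensions(['curl', 'dom', 'mysqli']);

if ($missing === []) {
    echo "All required extensions are loaded.\n";
} else {
    echo "Missing extensions: " . implode(', ', $missing) . "\n";
}
```

Run it from the same server (WAMP/XAMPP) that will run the scraper, because the command-line PHP and the web server PHP can use different php.ini files.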

5. Writing the Scraper Using cURL

PHP’s cURL extension lets you make HTTP requests to websites. Here’s how to use it to get the HTML of a webpage:

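<?php
// URL of the website to scrape
$url = "https://example.com";

// Initialize cURL session
$ch = curl_init($url);

// Set cURL options
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds

// Execute cURL and store the result (false on failure)
$html = curl_exec($ch);

// Stop early if the request failed, so later steps never receive false
if ($html === false) {
    $error = curl_error($ch);
    curl_close($ch);
    die("cURL request failed: $error");
}

// Close cURL session (required before PHP 8.0, has no effect afterwards)
curl_close($ch);

// Display the HTML content
echo $html;
?>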

6. Parsing the HTML with PHP DOMDocument

Once you have the HTML content of the page, you need to extract the email addresses from it. PHP’s DOMDocument can be used to parse the HTML.

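<?php
// Load HTML content into DOMDocument.
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Get all anchor tags
$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    // Check if the link starts with 'mailto:' (common for email links)
    $href = trim($link->getAttribute('href'));
    if (stripos($href, 'mailto:') === 0) {
        // Remove the 7-character scheme and any query string such as ?subject=...
        $email = explode('?', substr($href, 7), 2)[0];
        echo "Found email: $email\n";
    }
}
?>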

This script will find any mailto: links in the page and extract the email addresses.
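
A mailto: link can carry more than a bare address. It may list several recipients separated by commas, append a query string such as ?subject=Hello, or percent-encode characters. The sketch below handles those cases. emailsFromMailto is a hypothetical helper name, and filter_var with FILTER_VALIDATE_EMAIL is used only as a basic syntax check, not as proof that the mailbox exists.

```php
<?php
// Hypothetical helper: extract the recipient addresses from a mailto: href.
function emailsFromMailto(string $href): array
{
    $href = trim($href);
    // Only handle links that actually start with the mailto: scheme
    if (stripos($href, 'mailto:') !== 0) {
        return [];
    }
    $rest = substr($href, strlen('mailto:'));
    // Drop the query part (?subject=..., ?cc=...) if present
    $recipients = explode('?', $rest, 2)[0];

    $emails = [];
    foreach (explode(',', $recipients) as $candidate) {
        // Addresses may be percent-encoded, e.g. %40 for @
        $candidate = trim(rawurldecode($candidate));
        if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
            $emails[] = $candidate;
        }
    }
    return $emails;
}

print_r(emailsFromMailto('mailto:info@example.com,sales@example.com?subject=Hello'));
```

For the sample link, the function returns both addresses and discards the subject. Inside the loop over anchor tags, each href could be passed to this helper instead of using str_replace.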

7. Extracting Emails Using Regex

Not all email addresses are in mailto: links. Some are embedded in the text, so we can use regular expressions to capture them.

<?php
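// The final part allows two or more letters, since many TLDs are longer than four (e.g. .photography)
$pattern = '/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b/i';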
preg_match_all($pattern, $html, $matches);

foreach ($matches[0] as $email) {
    echo "Found email: $email\n";
}
?>

This will scan the entire HTML content for any email patterns and display them.
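
Regex matches on raw HTML tend to include duplicates and false positives, such as asset file names like logo@2x.png. One possible cleanup step, sketched below, lowercases, validates, and de-duplicates the matches. The helper name cleanEmails and the list of file extensions are assumptions for this example. Lowercasing the whole address is also a simplification, because the part before the @ is technically case-sensitive even though most providers ignore case.

```php
<?php
// Hypothetical helper: normalize, validate, and de-duplicate candidate addresses.
function cleanEmails(array $candidates): array
{
    // File extensions that often show up in false positives like "logo@2x.png"
    $assetExtensions = ['png', 'jpg', 'jpeg', 'gif', 'svg', 'webp', 'css', 'js'];

    $clean = [];
    foreach ($candidates as $candidate) {
        $email = strtolower(trim($candidate));

        // Basic syntax check using PHP's built-in validator
        if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
            continue;
        }

        // Skip matches whose last label is really a file extension
        $lastLabel = substr($email, strrpos($email, '.') + 1);
        if (in_array($lastLabel, $assetExtensions, true)) {
            continue;
        }

        // Using the address as the array key removes duplicates
        $clean[$email] = true;
    }
    return array_keys($clean);
}

print_r(cleanEmails(['Info@Example.com', 'info@example.com', 'logo@2x.png']));
```

With the regex example above, the input would be $matches[0]. For the sample input here, only info@example.com remains.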

8. Storing Extracted Emails in MySQL

Once you’ve extracted the emails, you can store them in the MySQL database as shown in the previous blog. Here’s a quick recap of how to insert the emails into the emails table.

<?php
$servername = "localhost";
$username = "root";
$password = "";
$dbname = "email_extractor";

// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);

// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
}

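// Use a prepared statement so scraped values are never interpolated into the SQL
$stmt = $conn->prepare("INSERT INTO emails (email_address, source) VALUES (?, ?)");

foreach ($matches[0] as $email) {
    $stmt->bind_param("ss", $email, $url);
    $stmt->execute();
}

$stmt->close();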

$conn->close();
?>

This will save all the found email addresses into the emails table with the source URL.

9. Handling Multiple Pages

Many websites have multiple pages of content. To scrape them all, you can loop through the pages by dynamically modifying the URL or following pagination links.

Example:

<?php
for ($i = 1; $i <= 5; $i++) {
    $url = "https://example.com/?page=$i";
    // Fetch and process each page
    sleep(1); // Pause between requests to avoid overloading the server
}
?>
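
When page numbers cannot be guessed from the URL, the other option mentioned above is to follow pagination links. The sketch below looks for a link marked rel="next", which is a common but not universal convention. Many sites use a CSS class or link text instead, so the XPath query is an assumption to adapt per site. findNextPageUrl is a hypothetical helper name, and it returns the href exactly as written in the page, without resolving relative URLs.

```php
<?php
// Hypothetical helper: return the href of the first <a> or <link> with rel="next", or null.
function findNextPageUrl(string $html): ?string
{
    // loadHTML() rejects an empty string, so handle that case up front
    if (trim($html) === '') {
        return null;
    }

    // Collect parser warnings quietly, then restore the previous setting
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//a[@rel="next"][@href] | //link[@rel="next"][@href]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

var_dump(findNextPageUrl('<html><body><a href="/page/2" rel="next">Next</a></body></html>'));
```

A crawl loop would call this after each fetch and stop when it returns null. A maximum page count and a short sleep() between requests keep the loop from running away or overloading the server.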

Conclusion

In this blog, we covered the process of scraping websites for email extraction using PHP. We learned how to:

Use cURL to fetch webpage HTML.
Parse HTML using PHP DOMDocument.
Extract emails using mailto: links and regex.
Store the results in a MySQL database.