Scraping Websites for Email Extraction Using PHP
Introduction
In the previous blog, we covered the basics of extracting emails from text using PHP and MySQL. Now, we’ll take it a step further by learning how to scrape websites for email addresses. Web scraping is a powerful technique used to extract data from websites, and with the right approach, you can gather email addresses from web pages effectively.
In this blog, we will walk you through setting up a basic web scraper using PHP to extract emails from a given website and store them in a MySQL database.
1. What Is Web Scraping?
Web scraping involves using a program to automatically extract data from websites. It’s useful for tasks like gathering contact information, tracking prices, or collecting large datasets. Email extraction is one of the common use cases for web scraping.
2. Legal Considerations for Scraping
Before you begin, it’s important to understand the legal aspects of web scraping. Many websites have terms of service that restrict scraping. Be sure to:
- Check the Website’s Terms: Make sure you have permission to scrape.
- Respect robots.txt: This file, served from the root of a site, tells crawlers which paths they may and may not fetch.
- Ethical Scraping: Avoid overwhelming a server with too many requests in a short time.
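To make the robots.txt point concrete, here is a minimal sketch of checking a path against a site's Disallow rules. The function name isPathAllowed is hypothetical, and the logic is deliberately simplified: it only reads the "User-agent: *" group and uses plain prefix matching, ignoring Allow rules and wildcards. Treat it as a starting point, not a complete parser.

```php
<?php
// Hypothetical helper: returns true if $path is not blocked by a Disallow
// rule in the "User-agent: *" group of the given robots.txt text.
// Simplification: prefix matching only; Allow rules and wildcards are ignored.
function isPathAllowed(string $robotsTxt, string $path): bool
{
    $inWildcardGroup = false; // are we inside a group that names "*"?
    $lastWasAgent = false;    // was the previous rule line a User-agent line?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$lastWasAgent) {
                $inWildcardGroup = false; // a new group starts here
            }
            if ($value === '*') {
                $inWildcardGroup = true;
            }
            $lastWasAgent = true;
            continue;
        }

        $lastWasAgent = false;
        if ($field === 'disallow' && $inWildcardGroup && $value !== '') {
            if (strpos($path, $value) === 0) {
                return false; // path starts with a disallowed prefix
            }
        }
    }
    return true;
}

$robots = "User-agent: *\nDisallow: /private/\n";
var_dump(isPathAllowed($robots, '/contact'));       // bool(true)
var_dump(isPathAllowed($robots, '/private/staff')); // bool(false)
```

In a real scraper you would fetch the site's /robots.txt once, run this check before each request, and pause between requests (for example with sleep()) so you don't overwhelm the server.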
3. Tools Needed for Scraping with PHP
For scraping, you can use the following tools and libraries:
- cURL: PHP's cURL extension (built on libcurl), used for making HTTP requests.
- PHP DOMDocument: A built-in class from PHP's DOM extension that parses HTML and XML documents.
- MySQL: To store extracted emails.
4. Setting Up Your Environment
Make sure your environment is set up with PHP, cURL, and MySQL. You can use WAMP or XAMPP as your local server.
- Install WAMP/XAMPP.
- Verify that cURL is enabled in your PHP configuration (php.ini).
- Create a MySQL database email_extractor with the emails table, as we did in the previous blog.
CREATE TABLE emails (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email_address VARCHAR(255) NOT NULL,
    source VARCHAR(255)
);
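Before writing the scraper, it can help to confirm that PHP actually has the extensions this tutorial relies on. The sketch below is one way to do that; the helper name missingExtensions is hypothetical, and it assumes PHP 7.4 or later (for the arrow function).

```php
<?php
// Hypothetical helper: returns the names of the required extensions
// that are NOT loaded in the current PHP installation.
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        fn(string $ext): bool => !extension_loaded($ext)
    ));
}

// Extensions used in this tutorial: curl (HTTP requests),
// dom (DOMDocument), and mysqli (MySQL access)
$missing = missingExtensions(['curl', 'dom', 'mysqli']);

if ($missing === []) {
    echo "All required extensions are loaded.\n";
} else {
    echo "Missing extensions: " . implode(', ', $missing) . "\n";
}
```

If anything is reported missing, enable it in php.ini and restart your local server.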
5. Writing the Scraper Using cURL
cURL is a powerful PHP library that allows you to make HTTP requests to websites. Here’s how to use it to get the HTML of a webpage:
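<?php
// URL of the website to scrape
$url = "https://example.com";
// Initialize cURL session
$ch = curl_init($url);
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Follow redirects and give up after 10 seconds
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Execute the request; curl_exec() returns false on failure
$html = curl_exec($ch);
if ($html === false) {
    die("cURL error: " . curl_error($ch));
}
// Close cURL session
curl_close($ch);
// Display the HTML content
echo $html;
?>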
6. Parsing the HTML with PHP DOMDocument
Once you have the HTML content of the page, you need to extract the email addresses from it. PHP’s DOMDocument can be used to parse the HTML.
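<?php
// Load HTML content into DOMDocument
$dom = new DOMDocument();
// Collect parser warnings about malformed HTML instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
// Get all anchor tags
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    // Keep only links whose href starts with 'mailto:' (common for email links)
    $href = trim($link->getAttribute('href'));
    if (stripos($href, 'mailto:') === 0) {
        // Drop the scheme and any ?subject=... query string, then decode %xx escapes
        $email = rawurldecode(explode('?', substr($href, 7), 2)[0]);
        echo "Found email: $email\n";
    }
}
?>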
This script will find any mailto: links in the page and extract the email addresses.
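Real-world mailto: links are often messier than a single address: they can carry a query string (such as ?subject=...), list several comma-separated recipients, or use percent-encoding. Here is a minimal sketch of a parser that handles those cases. The function name parseMailto is hypothetical, and it relies on PHP's built-in filter_var() for a syntax check.

```php
<?php
// Hypothetical helper: extracts the list of addresses from a mailto: href.
// Returns an empty array when the href is not a mailto: link.
function parseMailto(string $href): array
{
    $href = trim($href);
    if (stripos($href, 'mailto:') !== 0) {
        return [];
    }
    // Remove the scheme, then anything after "?" (subject, body, cc, ...)
    $addressPart = explode('?', substr($href, 7), 2)[0];

    $emails = [];
    foreach (explode(',', $addressPart) as $candidate) {
        // Decode %xx escapes and strip stray whitespace
        $candidate = trim(rawurldecode($candidate));
        // Keep only strings that are syntactically valid addresses
        if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
            $emails[] = $candidate;
        }
    }
    return $emails;
}

print_r(parseMailto('mailto:info@example.com?subject=Hello'));
print_r(parseMailto('mailto:a@example.com,b@example.com'));
```

You could call this helper on each href inside the loop over anchor tags and merge the results into one list.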
7. Extracting Emails Using Regex
Not all email addresses are in mailto: links. Some are embedded in the text, so we can use regular expressions to capture them.
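<?php
// {2,} allows long TLDs such as .museum; {2,4} would miss them
$pattern = '/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b/i';
preg_match_all($pattern, $html, $matches);
// Remove duplicates before displaying
$emails = array_unique($matches[0]);
foreach ($emails as $email) {
    echo "Found email: $email\n";
}
?>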
This will scan the entire HTML content for any email patterns and display them.
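A regex scan over raw HTML tends to return duplicates and mixed-case variants of the same address. The sketch below wraps the regex step in a reusable function that lowercases, validates, and de-duplicates the results. The name extractEmails is hypothetical, and the pattern is a simplified one that will not cover every valid address.

```php
<?php
// Hypothetical helper: returns the unique, lowercased, syntactically valid
// email addresses found in $text, in order of first appearance.
function extractEmails(string $text): array
{
    // Simplified pattern; {2,} permits long TLDs such as .museum
    $pattern = '/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i';
    preg_match_all($pattern, $text, $matches);

    $emails = [];
    foreach ($matches[0] as $candidate) {
        $candidate = strtolower($candidate);
        if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
            $emails[$candidate] = true; // array keys give us de-duplication
        }
    }
    return array_keys($emails);
}

$sample = '<p>Contact Sales@Example.com or sales@example.com, '
        . 'or support@example.museum.</p>';
print_r(extractEmails($sample));
```

Lowercasing the whole address is a pragmatic choice for de-duplication; strictly speaking the local part of an address can be case-sensitive, so skip that step if you need the original casing.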
8. Storing Extracted Emails in MySQL
Once you’ve extracted the emails, you can store them in the MySQL database as shown in the previous blog. Here’s a quick recap of how to insert the emails into the emails table.
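<?php
$servername = "localhost";
$username = "root";
$password = "";
$dbname = "email_extractor";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
}
// Use a prepared statement so scraped text is never interpolated into the SQL
$stmt = $conn->prepare("INSERT INTO emails (email_address, source) VALUES (?, ?)");
// Parameters are bound by reference, so each loop iteration sends the current $email
$stmt->bind_param("ss", $email, $url);
foreach (array_unique($matches[0]) as $email) {
    $stmt->execute();
}
$stmt->close();
$conn->close();
?>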
This will save all the found email addresses into the emails table with the source URL.
9. Handling Multiple Pages
Many websites have multiple pages of content. To scrape them all, you can loop through the pages by dynamically modifying the URL or following pagination links.
Example:
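for ($i = 1; $i <= 5; $i++) {
    // Hypothetical URL pattern; adjust it to the site's actual pagination scheme
    $url = "https://example.com/?page=$i";
    // Fetch and process each page
    sleep(1); // Pause between requests to avoid overloading the server
}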
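When page numbers aren't predictable, the other approach mentioned above is to follow pagination links. Here is a minimal sketch that looks for a rel="next" link in a page's HTML. The function name findNextPageHref is hypothetical, and it assumes the site marks its "next" link with rel="next", which many sites do not.

```php
<?php
// Hypothetical helper: returns the href of the first <a rel="next"> or
// <link rel="next"> element in $html, or null if none is present.
function findNextPageHref(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }
    $dom = new DOMDocument();
    // Silence warnings from messy real-world HTML, then restore the old setting
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//a[@rel="next"][@href] | //link[@rel="next"][@href]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

$html = '<html><body><a href="/page/1">1</a> '
      . '<a rel="next" href="/page/2">Next</a></body></html>';
var_dump(findNextPageHref($html)); // string(7) "/page/2"
```

The returned href may be relative, so you would need to resolve it against the current page's URL before fetching it. It is also wise to cap the number of pages you follow and to pause between requests.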
Conclusion
In this blog, we covered the process of scraping websites for email extraction using PHP. We learned how to:
- Use cURL to fetch webpage HTML.
- Parse HTML using PHP DOMDocument.
- Extract emails using mailto: links and regex.
- Store the results in a MySQL database.