Building a Comprehensive Email Extraction Tool: Integrating All Techniques in PHP

Introduction

Welcome back to our email extraction series! So far we’ve covered techniques for extracting emails from static HTML pages, JavaScript-rendered content, and documents such as PDFs and images. In this final installment, we combine those techniques into a single email extraction tool built with PHP and MySQL. The tool extracts email addresses from several kinds of input and stores them in a database for later use.

Project Overview

Our objective is to create a PHP application that:

Accepts URLs for email extraction.
Identifies the content type (static HTML, JavaScript-rendered, PDF, etc.).
Extracts email addresses based on the content type.
Stores the extracted emails in a MySQL database for easy access and management.

By the end of this post, you will have a fully functional email extraction tool that can be further customized to suit your needs.

Setting Up the Environment

Before we dive into coding, ensure that you have the following set up in your development environment:

PHP: Make sure you have PHP installed on your local server. You can check this by running php -v in your terminal.
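Composer: Composer is PHP's dependency manager. Install it by following the instructions on the Composer website, then run composer require php-webdriver/webdriver smalot/pdfparser in the project directory to install the two libraries the extractors below depend on.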
MySQL: Set up a MySQL database to store the extracted emails. If you don’t have MySQL installed, consider using tools like XAMPP or MAMP, which bundle Apache, PHP, and MySQL together.
Selenium: If you plan to extract emails from JavaScript-rendered content, make sure Selenium WebDriver is set up as discussed in our previous post. The code below expects a Selenium server listening on http://localhost:4444, which lets us automate browser actions.

Database Setup

To store the extracted email addresses, create a database and a corresponding table. Here’s how you can do this using SQL commands:

CREATE DATABASE email_extractor;
USE email_extractor;

CREATE TABLE emails (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email_address VARCHAR(255) UNIQUE NOT NULL
);

The UNIQUE constraint on email_address makes MySQL reject a second row with the same address, so each email is stored only once.

Building the Email Extraction Tool

1. Define the Directory Structure

Organizing your project files properly will help you manage and maintain the code efficiently. Here’s a recommended directory structure:

email_extractor/
├── composer.json
├── index.php
├── extractors/
│   ├── PdfExtractor.php
│   ├── HtmlExtractor.php
│   └── JsExtractor.php
├── vendor/          (created by Composer)
└── db.php

2. Create Database Connection

Create a db.php file for MySQL connection to centralize database operations. Here’s a sample implementation:

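<?php
$host = 'localhost'; // or your host
$username = 'your_username'; // your MySQL username
$password = 'your_password'; // your MySQL password
$dbname = 'email_extractor'; // your database name

// Make mysqli throw exceptions on errors (already the default since PHP 8.1)
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

try {
    $mysqli = new mysqli($host, $username, $password, $dbname);
    $mysqli->set_charset('utf8mb4'); // Use UTF-8 for the connection
} catch (mysqli_sql_exception $e) {
    error_log('Connection failed: ' . $e->getMessage()); // Log the details instead of showing them to visitors
    http_response_code(500);
    exit('Database connection failed.');
}
?>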

3. Create Extractors

In the extractors folder, we will create three classes, each responsible for extracting emails from a specific content type.

HtmlExtractor.php: Handles static HTML extraction.

<?php
class HtmlExtractor {
    public function extract($html) {
        // Regular expression to match email addresses
        preg_match_all("/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/", $html, $matches);
        return $matches[0]; // Returns an array of email addresses
    }
}
?>

This class utilizes a regular expression to find and return all email addresses in the provided HTML content.
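The pattern above can return the same address several times, and it misses addresses whose @ sign is written as an HTML entity. A minimal sketch of one way to post-process the matches follows. The function name extractUniqueEmails is hypothetical and not part of the classes in this post, and lowercasing the whole address is an assumption that suits most mail providers rather than a strict rule.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: extract, normalise, and de-duplicate email addresses.
 * Uses the same pattern as HtmlExtractor, then validates each match.
 */
function extractUniqueEmails(string $text): array
{
    // Decode entities so "sales&#64;example.org" becomes "sales@example.org"
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    preg_match_all('/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/', $decoded, $matches);

    $unique = [];
    foreach ($matches[0] as $candidate) {
        // Assumption: treat addresses as case-insensitive
        $email = strtolower($candidate);

        // Discard matches that PHP's built-in validator rejects
        if (filter_var($email, FILTER_VALIDATE_EMAIL) !== false) {
            $unique[$email] = true; // Array keys de-duplicate and keep first-seen order
        }
    }

    return array_keys($unique);
}
```

Calling extractUniqueEmails($html) in place of the bare preg_match_all() call would hand the database step a shorter, cleaner list, though the UNIQUE constraint still acts as the final safeguard.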

JsExtractor.php: Handles JavaScript-rendered content extraction using Selenium.

<?php
require_once __DIR__ . '/../vendor/autoload.php'; // Load Composer's autoloader from the project root
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

class JsExtractor {
    private $driver;

    public function __construct() {
        // Selenium Server URL (Selenium 3 and older use http://localhost:4444/wd/hub)
        $host = 'http://localhost:4444';
        $this->driver = RemoteWebDriver::create($host, DesiredCapabilities::chrome());
    }

    public function extract($url) {
        try {
            $this->driver->get($url); // Navigate to the URL
            $this->driver->wait()->until(
                WebDriverExpectedCondition::presenceOfElementLocated(WebDriverBy::cssSelector('body')) // Wait for body to load
            );
            $html = $this->driver->getPageSource(); // Get the rendered page source
        } finally {
            $this->driver->quit(); // Close the browser even if navigation or the wait fails
        }

        // Extract email addresses from the HTML
        preg_match_all("/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/", $html, $matches);
        return $matches[0]; // Returns an array of email addresses
    }
}
?>

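In this class, we start a Selenium WebDriver session, navigate to the specified URL, wait until the body element is present, and then read the rendered page source for email extraction. Because extract() closes the browser when it finishes, create a new JsExtractor for each URL. Pages that inject content well after the body appears may need a more specific wait condition.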

PdfExtractor.php: Handles PDF email extraction.

<?php
require_once __DIR__ . '/../vendor/autoload.php'; // Load Composer's autoloader from the project root
use Smalot\PdfParser\Parser;

class PdfExtractor {
    public function extract($filePath) {
        $parser = new Parser(); // Initialize PDF parser
        $pdf = $parser->parseFile($filePath); // Parse the PDF file
        $text = $pdf->getText(); // Extract text from the PDF

        // Regular expression to match email addresses
        preg_match_all("/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/", $text, $matches);
        return $matches[0]; // Returns an array of email addresses
    }
}
?>

This class uses the Smalot/PdfParser library to extract text from PDF files, allowing us to find email addresses within.

4. Create the Main Extraction Logic in `index.php`

The index.php file will serve as the main interface for user input and processing. Here’s the complete implementation:

<?php
require 'db.php'; // Include database connection
require 'extractors/HtmlExtractor.php'; // Include HTML extractor
require 'extractors/JsExtractor.php'; // Include JS extractor
require 'extractors/PdfExtractor.php'; // Include PDF extractor

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $url = trim($_POST['url'] ?? '');
    $contentType = $_POST['content_type'] ?? ''; // Get the selected content type
    $emails = []; // Initialize an array to store extracted emails

    // Accept only http(s) URLs so file_get_contents() cannot be pointed at local files
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (filter_var($url, FILTER_VALIDATE_URL) === false || !in_array($scheme, ['http', 'https'], true)) {
        echo "Please enter a valid http(s) URL.";
        exit;
    }

    // Determine the appropriate extractor based on content type
    switch ($contentType) {
        case 'static_html':
            $html = file_get_contents($url); // Fetch HTML content
            if ($html === false) {
                echo "Could not fetch the URL.";
                exit;
            }
            $extractor = new HtmlExtractor(); // Create an instance of HtmlExtractor
            $emails = $extractor->extract($html); // Extract emails
            break;

        case 'js_rendered':
            $extractor = new JsExtractor(); // Create an instance of JsExtractor
            $emails = $extractor->extract($url); // Extract emails from JS-rendered content
            break;

        case 'pdf':
            // Assuming the PDF file is accessible via URL, we download it first
            $pdfData = file_get_contents($url);
            if ($pdfData === false) {
                echo "Could not download the PDF.";
                exit;
            }
            $tempFile = tempnam(sys_get_temp_dir(), 'pdf_'); // Create a temporary file
            try {
                file_put_contents($tempFile, $pdfData); // Save the PDF locally
                $extractor = new PdfExtractor(); // Create an instance of PdfExtractor
                $emails = $extractor->extract($tempFile); // Extract emails
            } finally {
                unlink($tempFile); // Delete the temporary file even if parsing fails
            }
            break;

        default:
            echo "Unsupported content type.";
            exit;
    }

    $emails = array_values(array_unique($emails)); // Drop duplicates found in the same source

    // Insert emails into database: prepare the statement once, execute it per email
    $stmt = $mysqli->prepare("INSERT IGNORE INTO emails (email_address) VALUES (?)");
    foreach ($emails as $email) {
        $stmt->bind_param("s", $email); // Bind the email parameter
        $stmt->execute(); // Execute the statement
    }
    $stmt->close();

    // Escape before output, since the addresses come from untrusted pages
    echo "Extracted emails: " . htmlspecialchars(implode(", ", $emails), ENT_QUOTES, 'UTF-8');
}
?>

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Email Extractor</title>
    <style>
        body { font-family: Arial, sans-serif; }
        form { margin: 20px; }
        input, select { margin-bottom: 10px; }
    </style>
</head>
<body>
    <h1>Email Extraction Tool</h1>
    <form method="post" action="">
        <label for="url">Enter URL:</label>
        <input type="url" id="url" name="url" required>
        
        <label for="content_type">Select Content Type:</label>
        <select id="content_type" name="content_type">
            <option value="static_html">Static HTML</option>
            <option value="js_rendered">JavaScript Rendered</option>
            <option value="pdf">PDF</option>
        </select>
        
        <button type="submit">Extract Emails</button>
    </form>
</body>
</html>

Breakdown of `index.php`

Form Handling: The form captures user input for the URL and content type. When submitted, it triggers a POST request to extract emails.
Content Type Logic: Based on the selected content type, the appropriate extractor class is instantiated. For PDFs, we download the file temporarily to process it.
Database Insertion: Extracted emails are inserted into the database using a prepared statement, which helps prevent SQL injection.
User Feedback: The tool displays the extracted email addresses to the user.
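
The form asks the user to pick the content type, but part of that choice could be automated. The sketch below is one hedged way to do it: guessContentType is a hypothetical helper that inspects a Content-Type header value and the start of the response body, both assumed to have been fetched already (for example with cURL). It cannot tell static HTML from JavaScript-rendered pages, since both are served as text/html.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: guess which extractor to use from the HTTP
 * Content-Type header and the beginning of the response body.
 * Returns 'pdf', 'static_html', or 'unknown'.
 */
function guessContentType(string $contentTypeHeader, string $body): string
{
    $header = strtolower($contentTypeHeader);

    // PDF files begin with the "%PDF-" signature, whatever the server claims
    if (strpos($header, 'application/pdf') !== false || strncmp($body, '%PDF-', 5) === 0) {
        return 'pdf';
    }

    // Fall back to a simple body check when the header is missing or generic
    if (strpos($header, 'text/html') !== false || stripos($body, '<html') !== false) {
        return 'static_html';
    }

    return 'unknown';
}
```

One possible design on top of this is to try HtmlExtractor first for 'static_html' results and fall back to JsExtractor only when no addresses are found, which avoids starting a browser for pages that do not need one.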

Conclusion

In this blog post, we built an email extraction tool that combines the techniques from earlier in the series. Using PHP and MySQL, the application handles static HTML, JavaScript-rendered content, and PDF files through one form, and stores every address it finds in a single table.