
Email Validation in Java: Ensuring Accuracy in Scraped Data

Introduction

When scraping emails from the web, you’ll often encounter invalid or malformed email addresses. Some scraped data may contain fake, incomplete, or improperly formatted emails, which can lead to issues when trying to use them for further applications like email marketing or analysis.

In this blog, we will explore how to validate scraped email addresses in Java to ensure their accuracy and quality. By implementing proper validation techniques, you can filter out invalid emails and maintain a high-quality dataset.

We will cover:

  • Basic email format validation using regular expressions.
  • Advanced validation with the JavaMail API for domain-level checks.
  • Implementing email deduplication to avoid multiple instances of the same email.

Step 1: Why Email Validation is Important

Email validation helps you:

  • Avoid fake or mistyped emails that won’t deliver.
  • Ensure proper communication with valid contacts.
  • Optimize marketing efforts by reducing bounces and spam complaints.
  • Maintain clean databases with accurate and unique email addresses.

Step 2: Basic Email Format Validation Using Regular Expressions

The first step in email validation is checking if the email has a valid format. This can be done using regular expressions (regex), which provide a way to define a pattern that valid emails must follow.

A basic regex pattern for email validation in Java can look like this:

import java.util.regex.Pattern;

public class EmailValidator {

    private static final String EMAIL_REGEX = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}$";

    // Compile the pattern once and reuse it across calls
    private static final Pattern EMAIL_PATTERN = Pattern.compile(EMAIL_REGEX);

    public static boolean isValidEmail(String email) {
        return email != null && EMAIL_PATTERN.matcher(email).matches();
    }

    public static void main(String[] args) {
        String[] emails = {"[email protected]", "invalid-email", "user@domain", "[email protected]"};
        
        for (String email : emails) {
            System.out.println(email + " is valid: " + isValidEmail(email));
        }
    }
}
Code Breakdown:
  • The EMAIL_REGEX is used to define the pattern of a valid email address. It checks for:
    • Alphanumeric characters, underscores, dots, and percentage signs before the @ symbol.
    • A valid domain name after the @ symbol, with a top-level domain (TLD) of two to six characters (e.g., .com, .org). Note that some newer TLDs are longer than six characters; widen the quantifier to {2,} if you need to accept them.
  • The isValidEmail() method returns true if the email matches the pattern, otherwise false.
Example Output:
user@example.com is valid: true
invalid-email is valid: false
user@domain is valid: false
first.last@sub.example.org is valid: true

This basic approach filters out emails that don’t meet common formatting rules, but it won’t detect whether the domain exists or if the email is actually deliverable.

Step 3: Advanced Email Validation Using JavaMail API

To perform more advanced validation, we can use the JavaMail API to verify that an address is syntactically valid, and then check whether its domain actually exists. Strictly speaking, JavaMail handles the parsing and format validation; the domain check is done with standard Java DNS lookups (and, for a true MX-record test, the JNDI-based example shown at the end of this step).

Setting Up JavaMail

First, add the following dependencies to your Maven pom.xml:

<dependencies>
    <dependency>
        <groupId>javax.mail</groupId>
        <artifactId>javax.mail-api</artifactId>
        <version>1.6.2</version>
    </dependency>
    <dependency>
        <groupId>com.sun.mail</groupId>
        <artifactId>javax.mail</artifactId>
        <version>1.6.2</version>
    </dependency>
</dependencies>
Domain-Level Email Validation

Here’s how you can validate email addresses at the domain level using JavaMail:

import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;
import java.net.InetAddress;
import java.util.Arrays;

public class AdvancedEmailValidator {

    public static boolean isValidEmailAddress(String email) {
        try {
            // Check the email format
            InternetAddress emailAddress = new InternetAddress(email);
            emailAddress.validate();

            // Extract the domain and check that it resolves in DNS
            String domain = email.substring(email.indexOf("@") + 1);
            return domainResolves(domain);
        } catch (AddressException ex) {
            return false;
        }
    }

    // Quick sanity check: resolves the domain's A/AAAA records. This is not a
    // true MX lookup (see the JNDI-based example below for that).
    public static boolean domainResolves(String domain) {
        try {
            InetAddress[] addresses = InetAddress.getAllByName(domain);
            return addresses.length > 0;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] emails = {"[email protected]", "[email protected]", "[email protected]"};
        
        Arrays.stream(emails).forEach(email -> {
            boolean isValid = isValidEmailAddress(email);
            System.out.println(email + " is valid: " + isValid);
        });
    }
}
Code Breakdown:
  • We use InternetAddress from the JavaMail API to validate the format of the email address; validate() throws an AddressException for malformed addresses.
  • The domainResolves() method performs a DNS lookup to confirm the email’s domain resolves to an IP address. This filters out nonexistent domains, but note that it checks A/AAAA records rather than MX records, so it doesn’t prove the domain runs a mail server; see the MX lookup below for that.
Example Output:
user@gmail.com is valid: true
user@thisdomaindoesnotexist12345.com is valid: false
contact@example.org is valid: true
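
The domainResolves() check above only proves the domain exists in DNS. If you want a true MX-record lookup, the JDK’s built-in JNDI DNS provider can query MX records without any extra dependency. Here is a minimal sketch; note that com.sun.jndi.dns.DnsContextFactory is a long-standing JDK-internal provider rather than a formal public API, and the sample domains are illustrations only:

import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;
import java.util.Hashtable;

public class MxRecordChecker {

    // Returns true if the domain publishes at least one MX record
    public static boolean hasMxRecord(String domain) {
        try {
            Hashtable<String, String> env = new Hashtable<>();
            // Use the JDK's built-in DNS provider for JNDI lookups
            env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
            InitialDirContext ctx = new InitialDirContext(env);

            // Query only the MX records for this domain
            Attributes attrs = ctx.getAttributes(domain, new String[] {"MX"});
            Attribute mx = attrs.get("MX");
            return mx != null && mx.size() > 0;
        } catch (Exception e) {
            // NameNotFoundException, timeouts, etc. all mean no usable MX record
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("gmail.com: " + hasMxRecord("gmail.com"));
        System.out.println("thisdomaindoesnotexist12345.com: "
                + hasMxRecord("thisdomaindoesnotexist12345.com"));
    }
}

Keep in mind that a domain with no MX record can, per RFC 5321, still accept mail via its A record, so treat a missing MX as a strong negative signal rather than absolute proof.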

Step 4: Handling Email Deduplication

After scraping and validating emails, you may end up with multiple instances of the same email address. To avoid this, you need to implement deduplication, ensuring each email is only stored once.

Here’s an approach using a Set to remove duplicates:

import java.util.HashSet;
import java.util.Set;

public class EmailDeduplication {

    public static void main(String[] args) {
        Set<String> emailSet = new HashSet<>();

        String[] emails = {"alice@example.com", "bob@example.com", "alice@example.com", "carol@example.com"};

        for (String email : emails) {
            if (emailSet.add(email)) {
                System.out.println("Added: " + email);
            } else {
                System.out.println("Duplicate: " + email);
            }
        }
    }
}
Code Breakdown:
  • A HashSet automatically removes duplicates because sets do not allow duplicate elements.
  • The add() method returns false if the email is already present in the set, allowing you to identify and handle duplicates.
Example Output:
Added: alice@example.com
Added: bob@example.com
Duplicate: alice@example.com
Added: carol@example.com

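One caveat: a HashSet compares raw strings, so User@Example.com and user@example.com would both be kept. Strictly, RFC 5321 allows a case-sensitive local part, but virtually all providers treat addresses case-insensitively, so in practice it is useful to normalize before deduplicating. A minimal sketch of that idea:

import java.util.LinkedHashSet;
import java.util.Set;

public class NormalizedDeduplication {

    // Trim whitespace and lowercase so "User@Example.COM " and
    // "user@example.com" count as the same address
    static String normalize(String email) {
        return email.trim().toLowerCase();
    }

    public static void main(String[] args) {
        Set<String> seen = new LinkedHashSet<>(); // keeps first-seen order

        String[] emails = {"User@Example.com", " user@example.com", "admin@example.com"};

        for (String email : emails) {
            if (!seen.add(normalize(email))) {
                System.out.println("Duplicate after normalization: " + email.trim());
            }
        }
        System.out.println("Unique emails: " + seen);
    }
}
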
Step 5: Validating Scraped Emails in Practice

When validating scraped emails in your email scraping application, follow these steps:

  1. Extract emails from web pages using your scraping tool (e.g., Selenium, Jsoup).
  2. Use regex to filter out invalid email formats.
  3. Verify domains using the JavaMail API to ensure they can receive emails.
  4. Remove duplicates using sets or other deduplication methods.

By following this process, you can ensure that your email list is both accurate and unique, reducing bounce rates and improving the quality of your scraped data.
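
As a concrete illustration, here is a minimal sketch that folds the format check (Step 2) and deduplication (Step 4) into one cleaning pass. The domain check from Step 3 is deliberately left out, since it needs network access, but it can be applied to the surviving addresses afterwards:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class ScrapedEmailPipeline {

    private static final Pattern EMAIL_PATTERN =
            Pattern.compile("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}$");

    // Combines the format check (Step 2) and deduplication (Step 4) in one pass
    public static Set<String> cleanEmails(List<String> scraped) {
        Set<String> result = new LinkedHashSet<>(); // keeps first-seen order
        for (String raw : scraped) {
            String email = raw.trim().toLowerCase();
            if (EMAIL_PATTERN.matcher(email).matches()) {
                result.add(email); // the Set silently drops duplicates
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> scraped = Arrays.asList(
                "alice@example.com", "ALICE@example.com", "not-an-email", "bob@example.org");
        System.out.println(cleanEmails(scraped));
        // Prints: [alice@example.com, bob@example.org]
    }
}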

Conclusion

Email validation is a critical step when working with scraped data. In this blog, we covered:

  • Basic format validation with regular expressions.
  • Advanced domain validation using the JavaMail API to check for MX records.
  • Deduplication techniques to ensure unique emails.


How to Scrape Emails from Dynamic Websites with Java: Best Methods and Tools

Introduction

In the previous blogs, we explored how to scrape static web pages using Java and Jsoup. While Jsoup is an excellent tool for parsing HTML documents, it struggles with web pages that load content dynamically through JavaScript. Many modern websites rely heavily on JavaScript for displaying content, making traditional HTML parsing ineffective.

In this blog, we will look at how to scrape dynamic web pages in Java. To achieve this, we’ll explore Selenium, a powerful web automation tool, and show you how to use it for scraping dynamic content such as email addresses.

What Are Dynamic Web Pages?

Dynamic web pages load part or all of their content after the initial HTML page load. Instead of sending fully rendered HTML from the server, dynamic pages often rely on JavaScript to fetch data and render it on the client side.

Here’s an example of a typical dynamic page behavior:

  • The initial HTML page is loaded with placeholders or a basic structure.
  • JavaScript executes and fetches data asynchronously using AJAX (Asynchronous JavaScript and XML).
  • Content is dynamically injected into the DOM after the page has loaded.

Since Jsoup fetches only the static HTML (before JavaScript runs), it won’t capture this dynamic content. For these cases, we need a tool like Selenium that can interact with a fully rendered web page.

Step 1: Setting Up Selenium for Java

Selenium is a browser automation tool that allows you to interact with web pages just like a real user would. It executes JavaScript, loads dynamic content, and can simulate clicks, form submissions, and other interactions.

Installing Selenium

To use Selenium with Java, you need to:

  1. Install the Selenium WebDriver.
  2. Set up a browser driver (e.g., ChromeDriver for Chrome).

First, add the Selenium dependency to your Maven pom.xml:

<dependencies>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.0.0</version>
    </dependency>
</dependencies>

Next, download the appropriate browser driver. For example, if you are using Chrome, download ChromeDriver from the official ChromeDriver downloads page, making sure its version matches your installed Chrome.

Make sure the driver is placed in a directory that is accessible by your Java program. For instance, you can set its path in your system’s environment variables or specify it directly in your code.

Step 2: Writing a Basic Selenium Email Scraper

Now, let’s write a simple Selenium-based scraper to handle a dynamic web page.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DynamicEmailScraper {

    public static void main(String[] args) {
        // Set the path to your ChromeDriver executable
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");

        // Create a new instance of the Chrome driver
        WebDriver driver = new ChromeDriver();

        try {
            // Open the dynamic web page
            driver.get("https://example.com"); // Replace with your target URL

            // Wait for the page to load and dynamic content to be fully rendered
            Thread.sleep(5000); // Adjust this depending on page load time

            // Extract the page source after the JavaScript has executed
            String pageSource = driver.getPageSource();

            // Regular expression to find emails
            String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
            Pattern emailPattern = Pattern.compile(emailRegex);
            Matcher emailMatcher = emailPattern.matcher(pageSource);

            // Print out all found email addresses
            while (emailMatcher.find()) {
                System.out.println("Found email: " + emailMatcher.group());
            }

        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            // Close the browser
            driver.quit();
        }
    }
}
Code Breakdown:
  • We start by setting the path to ChromeDriver and creating an instance of ChromeDriver to control the Chrome browser.
  • The get() method is used to load the desired dynamic web page.
  • We use Thread.sleep() to wait for a few seconds, allowing time for the JavaScript to execute and the dynamic content to load. (For a better approach, consider using Selenium’s explicit waits to wait for specific elements instead of relying on Thread.sleep().)
  • Once the content is loaded, we retrieve the fully rendered HTML using getPageSource(), then search for emails using a regex pattern.

Step 3: Handling Dynamic Content with Explicit Waits

In real-world scenarios, using Thread.sleep() is not ideal as it makes the program wait unnecessarily. A better way to handle dynamic content is to use explicit waits, where Selenium waits for a specific condition to be met before proceeding.

Here’s an improved version of our scraper using WebDriverWait:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DynamicEmailScraperWithWaits {

    public static void main(String[] args) {
        // Set the path to your ChromeDriver executable
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");

        // Create a new instance of the Chrome driver
        WebDriver driver = new ChromeDriver();

        try {
            // Open the dynamic web page
            driver.get("https://example.com"); // Replace with your target URL

            // Create an explicit wait
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));

            // Wait until a specific element (e.g., a div with class 'contact-info') is visible.
            // The returned element confirms the dynamic content has rendered; the wait
            // itself is what gates the scraping below.
            WebElement contactDiv = wait.until(
                ExpectedConditions.visibilityOfElementLocated(By.className("contact-info"))
            );

            // Extract the page source after the dynamic content has loaded
            String pageSource = driver.getPageSource();

            // Regular expression to find emails
            String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
            Pattern emailPattern = Pattern.compile(emailRegex);
            Matcher emailMatcher = emailPattern.matcher(pageSource);

            // Print out all found email addresses
            while (emailMatcher.find()) {
                System.out.println("Found email: " + emailMatcher.group());
            }

        } finally {
            // Close the browser
            driver.quit();
        }
    }
}
How This Works:
  • We replaced Thread.sleep() with WebDriverWait to wait for a specific element (e.g., a div with the class contact-info) to be visible.
  • ExpectedConditions is used to wait until the element is available in the DOM. This ensures that the dynamic content is fully loaded before attempting to scrape the page.

Step 4: Extracting Emails from Specific Elements

Instead of searching the entire page source for emails, you might want to target specific sections where emails are more likely to appear. Here’s how to scrape emails from a particular element, such as a footer or contact section.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecificSectionEmailScraper {

    public static void main(String[] args) {
        // Set the path to your ChromeDriver executable
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");

        // Create a new instance of the Chrome driver
        WebDriver driver = new ChromeDriver();

        try {
            // Open the dynamic web page
            driver.get("https://example.com"); // Replace with your target URL

            // Locate a specific section (e.g., the footer). Note that findElement()
            // does not wait; combine it with WebDriverWait if the footer loads late.
            WebElement footer = driver.findElement(By.tagName("footer"));

            // Extract text from the footer
            String footerText = footer.getText();

            // Regular expression to find emails
            String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
            Pattern emailPattern = Pattern.compile(emailRegex);
            Matcher emailMatcher = emailPattern.matcher(footerText);

            // Print out all found email addresses in the footer
            while (emailMatcher.find()) {
                System.out.println("Found email: " + emailMatcher.group());
            }

        } finally {
            // Close the browser
            driver.quit();
        }
    }
}

Step 5: Handling AJAX Requests

Some websites load their content via AJAX requests. In these cases, you can use Selenium to wait for the AJAX call to complete before scraping the content. WebDriverWait can help detect when the AJAX call is done and the new content is available in the DOM.
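
As an illustration, the sketch below waits until an AJAX-rendered list contains at least one item before reading the page source. The .email-list .item selector is hypothetical; substitute the markup of your actual target page:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;

public class AjaxContentScraper {

    public static void main(String[] args) {
        // Set the path to your ChromeDriver executable
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
        WebDriver driver = new ChromeDriver();

        try {
            driver.get("https://example.com"); // Replace with your target URL

            // Block until the AJAX call has populated the list with at least one item.
            // ".email-list .item" is a placeholder selector for the dynamic content.
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
            wait.until(ExpectedConditions.numberOfElementsToBeMoreThan(
                    By.cssSelector(".email-list .item"), 0));

            // The DOM now contains the AJAX-loaded content; scrape it as before
            System.out.println("Rendered HTML length: " + driver.getPageSource().length());
        } finally {
            driver.quit();
        }
    }
}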

Conclusion

In this blog, we covered how to scrape dynamic web pages using Selenium in Java. We explored how Selenium handles JavaScript, loads dynamic content, and how you can extract email addresses from these pages. Key takeaways include:

  • Setting up Selenium for web scraping.
  • Using explicit waits to handle dynamic content.
  • Extracting emails from specific elements like footers or contact sections.

In the next blog, we’ll dive deeper into handling websites with anti-scraping mechanisms and how to bypass common challenges such as CAPTCHA and JavaScript-based blocking.


How to Extract Emails from Web Pages Using Jsoup in Java: A Step-by-Step Guide

Introduction

In our previous blog, we set up a Java environment for scraping emails and wrote a basic program to extract email addresses from a simple HTML page. Now, it’s time to dive deeper into the powerful Java library Jsoup, which makes web scraping easy and efficient.

In this blog, we will explore how to parse HTML pages using Jsoup to extract emails with more precision, handle various HTML structures, and manage different elements within a webpage.

What is Jsoup?

Jsoup is a popular Java library that allows you to manipulate HTML documents like a web browser does. With Jsoup, you can:

  • Fetch and parse HTML documents.
  • Extract and manipulate data, such as email addresses, from web pages.
  • Clean and sanitize user-submitted content to guard against malicious code (e.g., XSS).

Jsoup is ideal for scraping static HTML content and works well with websites that don’t require JavaScript rendering for their core content.

Step 1: Adding Jsoup to Your Project

Before we start coding, make sure you have added the Jsoup dependency to your Maven project. If you missed it in the previous blog, here’s the pom.xml configuration again:

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
</dependencies>

This will pull Jsoup into your project (it has no transitive dependencies of its own).

Step 2: Fetching and Parsing HTML Documents

Let’s start by writing a basic program to fetch and parse a webpage’s HTML content using Jsoup. We’ll expand this to handle multiple elements and extract emails from different parts of the webpage.

Basic HTML Parsing with Jsoup

Here’s a simple example that demonstrates how to fetch a web page and display its title and body text:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class BasicHtmlParser {

    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Print the page title
            String title = doc.title();
            System.out.println("Title: " + title);

            // Print the body text of the page
            String bodyText = doc.body().text();
            System.out.println("Body Text: " + bodyText);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This example shows how to use Jsoup’s connect() method to fetch a web page and extract the title and body text. Now, we can use this HTML content to extract emails.

Step 3: Extracting Emails from Parsed HTML

Once the HTML is parsed, we can apply regular expressions (regex) to locate email addresses within the HTML content. Let’s modify our example to include email extraction.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {

    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Extract the body text of the page
            String bodyText = doc.body().text();

            // Regular expression for finding email addresses
            String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
            Pattern emailPattern = Pattern.compile(emailRegex);
            Matcher emailMatcher = emailPattern.matcher(bodyText);

            // Print all found emails
            while (emailMatcher.find()) {
                System.out.println("Found email: " + emailMatcher.group());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Here, we fetch the web page, extract the body text, and then apply a regex pattern to find email addresses. This method works well for simple static web pages, but we can enhance it to target more specific sections of the HTML document.

Step 4: Targeting Specific HTML Elements

Instead of scanning the entire page, you may want to scrape emails from specific sections, such as the footer or contact information section. Jsoup allows you to select specific HTML elements using CSS-like selectors.

Selecting Elements with Jsoup

Let’s say you want to scrape emails only from a <div> with a class contact-info. Here’s how you can do it:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecificElementEmailScraper {

    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Select the specific div with class 'contact-info'
            Elements contactSections = doc.select("div.contact-info");

            // Iterate through selected elements and search for emails
            for (Element section : contactSections) {
                String sectionText = section.text();

                // Regular expression for finding email addresses
                String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
                Pattern emailPattern = Pattern.compile(emailRegex);
                Matcher emailMatcher = emailPattern.matcher(sectionText);

                // Print all found emails in the section
                while (emailMatcher.find()) {
                    System.out.println("Found email: " + emailMatcher.group());
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In this example, we use Jsoup’s select() method with a CSS selector to target the specific <div> element containing the contact information. This helps narrow down the search, making email extraction more precise.

Step 5: Handling Multiple Elements and Pages

Sometimes, you need to scrape multiple sections or pages. For instance, if you’re scraping a website with paginated contact listings, you can use Jsoup to extract emails from all those pages by looping through them or following links.

Here’s an approach to scraping emails from multiple pages:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiPageEmailScraper {

    public static void main(String[] args) {
        String baseUrl = "https://example.com/page/"; // Base URL for paginated pages

        // Loop through the first 5 pages
        for (int i = 1; i <= 5; i++) {
            String url = baseUrl + i;

            try {
                // Fetch each page
                Document doc = Jsoup.connect(url).get();

                // Select the contact-info div on the page
                Elements contactSections = doc.select("div.contact-info");

                for (Element section : contactSections) {
                    String sectionText = section.text();

                    // Regular expression for finding email addresses
                    String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
                    Pattern emailPattern = Pattern.compile(emailRegex);
                    Matcher emailMatcher = emailPattern.matcher(sectionText);

                    while (emailMatcher.find()) {
                        System.out.println("Found email: " + emailMatcher.group());
                    }
                }

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

This code example shows how to scrape emails from multiple pages by dynamically changing the URL for each page. The number of pages can be adjusted based on your target site’s pagination.
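
When the number of pages isn’t known in advance, an alternative is to follow the site’s “next” link until it disappears, pausing between requests to stay polite. In the sketch below, the a.next selector and the starting URL are placeholders; inspect your target site’s pagination markup for the real values:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class FollowPaginationScraper {

    public static void main(String[] args) throws IOException, InterruptedException {
        String url = "https://example.com/page/1"; // Starting page (placeholder)

        while (url != null && !url.isEmpty()) {
            // Identify yourself and bound the request time
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; EmailScraperDemo/1.0)")
                    .timeout(10_000)
                    .get();

            // ... extract emails from doc exactly as in the earlier examples ...
            System.out.println("Scraped: " + url);

            // Follow the "next page" link if one exists; stop otherwise
            Element next = doc.selectFirst("a.next"); // placeholder selector
            url = (next != null) ? next.absUrl("href") : null;

            Thread.sleep(1000); // be polite: pause between requests
        }
    }
}

Also check the site’s robots.txt and terms of use before crawling many pages in a row.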

Conclusion

In this blog, we explored how to use Jsoup to parse HTML documents and extract email addresses. We learned how to:

  • Fetch and parse web pages using Jsoup.
  • Target specific HTML elements using CSS selectors.
  • Apply regular expressions to extract email addresses.
  • Scrape emails from multiple pages.

In the next blog, we’ll look at how to handle dynamic web pages that use JavaScript to load content and how to scrape them effectively using Java.


Introduction to Email Scraping with Java: Setting Up Your Environment

Introduction

In today’s digital age, email scraping has become an essential tool for gathering contact information from the web for business and marketing purposes. In this blog series, we’ll explore how to implement email scraping using Java. We’ll start by setting up your environment and going over the essential tools you’ll need to build a powerful email scraper.

By the end of this post, you’ll have your Java environment ready for scraping emails from websites. Let’s dive into the basics of email scraping and how to set up your project for success.

What is Email Scraping?

Email scraping refers to the automated extraction of email addresses from websites or documents. It is a key technique for gathering contact information for lead generation, email marketing, or data collection purposes. However, it’s important to ensure compliance with legal frameworks like the GDPR when scraping emails to avoid breaching privacy regulations.

Tools and Libraries You’ll Need

Before we begin writing code, let’s go over the tools and libraries you’ll need for this project:

  1. Java Development Kit (JDK): We’ll use Java for this project, so you need to have the JDK installed on your system. You can download the latest version from the Oracle JDK website.
  2. IDE (Integrated Development Environment): While you can use any text editor, an IDE like IntelliJ IDEA or Eclipse will make development easier. IntelliJ IDEA is highly recommended due to its rich features and built-in support for Java.
  3. Maven or Gradle: These build tools are widely used for managing dependencies and project builds. We’ll use Maven in this example, but you can also use Gradle if that’s your preference.
  4. Jsoup Library: Jsoup is a popular Java library for parsing HTML documents. It allows you to extract and manipulate data from web pages easily. You can include Jsoup as a Maven dependency in your project (we’ll show you how below).
  5. Selenium (optional): Selenium allows you to interact with dynamic web pages (those that use JavaScript to load content). You might need it in more advanced scraping scenarios where basic HTML parsing doesn’t suffice.

Step 1: Setting Up Your Java Development Environment

To get started, you’ll need to ensure that your system is set up to run Java programs.

  1. Install the JDK
    Download and install the JDK from the Oracle website. Follow the installation instructions for your OS (Windows, Mac, Linux). After installation, check that Java is correctly installed by running this command in the terminal or command prompt:

java -version

    You should see a version number confirming that Java is installed.
  2. Set Up Your IDE
    Download and install IntelliJ IDEA or Eclipse. These IDEs provide excellent support for Java development. Once installed, create a new Java project to begin working on your email scraper.

Step 2: Setting Up Maven and Adding Dependencies

We’ll use Maven to manage our project’s dependencies, such as the Jsoup library. If you don’t have Maven installed, you can download it from the official Maven website and follow the setup instructions.

Once you’ve set up Maven, create a new Maven project in your IDE. In the pom.xml file, add the following dependency for Jsoup:

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
</dependencies>

This will allow you to use Jsoup in your project to parse HTML documents and extract emails.

Step 3: Writing a Basic Email Scraping Program

With your environment set up, let’s write a basic Java program that scrapes a web page for email addresses.

  1. Create a Java Class
    Create a new class EmailScraper.java in your project. This class will contain the logic to scrape email addresses.
  2. Parsing a Web Page with Jsoup
    Now let’s write some code to scrape emails. In this example, we’ll scrape a basic HTML page and search for any email addresses within the content.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailScraper {

    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        try {
            // Fetch the HTML document from the URL
            Document doc = Jsoup.connect(url).get();
            String htmlContent = doc.text();

            // Regular expression to find emails
            String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
            Pattern emailPattern = Pattern.compile(emailRegex);
            Matcher emailMatcher = emailPattern.matcher(htmlContent);

            // Print all the emails found
            while (emailMatcher.find()) {
                System.out.println("Found email: " + emailMatcher.group());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Code Explanation

  • We use Jsoup to connect to the website and fetch the HTML content.
  • Regex is used to search for email patterns in the text. The regular expression we use matches most common email formats.
  • Finally, we print out all the emails found on the page.
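
Regex over the visible text catches addresses written in plain sight, but many pages also expose addresses as mailto: links. As a complementary sketch, Jsoup’s CSS selectors can pull those out directly (the URL is a placeholder as before):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class MailtoLinkExtractor {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://example.com").get(); // Replace with your target URL

        // Select anchor tags whose href starts with "mailto:"
        Elements mailtoLinks = doc.select("a[href^=mailto:]");

        for (Element link : mailtoLinks) {
            // Strip the "mailto:" scheme and any ?subject=... suffix
            String href = link.attr("href").substring("mailto:".length());
            String email = href.split("\\?")[0];
            System.out.println("Found email: " + email);
        }
    }
}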

Step 4: Running the Program

You can now run your EmailScraper.java class to test if it scrapes emails from the given web page. If the page contains any valid email addresses, they will be printed in the console.

Conclusion

In this first post of the series, we’ve covered the basics of setting up a Java environment for email scraping, introduced key libraries like Jsoup, and written a simple program to extract emails from a web page. In the next blog, we’ll dive deeper into handling more complex websites and parsing their dynamic content.