
How to Extract Emails from Web Pages Using Jsoup in Java: A Step-by-Step Guide

Introduction

In our previous blog, we set up a Java environment for scraping emails and wrote a basic program to extract email addresses from a simple HTML page. Now, it’s time to dive deeper into the powerful Java library Jsoup, which makes web scraping easy and efficient.

In this blog, we will explore how to parse HTML pages using Jsoup to extract emails with more precision, handle various HTML structures, and manage different elements within a webpage.

What is Jsoup?

Jsoup is a popular Java library that parses HTML into the same kind of document model a web browser builds, letting you traverse and manipulate it with DOM methods and CSS-style selectors. With Jsoup, you can:

  • Fetch and parse HTML documents.
  • Extract and manipulate data, such as email addresses, from web pages.
  • Clean and sanitize user-submitted content to guard against malicious code such as cross-site scripting (XSS).

Jsoup is ideal for static HTML content scraping and works well with websites that don’t require JavaScript rendering for the core content.

Step 1: Adding Jsoup to Your Project

Before we start coding, make sure you have added the Jsoup dependency to your Maven project. If you missed it in the previous blog, here’s the pom.xml configuration again:

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
</dependencies>

This will pull Jsoup into your project; the library is self-contained, so no additional transitive dependencies are needed.
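If your project uses Gradle instead of Maven, the equivalent declaration (assuming the same version as the Maven snippet above) would be:

```groovy
dependencies {
    implementation 'org.jsoup:jsoup:1.14.3'
}
```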

Step 2: Fetching and Parsing HTML Documents

Let’s start by writing a basic program to fetch and parse a webpage’s HTML content using Jsoup. We’ll expand this to handle multiple elements and extract emails from different parts of the webpage.

Basic HTML Parsing with Jsoup

Here’s a simple example that demonstrates how to fetch a web page and display its title and body text:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class BasicHtmlParser {

    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Print the page title
            String title = doc.title();
            System.out.println("Title: " + title);

            // Print the body text of the page
            String bodyText = doc.body().text();
            System.out.println("Body Text: " + bodyText);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This example shows how to use Jsoup’s connect() method to fetch a web page and extract the title and body text. Now, we can use this HTML content to extract emails.
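Keep in mind that connect() performs a real network request. While you are experimenting with extraction logic, Jsoup can also parse HTML you already have in memory via Jsoup.parse(), which runs entirely offline. A small sketch using a made-up snippet (the class name OfflineParseDemo is ours, not part of the series):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class OfflineParseDemo {

    // Parse an in-memory HTML string instead of fetching over the network
    public static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo Page</title></head>"
                + "<body><p>Contact: info@example.com</p></body></html>";
        System.out.println("Title: " + titleOf(html));
    }
}
```

The same Document API (title(), body(), select(), and so on) works identically whether the document came from connect() or parse(), so logic developed this way carries over unchanged.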

Step 3: Extracting Emails from Parsed HTML

Once the HTML is parsed, we can apply regular expressions (regex) to locate email addresses within the HTML content. Let’s modify our example to include email extraction.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {

    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Extract the body text of the page
            String bodyText = doc.body().text();

            // Regular expression for finding email addresses
            String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
            Pattern emailPattern = Pattern.compile(emailRegex);
            Matcher emailMatcher = emailPattern.matcher(bodyText);

            // Print all found emails
            while (emailMatcher.find()) {
                System.out.println("Found email: " + emailMatcher.group());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Here, we fetch the web page, extract the body text, and then apply a regex pattern to find email addresses. This method works well for simple static web pages, but we can enhance it to target more specific sections of the HTML document.
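Since the regex step is plain java.util.regex and independent of Jsoup, it is worth factoring into a reusable helper. A minimal sketch (the method name extractEmails is our own; note also that the {2,6} bound on the top-level domain will miss longer TLDs such as .technology — widen it to {2,} if your targets need that):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexHelper {

    // Compile once and reuse; Pattern instances are thread-safe
    private static final Pattern EMAIL_PATTERN =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

    // Return every email-looking substring found in the text, in order of appearance
    public static List<String> extractEmails(String text) {
        List<String> emails = new ArrayList<>();
        Matcher matcher = EMAIL_PATTERN.matcher(text);
        while (matcher.find()) {
            emails.add(matcher.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        String sample = "Reach us at sales@example.com or support@example.org.";
        System.out.println(extractEmails(sample)); // [sales@example.com, support@example.org]
    }
}
```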

Step 4: Targeting Specific HTML Elements

Instead of scanning the entire page, you may want to scrape emails from specific sections, such as the footer or contact information section. Jsoup allows you to select specific HTML elements using CSS-like selectors.

Selecting Elements with Jsoup

Let’s say you want to scrape emails only from a <div> with a class contact-info. Here’s how you can do it:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecificElementEmailScraper {

    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Select the specific div with class 'contact-info'
            Elements contactSections = doc.select("div.contact-info");

            // Compile the email regex once, then scan each selected section
            Pattern emailPattern = Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

            for (Element section : contactSections) {
                Matcher emailMatcher = emailPattern.matcher(section.text());

                // Print all found emails in the section
                while (emailMatcher.find()) {
                    System.out.println("Found email: " + emailMatcher.group());
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In this example, we use Jsoup’s select() method with a CSS selector to target the specific <div> element containing the contact information. This helps narrow down the search, making email extraction more precise.
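Contact emails are also frequently published as mailto: links, which the body-text approach misses because the address lives in an href attribute rather than the visible text. One way to pick those up is Jsoup's attribute-prefix selector, a[href^=mailto:]. A sketch against an inline HTML snippet so it runs offline (the class and method names are our own):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class MailtoLinkScraper {

    // Collect addresses from <a href="mailto:..."> links in the document
    public static List<String> mailtoEmails(Document doc) {
        List<String> emails = new ArrayList<>();
        for (Element link : doc.select("a[href^=mailto:]")) {
            // Strip the "mailto:" scheme and any ?subject=... suffix
            String href = link.attr("href").substring("mailto:".length());
            int query = href.indexOf('?');
            emails.add(query >= 0 ? href.substring(0, query) : href);
        }
        return emails;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div class='contact-info'><a href='mailto:team@example.com?subject=Hi'>Email us</a></div>");
        System.out.println(mailtoEmails(doc)); // [team@example.com]
    }
}
```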

Step 5: Handling Multiple Elements and Pages

Sometimes, you need to scrape multiple sections or pages. For instance, if you’re scraping a website with paginated contact listings, you can use Jsoup to extract emails from all those pages by looping through them or following links.

Here’s an approach to scraping emails from multiple pages:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiPageEmailScraper {

    public static void main(String[] args) {
        String baseUrl = "https://example.com/page/"; // Base URL for paginated pages

        // Compile the email regex once, outside the page loop
        Pattern emailPattern = Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

        // Loop through the first 5 pages
        for (int i = 1; i <= 5; i++) {
            String url = baseUrl + i;

            try {
                // Fetch each page
                Document doc = Jsoup.connect(url).get();

                // Select the contact-info div on the page
                Elements contactSections = doc.select("div.contact-info");

                for (Element section : contactSections) {
                    Matcher emailMatcher = emailPattern.matcher(section.text());

                    while (emailMatcher.find()) {
                        System.out.println("Found email: " + emailMatcher.group());
                    }
                }

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

This code example shows how to scrape emails from multiple pages by dynamically changing the URL for each page. The number of pages can be adjusted based on your target site's pagination. When looping over many pages, it is good practice to pause briefly between requests and to respect the site's robots.txt and terms of service.
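When the same address appears on several pages, printing matches as you find them produces duplicates. A common refinement, independent of how the pages are fetched, is to accumulate everything into a Set and report it once at the end. A sketch over already-fetched page text (the sample strings stand in for the section text Jsoup would return):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeduplicatedEmailCollector {

    private static final Pattern EMAIL_PATTERN =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

    // Collect unique emails across many page texts, preserving first-seen order
    public static Set<String> collect(List<String> pageTexts) {
        Set<String> unique = new LinkedHashSet<>();
        for (String text : pageTexts) {
            Matcher m = EMAIL_PATTERN.matcher(text);
            while (m.find()) {
                unique.add(m.group());
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
                "Page 1: contact sales@example.com",
                "Page 2: contact sales@example.com or hr@example.com");
        System.out.println(collect(pages)); // [sales@example.com, hr@example.com]
    }
}
```

A LinkedHashSet keeps insertion order, which makes the final report stable across runs; a plain HashSet would deduplicate just as well if order does not matter.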

Conclusion

In this blog, we explored how to use Jsoup to parse HTML documents and extract email addresses. We learned how to:

  • Fetch and parse web pages using Jsoup.
  • Target specific HTML elements using CSS selectors.
  • Apply regular expressions to extract email addresses.
  • Scrape emails from multiple pages.

In the next blog, we’ll look at how to handle dynamic web pages that use JavaScript to load content and how to scrape them effectively using Java.
