How to Extract Emails from Web Pages Using Jsoup in Java: A Step-by-Step Guide
Introduction
In our previous blog, we set up a Java environment for scraping emails and wrote a basic program to extract email addresses from a simple HTML page. Now, it’s time to dive deeper into the powerful Java library Jsoup, which makes web scraping easy and efficient.
In this blog, we will explore how to parse HTML pages using Jsoup to extract emails with more precision, handle various HTML structures, and manage different elements within a webpage.
What is Jsoup?
Jsoup is a popular open-source Java library for working with real-world HTML. It parses a page into a DOM, much like a browser does. With Jsoup, you can:
- Fetch and parse HTML documents.
- Extract and manipulate data, such as email addresses, from web pages.
- Clean and sanitize user-submitted content to protect against cross-site scripting (XSS) and other malicious markup.
Jsoup is ideal for static HTML content scraping and works well with websites that don’t require JavaScript rendering for the core content.
Step 1: Adding Jsoup to Your Project
Before we start coding, make sure you have added the Jsoup dependency to your Maven project. If you missed it in the previous blog, here’s the pom.xml configuration again:
<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
</dependencies>
This will pull Jsoup into your project.
Step 2: Fetching and Parsing HTML Documents
Let’s start by writing a basic program to fetch and parse a webpage’s HTML content using Jsoup. We’ll expand this to handle multiple elements and extract emails from different parts of the webpage.
Basic HTML Parsing with Jsoup
Here’s a simple example that demonstrates how to fetch a web page and display its title and body text:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class BasicHtmlParser {
    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL
        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Print the page title
            String title = doc.title();
            System.out.println("Title: " + title);

            // Print the body text of the page
            String bodyText = doc.body().text();
            System.out.println("Body Text: " + bodyText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This example shows how to use Jsoup’s connect() method to fetch a web page and extract the title and body text. Now we can search this HTML content for emails.
Step 3: Extracting Emails from Parsed HTML
Once the HTML is parsed, we can apply regular expressions (regex) to locate email addresses within the HTML content. Let’s modify our example to include email extraction.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL
        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Extract the body text of the page
            String bodyText = doc.body().text();

            // Regular expression for finding email addresses
            String emailRegex = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}";
            Pattern emailPattern = Pattern.compile(emailRegex);
            Matcher emailMatcher = emailPattern.matcher(bodyText);

            // Print all found emails
            while (emailMatcher.find()) {
                System.out.println("Found email: " + emailMatcher.group());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Here, we fetch the web page, extract the body text, and then apply a regex pattern to find email addresses. This works well for simple static pages, though note that the {2,6} bound on the top-level domain will miss longer TLDs such as .technology; widen it to {2,} if you need those. We can also enhance the approach to target more specific sections of the HTML document.
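A page often repeats the same address several times, so printing every match produces duplicates. Building on the pattern above, here is a small, self-contained sketch (using a hard-coded string in place of the fetched body text) that collects matches into a LinkedHashSet so each address is reported once, in first-seen order:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueEmailDemo {
    // Same pattern as in the scraper above
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

    public static Set<String> extractUnique(String text) {
        Set<String> emails = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        // Stand-in for doc.body().text() from the scraper above
        String bodyText = "Contact sales@example.com or support@example.com. "
                + "Again: sales@example.com";
        System.out.println(extractUnique(bodyText));
        // prints [sales@example.com, support@example.com]
    }
}
```

In the full scraper, you would pass doc.body().text() to extractUnique() instead of the hard-coded string.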
Step 4: Targeting Specific HTML Elements
Instead of scanning the entire page, you may want to scrape emails from specific sections, such as the footer or contact information section. Jsoup allows you to select specific HTML elements using CSS-like selectors.
Selecting Elements with Jsoup
Let’s say you want to scrape emails only from a <div> with the class contact-info. Here’s how you can do it:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecificElementEmailScraper {
    public static void main(String[] args) {
        String url = "https://example.com"; // Replace with your target URL

        // Regular expression for finding email addresses (compiled once, reused per section)
        Pattern emailPattern = Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

        try {
            // Fetch the HTML document
            Document doc = Jsoup.connect(url).get();

            // Select the specific divs with class 'contact-info'
            Elements contactSections = doc.select("div.contact-info");

            // Iterate through selected elements and search for emails
            for (Element section : contactSections) {
                String sectionText = section.text();
                Matcher emailMatcher = emailPattern.matcher(sectionText);

                // Print all found emails in the section
                while (emailMatcher.find()) {
                    System.out.println("Found email: " + emailMatcher.group());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In this example, we use Jsoup’s select() method with a CSS selector to target the specific <div> elements containing the contact information. This narrows down the search, making email extraction more precise.
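One caveat: section.text() returns only the visible text, so an address that appears solely inside an href="mailto:..." attribute will be missed. The sketch below (the HTML string is a stand-in for what section.html() would return; the class name and address are made up for illustration) pulls such addresses out with a second regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MailtoExtractor {
    // Captures the address that follows "mailto:" in the raw HTML
    private static final Pattern MAILTO =
            Pattern.compile("mailto:([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6})");

    public static List<String> extractMailto(String html) {
        List<String> emails = new ArrayList<>();
        Matcher m = MAILTO.matcher(html);
        while (m.find()) {
            emails.add(m.group(1)); // group 1 is just the address, without "mailto:"
        }
        return emails;
    }

    public static void main(String[] args) {
        // Stand-in for the raw HTML of a contact-info section
        String html = "<div class=\"contact-info\">"
                + "<a href=\"mailto:info@example.com\">Email us</a></div>";
        System.out.println(extractMailto(html));
        // prints [info@example.com]
    }
}
```

With Jsoup itself you can achieve the same thing without regex by selecting a[href^=mailto:] and reading each element’s href attribute.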
Step 5: Handling Multiple Elements and Pages
Sometimes, you need to scrape multiple sections or pages. For instance, if you’re scraping a website with paginated contact listings, you can use Jsoup to extract emails from all those pages by looping through them or following links.
Here’s an approach to scraping emails from multiple pages:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiPageEmailScraper {
    public static void main(String[] args) {
        String baseUrl = "https://example.com/page/"; // Base URL for paginated pages

        // Regular expression for finding email addresses (compiled once, outside the loop)
        Pattern emailPattern = Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

        // Loop through the first 5 pages
        for (int i = 1; i <= 5; i++) {
            String url = baseUrl + i;
            try {
                // Fetch each page
                Document doc = Jsoup.connect(url).get();

                // Select the contact-info divs on the page
                Elements contactSections = doc.select("div.contact-info");

                for (Element section : contactSections) {
                    Matcher emailMatcher = emailPattern.matcher(section.text());
                    while (emailMatcher.find()) {
                        System.out.println("Found email: " + emailMatcher.group());
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
This code example shows how to scrape emails from multiple pages by dynamically changing the URL for each page. The number of pages can be adjusted based on your target site’s pagination.
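Two practical details are worth adding to a loop like this: build the page URLs up front, and pause briefly between requests so you don’t hammer the target server. Here is a minimal, stdlib-only sketch of that idea (the fetching itself is omitted; the delay value and page count are assumptions you should tune per site):

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrlPlanner {
    static final long DELAY_MS = 1000; // assumed polite delay between requests; tune per site

    public static List<String> buildPageUrls(String baseUrl, int pages) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= pages; i++) {
            urls.add(baseUrl + i); // e.g. https://example.com/page/1
        }
        return urls;
    }

    public static void main(String[] args) throws InterruptedException {
        for (String url : buildPageUrls("https://example.com/page/", 3)) {
            System.out.println("Would fetch: " + url);
            // In the real scraper, Jsoup.connect(url).get() would go here
            Thread.sleep(DELAY_MS); // be polite: space out requests
        }
    }
}
```

Separating URL construction from fetching also makes the scraper easier to test, since the loop logic can be verified without any network access.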
Conclusion
In this blog, we explored how to use Jsoup to parse HTML documents and extract email addresses. We learned how to:
- Fetch and parse web pages using Jsoup.
- Target specific HTML elements using CSS selectors.
- Apply regular expressions to extract email addresses.
- Scrape emails from multiple pages.
In the next blog, we’ll look at how to handle dynamic web pages that use JavaScript to load content and how to scrape them effectively using Java.