Creating a Command-Line Email Extractor in Ruby

Email extraction is a common task in marketing, data collection, and web scraping. In this post, we'll walk through building a command-line email extractor in Ruby. Ruby's concise syntax and built-in regular expression support make it a good choice for small tools like this.

Why Use Ruby for Email Extraction?

Ruby is a dynamic, object-oriented programming language known for its readability and ease of use. It’s great for scripting and automating tasks, making it a perfect fit for building a command-line email extractor. The goal is to build a tool that reads a text file, scans its content for email addresses, and outputs the results.

Prerequisites

Before you start, ensure you have the following:

Ruby installed on your machine (version 2.5 or higher)
Basic understanding of Ruby and regular expressions

You can check your Ruby version using:

ruby -v

If Ruby isn’t installed, you can download it from Ruby’s official site.

Step 1: Setting Up the Project

Let’s begin by creating a new Ruby file for our email extractor:

touch email_extractor.rb

Open this file in your favorite text editor, and let’s start coding.

Step 2: Reading the Input File

First, we need to handle reading a text file provided by the user. You can use Ruby’s File class to read the content:

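# email_extractor.rb

# The first command-line argument is the input file
filename = ARGV[0]

if filename.nil?
  # warn writes to standard error; exit 1 signals failure to the shell
  warn "Please provide a file name as an argument."
  exit 1
end

begin
  file_content = File.read(filename)
rescue Errno::ENOENT
  warn "File not found: #{filename}"
  exit 1
end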

This code reads the filename from the command-line arguments and stops with a message if the argument is missing or the file doesn't exist.
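A missing file is only one way a read can fail. As a rough sketch of a broader approach, the helper below rescues SystemCallError, the parent class of Ruby's Errno errors, so a directory or an unreadable file is reported the same way. The name read_input is an assumption made for illustration and is not part of the script above.

```ruby
# Hypothetical helper (the name is an assumption): returns the file's
# content as a string, or nil if the file cannot be read.
def read_input(path)
  File.read(path)
rescue SystemCallError => e
  # Covers Errno::ENOENT, Errno::EACCES, Errno::EISDIR, and similar errors
  warn "Could not read #{path}: #{e.message}"
  nil
end

content = read_input("input.txt")
puts content.nil? ? "Nothing to scan." : "Read #{content.length} characters."
```

Returning nil instead of exiting inside the helper leaves the decision about how to stop to the caller.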

Step 3: Using Regular Expressions to Find Emails

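Email addresses follow a recognizable pattern: a local part, an @ sign, and a domain. Regular expressions (regex) are well suited to finding patterns like this in text. We'll use a basic regex that covers common addresses, though not every form the email standards allow: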

# Basic email regex
email_regex = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/

# Extract emails from the content
emails = file_content.scan(email_regex)

# Display the extracted emails
if emails.empty?
  puts "No emails found in the file."
else
  puts "Extracted Emails:"
  puts emails.uniq
end

Here, we use the scan method to find every match of email_regex in the content. Because the pattern has no capture groups, scan returns an array of the matched strings. Calling uniq on that array ensures each address is displayed only once.
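To make the behavior of scan and uniq concrete, here is a small sketch. The helper name extract_emails and the sample addresses are assumptions made for illustration. It also treats addresses that differ only in letter case as duplicates, which matches how most mail providers behave but is not guaranteed by the email standards.

```ruby
EMAIL_REGEX = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/

# Hypothetical helper: returns the addresses found in text, keeping the
# first spelling of each address when only the letter case differs.
def extract_emails(text)
  text.scan(EMAIL_REGEX).uniq { |email| email.downcase }
end

sample = "Contact alice@example.com or ALICE@example.com; bob@example.org too."
p extract_emails(sample)
# => ["alice@example.com", "bob@example.org"]

# With capture groups, scan returns one array of groups per match instead
p "a@b.co".scan(/(\w+)@(\w+\.\w+)/)
# => [["a", "b.co"]]
```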

Step 4: Enhancing the Email Extractor

The basic extractor works, but it can be extended in several ways. For example, we can allow input from a URL, sanitize the extracted emails, or write the output to a new file.
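As one way to picture the sanitizing idea, the sketch below normalizes a list of extracted addresses. The helper name sanitize_emails and the specific rules (trim whitespace, lowercase, drop addresses with consecutive dots or a leading dot) are assumptions chosen for illustration; adjust them to your own data.

```ruby
# Hypothetical helper: cleans up an array of extracted addresses.
def sanitize_emails(emails)
  emails
    .map { |email| email.strip.downcase }   # normalize spacing and case
    .reject { |email| email.include?("..") || email.start_with?(".") }
    .uniq                                   # drop duplicates after normalizing
    .sort                                   # stable, predictable output order
end

raw = [" Bob@Example.com ", "bob@example.com", "a@b..com", "amy@example.org"]
puts sanitize_emails(raw)
# prints:
# amy@example.org
# bob@example.com
```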

Let’s add an option to save the extracted emails to a file:

# Save emails to a file if the user provides an output filename
output_file = ARGV[1]

# The display code from Step 3 already prints the list, so no else
# branch is needed here; printing again would duplicate the output.
if output_file
  File.open(output_file, "w") do |file|
    emails.uniq.each { |email| file.puts email }
  end
  puts "Emails saved to #{output_file}"
end

Now the user can specify an output file, like so:

ruby email_extractor.rb input.txt output_emails.txt
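Positional arguments work well for two values. If the tool grows more options, Ruby's standard optparse library is one alternative. The sketch below is an assumption about how that could look, not a change to the script above: the -o/--output flag and the parse_options helper are hypothetical names.

```ruby
require "optparse"

# Hypothetical helper: turns an argument list into an options hash.
def parse_options(argv)
  options = { output: nil }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ruby email_extractor.rb [-o OUTPUT] INPUT"
    opts.on("-o", "--output FILE", "Write the extracted emails to FILE") do |file|
      options[:output] = file
    end
  end
  # parse returns the arguments that are not options and leaves argv unchanged
  remaining = parser.parse(argv)
  options[:input] = remaining.first
  options
end

options = parse_options(["-o", "output_emails.txt", "input.txt"])
puts "Input:  #{options[:input]}"
puts "Output: #{options[:output]}"
```

In a real script you would pass ARGV instead of the literal array. OptionParser also generates a --help message from the banner and option descriptions.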

Step 5: Testing the Command-Line Email Extractor

To test your script, create a sample text file, input.txt, containing email addresses:

[email protected]
[email protected]
invalid-[email protected]

Run your script from the command line:

ruby email_extractor.rb input.txt

You should see the valid email addresses extracted from the file. If an output file is provided, the emails will also be saved there.
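If you would rather check the behavior without creating files by hand, a small self-contained script can build a temporary input file and run the same extraction on it. This is only a sketch: the helper name extract_from_file and the sample addresses are invented for illustration.

```ruby
require "tempfile"

# Hypothetical helper: reads a file and returns its unique email addresses.
def extract_from_file(path)
  pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
  File.read(path).scan(pattern).uniq
end

# Tempfile.create removes the file automatically when the block ends
Tempfile.create("sample") do |file|
  file.write("jane@example.com\njohn.doe@example.org\nnot-an-email\njane@example.com\n")
  file.flush # make sure the content is on disk before reading it back

  found = extract_from_file(file.path)
  puts found
  # prints:
  # jane@example.com
  # john.doe@example.org
end
```

The line without an @ sign is skipped, and the repeated address appears only once.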

Conclusion

In this post, we built a simple command-line email extractor in Ruby. The tool can be extended in several ways, such as adding web scraping or using a stricter regex that covers more email formats.