Extract Email Addresses From Text

From RL3 Wiki

You often need to extract all e-mail addresses from a text file in which they are mixed with other text.

If you have a Unix-based system, you can try to solve this task with a tool like grep and a simple regular expression such as [[:alnum:]+\.\_\-]*@[[:alnum:]\.\-]*. For an input file test.txt the command may look like:

$ grep -o "[[:alnum:]+\.\_\-]*@[[:alnum:]\.\-]*" test.txt | sort | uniq

The problem with this solution is that it covers only simple cases. Real-world e-mail addresses are not always valid from a computer's point of view: they are frequently written for human eyes and may be obfuscated to defeat automatic scraping. Let's test this command with the following text:

A simple email may look like test@test.com, but.. it may also contain Unicode: e.g. test@тест.укр, or be protected with anti-scraping techniques: test(at)test(dot)com, test @ test . com, test@test[punct]com, etc.

The result will be:

@
test@test
test@test.com
test@тест.укр

which is not exactly what we need.
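The same limitation can be reproduced outside the shell. The following is a minimal Python sketch, not part of RL3: it applies a rough counterpart of the grep expression to the sample text. The `\w` class is an assumption standing in for `[[:alnum:]_]`, and the names `text`, `naive` and `found` are illustrative only.

```python
import re

# Sample text from the example above.
text = (
    "A simple email may look like test@test.com, but.. it may also contain "
    "Unicode: e.g. test@тест.укр, or be protected with anti-scraping "
    "techniques: test(at)test(dot)com, test @ test . com, "
    "test@test[punct]com, etc."
)

# Rough Python counterpart of [[:alnum:]+\.\_\-]*@[[:alnum:]\.\-]*
# (\w covers Unicode letters, digits and the underscore).
naive = re.compile(r"[\w+.\-]*@[\w.\-]*")

# Equivalent of `grep -o ... | sort | uniq`.
found = sorted(set(naive.findall(text)))
print(found)
```

The obfuscated variants are either missed entirely or cut short, and a bare @ is reported as a match because both character classes are allowed to match zero characters.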

We could keep extending this regular expression to cover every real-world variant, but it would quickly become unwieldy and hard to maintain. Instead, let's implement a solution based on RL3 patterns (refer to the RL3 Installation Guide).

In RL3 you can define your own named patterns and reuse them inside other patterns. The concept is similar to functions in mainstream programming languages.

Let's define an email pattern:

pattern EMAIL [icase]
    ({EMAIL_Name}{EMAIL_At}{EMAIL_DomainLabel}{EMAIL_Dot}{TLD})

This pattern defines an email by following its natural structure. An email address consists of:

  • email name -- the email part in email@domain.com
  • @ symbol, which also can be (at), [at], etc.
  • domain label -- the domain part in email@domain.com
  • . symbol, which also can be (dot), [punct], etc.
  • domain extension or Top Level Domain name (TLD) -- the com part in email@domain.com

For @ and . we can define patterns EMAIL_At and EMAIL_Dot as follows:

pattern EMAIL_Wrapper(x)
    (
         \s?\{\s?{x}\s?\}\s?
        |\s?\[\s?{x}\s?\]\s?
        |\s?<\s?{x}\s?>\s?
        |\s?\(\s?{x}\s?\)\s?
        |\s?\(\s?\(\s?{x}\s?\)\s?\)\s?
        |\s?\(\s?<\s?{x}\s?>\s?\)\s?
        |\s-{x}-\s
    )

pattern EMAIL_At [icase]
    (
         \s?@\s?
        |\*@\*
        |{EMAIL_Wrapper (@|a|ä|at|a\st|ät|ä\st|att|ätt|miukumauku|se\skumma\skiemura|ätmerkki|_at_|_ät_|miuku|ät-merkki)}
    )

pattern EMAIL_Dot [icase]
    (
         \s?\.\s?
        |\(\.\)
        |{EMAIL_Wrapper (dot|punkt|punct|piste)}
    )

Notice the use of the template pattern EMAIL_Wrapper(x). During compilation, RL3 generates a new pattern from the template by substituting the given value for every {x} reference in its definition.

This way we have patterns that match common anti-scraping obfuscations such as (at), {at}, (dot), [punct], etc.
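For readers more familiar with ordinary regular expressions, the template idea can be approximated in Python with a function that builds a regex fragment from its argument. This is only a sketch of the substitution mechanism, not how RL3 compiles patterns: the helper `wrapper`, the `BRACKETS` table and the shortened list of @ spellings are assumptions made for illustration.

```python
import re

# Opening/closing delimiters mirroring the alternatives of EMAIL_Wrapper.
BRACKETS = [
    (r"\{", r"\}"),
    (r"\[", r"\]"),
    ("<", ">"),
    (r"\(", r"\)"),
    (r"\(\s?\(", r"\)\s?\)"),
    (r"\(\s?<", r">\s?\)"),
]

def wrapper(x: str) -> str:
    """Return a regex fragment matching x inside any of the bracket styles."""
    inner = "(?:" + x + ")"
    forms = [r"\s?" + o + r"\s?" + inner + r"\s?" + c + r"\s?" for o, c in BRACKETS]
    forms.append(r"\s-" + inner + r"-\s")  # the " -x- " form
    return "(?:" + "|".join(forms) + ")"

# Shortened stand-ins for EMAIL_At and EMAIL_Dot (only a few spellings kept).
AT = re.compile(r"(?:\s?@\s?|\*@\*|" + wrapper("@|at") + ")", re.IGNORECASE)
DOT = re.compile(r"(?:\s?\.\s?|\(\.\)|" + wrapper("dot|punkt|punct|piste") + ")",
                 re.IGNORECASE)

for sample in ["(at)", "[ AT ]", "((at))", " @ "]:
    print(repr(sample), bool(AT.fullmatch(sample)))
for sample in ["[punct]", " -dot- ", "(.)", "(at)"]:
    print(repr(sample), bool(DOT.fullmatch(sample)))
```

Calling `wrapper` twice with different arguments mirrors what the template does for EMAIL_At and EMAIL_Dot: the bracket handling is written once and reused.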

Patterns for an email name and domain label can be defined as follows:

pattern EMAIL_Name [icase]
    ({token}(([+_\-]|{EMAIL_Dot}){token}){0,5})

pattern EMAIL_DomainLabel [icase]
    ({token}((-|{EMAIL_Dot}){token}){0,5})

Both patterns are very similar and can be read as: a sequence of tokens separated by allowed dividers (+, _, -, or {EMAIL_Dot} for an email name; - or {EMAIL_Dot} for a domain label), with at most five divider-token pairs after the first token. Note: token is a built-in pattern.
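The "tokens separated by dividers" structure can be sketched with plain Python regular expressions. The definition of `TOKEN` as a run of letters and digits is an assumption, since the exact behaviour of the built-in RL3 token pattern is not described here, and `DOT` is a shortened stand-in for EMAIL_Dot.

```python
import re

TOKEN = r"[^\W_]+"  # assumption: a token is a run of letters and digits
DOT = r"(?:\s?\.\s?|\s?\(\s?dot\s?\)\s?)"  # shortened stand-in for EMAIL_Dot

# One token, then up to five (divider, token) pairs.
NAME = re.compile(
    TOKEN + r"(?:(?:[+_\-]|" + DOT + ")" + TOKEN + "){0,5}", re.IGNORECASE
)
LABEL = re.compile(
    TOKEN + r"(?:(?:-|" + DOT + ")" + TOKEN + "){0,5}", re.IGNORECASE
)

print(bool(NAME.fullmatch("john.doe+news")))         # dividers . and +
print(bool(NAME.fullmatch("john(dot)doe")))          # obfuscated dot
print(bool(LABEL.fullmatch("mail-server.example")))  # dividers - and .
print(bool(LABEL.fullmatch("mail_example")))         # _ is not allowed in a label
```

The only difference between the two expressions is the divider set, which is why an underscore is accepted in a name but rejected in a domain label.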

And finally, a TLD pattern is defined as:

pattern TLD [icase]
    {dawg tld}

which matches entries from a built-in tld dictionary.
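Dictionary-based matching of this kind can be imitated in Python by compiling a word list into a single alternation. The sketch below is an illustration only: `SAMPLE_TLDS` is a tiny hypothetical list, not the contents of the RL3 tld dictionary, and a regex alternation is used in place of a DAWG.

```python
import re

# Hypothetical sample; the real dictionary is far larger.
SAMPLE_TLDS = {"com", "co", "org", "net", "fi", "укр"}

# Try longer entries first so that "com" wins over "co".
alternatives = sorted(SAMPLE_TLDS, key=len, reverse=True)

# The lookahead rejects a match that is followed by another letter or digit.
TLD = re.compile(
    "(?:" + "|".join(map(re.escape, alternatives)) + r")(?![^\W_])",
    re.IGNORECASE,
)

for sample in ["com", "ORG", "укр", "example", "company"]:
    print(sample, bool(TLD.match(sample)))
```

A plain set lookup would work equally well for checking a single candidate string; the compiled alternation is shown because it can be embedded in a larger expression.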

All patterns described above are already defined in the RL3 StdLib module email.rl3 and can be used in patterns passed to the RL3 Grep tool:

$ rl3grep -o "{EMAIL}" test.txt | sort | uniq

which will produce the following output:

test(at)test(dot)com
test @ test . com
test@test.com
test@test[punct]com
test@тест.укр
