Email Addresses

Introduction

The function clean_email() cleans and standardizes a DataFrame column containing email addresses. The function validate_email() validates either a single email address or a column containing email addresses, returning True if the email address is valid and False otherwise.

To remove all whitespace from the input value before cleaning it, you can set the parameter remove_whitespace to True. This will clean an invalid email address like “hello @example.org” to “hello@example.org”.
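As a rough illustration of what removing all whitespace means, the snippet below strips every whitespace character from a string in plain Python. The helper name strip_all_whitespace is hypothetical, and this is a sketch rather than DataPrep's internal code.

```python
def strip_all_whitespace(value: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines) from value."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces back together drops all of it.
    return "".join(value.split())


print(strip_all_whitespace("hello @example.org"))
print(strip_all_whitespace("y i@sfu.ca"))
```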

When the parameter fix_domain is set to True, clean_email() tries to correct typos in the domain of each email address and uses the first valid domain it finds. It applies four strategies to fix a domain:

  • Swap neighboring characters. This will fix “gmali.com” to “gmail.com”.

  • Add a single character. This will fix “gmal.com” to “gmail.com”.

  • Remove a single character. This will fix “gmails.com” to “gmail.com”.

  • Replace each character with a nearby key on the QWERTY keyboard. This will fix “gmqil.com” to “gmail.com”.
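The four strategies above can be sketched as a generator of single-edit candidates that are checked against a list of known domains. This is only an illustration under stated assumptions: KNOWN_DOMAINS, NEARBY_KEYS, candidate_domains and fix_domain_sketch are hypothetical names, the keyboard map covers just a few keys, and DataPrep's actual implementation and domain list may differ.

```python
import string

# Assumption: a tiny hand-written list of valid domains, for illustration only.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com"]

# Assumption: a partial QWERTY adjacency map covering only a few keys.
NEARBY_KEYS = {
    "q": "wa",
    "w": "qes",
    "s": "awedxz",
    "o": "ipkl",
    "p": "ol",
}


def candidate_domains(domain: str):
    """Yield single-edit variants of domain, one strategy at a time."""
    # 1. Swap neighboring characters.
    for i in range(len(domain) - 1):
        yield domain[:i] + domain[i + 1] + domain[i] + domain[i + 2:]
    # 2. Add a single character at every position.
    for i in range(len(domain) + 1):
        for ch in string.ascii_lowercase:
            yield domain[:i] + ch + domain[i:]
    # 3. Remove a single character.
    for i in range(len(domain)):
        yield domain[:i] + domain[i + 1:]
    # 4. Replace a character with one of its nearby keys.
    for i, ch in enumerate(domain):
        for near in NEARBY_KEYS.get(ch, ""):
            yield domain[:i] + near + domain[i + 1:]


def fix_domain_sketch(domain: str) -> str:
    """Return the first known domain reachable by one edit, else the input."""
    if domain in KNOWN_DOMAINS:
        return domain
    for candidate in candidate_domains(domain):
        if candidate in KNOWN_DOMAINS:
            return candidate
    return domain


for typo in ["gmali.com", "gmal.com", "gmails.com", "gmqil.com"]:
    print(typo, "->", fix_domain_sketch(typo))
```

Returning as soon as a candidate matches mirrors the "first valid domain found" behavior described above; a domain with no match within one edit is returned unchanged in this sketch.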

You can split the column of email addresses into one column for the usernames and another for the domains by setting the parameter split to True.
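Conceptually, splitting an address means cutting it at the "@" sign. The plain-Python sketch below shows the idea for a single well-formed address; split_email is a hypothetical helper, not part of DataPrep.

```python
def split_email(email: str):
    """Split a well-formed address at its last "@" into (username, domain)."""
    # rpartition splits on the final "@"; this sketch assumes one is present.
    username, _, domain = email.rpartition("@")
    return username, domain


print(split_email("yi@sfu.ca"))
```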

The errors parameter controls how values that cannot be parsed are handled:

  • “coerce” (default): invalid values are set to NaN

  • “ignore”: invalid values are returned unchanged

  • “raise”: invalid values raise an exception
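The three modes can be pictured with a small stand-alone function. This is a minimal sketch, not DataPrep's code: clean_one and SIMPLE_EMAIL are hypothetical, the pattern is far looser than a real email check, and the use of ValueError is an assumption since the exception type is not stated here.

```python
import math
import re

# Assumption: a deliberately simple pattern, much looser than a real validator.
SIMPLE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def clean_one(value: str, errors: str = "coerce"):
    """Clean a single value, handling invalid input according to errors."""
    if SIMPLE_EMAIL.fullmatch(value):
        return value.lower()
    if errors == "coerce":
        return math.nan  # invalid value becomes NaN
    if errors == "ignore":
        return value  # invalid value is returned as-is
    if errors == "raise":
        # Assumption: ValueError is used here purely for illustration.
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors mode {errors!r}")


print(clean_one("Yi@gmail.com"))
print(clean_one("hello"))
print(clean_one("hello", errors="ignore"))
```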

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (a value counts as cleaned only if it was transformed).

  • How many values could not be parsed.

  • A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
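To make the report's quantities concrete, the sketch below recomputes them from a list of inputs and a list of cleaned outputs, using the values from the default clean_email() example in this document. The names NULL_LIKE, is_missing and summarize are hypothetical, and treating "NULL" as a missing input is an assumption inferred from that example's report.

```python
import math

# Assumption: "NULL" is treated as a missing value on input, inferred from the
# default example's report rather than from DataPrep's source.
NULL_LIKE = {"NULL"}


def is_missing(value) -> bool:
    """True for NaN and for strings assumed to stand for missing data."""
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in NULL_LIKE


def summarize(original, cleaned):
    """Count the report's quantities, each with its percentage of all rows."""
    total = len(original)
    counts = {"cleaned": 0, "unparsed": 0, "valid": 0, "null": 0}
    for before, after in zip(original, cleaned):
        if is_missing(after):
            counts["null"] += 1
            if not is_missing(before):
                counts["unparsed"] += 1  # had a value, but it could not be parsed
        else:
            counts["valid"] += 1
            if after != before:
                counts["cleaned"] += 1  # valid and actually transformed
    return {name: (n, 100 * n / total) for name, n in counts.items()}


nan = math.nan
original = ["yi@gmali.com", "yi@sfu.ca", "y i@sfu.ca", "Yi@gmail.com",
            "H ELLO@hotmal.COM", "hello", nan, "NULL"]
cleaned = ["yi@gmali.com", "yi@sfu.ca", nan, "yi@gmail.com",
           nan, nan, nan, nan]
print(summarize(original, cleaned))
```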

The following sections demonstrate the functionality of clean_email() and validate_email().

An example dataset with email addresses

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "email": [
        "yi@gmali.com", "yi@sfu.ca", "y i@sfu.ca", "Yi@gmail.com",
        "H ELLO@hotmal.COM", "hello", np.nan, "NULL"
    ]
})
df
[1]:
               email
0       yi@gmali.com
1          yi@sfu.ca
2         y i@sfu.ca
3       Yi@gmail.com
4  H ELLO@hotmal.COM
5              hello
6                NaN
7               NULL

1. Default clean_email()

By default, clean_email() will do a strict check to determine if an email address is in the correct format and set invalid values to NaN.

[2]:
from dataprep.clean import clean_email
clean_email(df, "email")
email Cleaning Report:
        1 values cleaned (12.5%)
        3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[2]:
               email   email_clean
0       yi@gmali.com  yi@gmali.com
1          yi@sfu.ca     yi@sfu.ca
2         y i@sfu.ca           NaN
3       Yi@gmail.com  yi@gmail.com
4  H ELLO@hotmal.COM           NaN
5              hello           NaN
6                NaN           NaN
7               NULL           NaN

2. split parameter

By setting the split parameter to True, the returned table will contain separate columns for the domain and username of valid emails.

[3]:
clean_email(df, "email", split=True)
email Cleaning Report:
        1 values cleaned (12.5%)
        3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[3]:
               email  username     domain
0       yi@gmali.com        yi  gmali.com
1          yi@sfu.ca        yi     sfu.ca
2         y i@sfu.ca       NaN        NaN
3       Yi@gmail.com        yi  gmail.com
4  H ELLO@hotmal.COM       NaN        NaN
5              hello       NaN        NaN
6                NaN       NaN        NaN
7               NULL       NaN        NaN

3. remove_whitespace parameter

When the remove_whitespace parameter is set to True, whitespace will be removed before checking if an email is valid.

[4]:
clean_email(df, "email", remove_whitespace=True)
email Cleaning Report:
        2 values cleaned (25.0%)
        1 values unable to be parsed (12.5%), set to NaN
Result contains 5 (62.5%) values in the correct format and 3 null values (37.5%)
[4]:
               email       email_clean
0       yi@gmali.com      yi@gmali.com
1          yi@sfu.ca         yi@sfu.ca
2         y i@sfu.ca         yi@sfu.ca
3       Yi@gmail.com      yi@gmail.com
4  H ELLO@hotmal.COM  hello@hotmal.com
5              hello               NaN
6                NaN               NaN
7               NULL               NaN

4. fix_domain parameter

When the fix_domain parameter is set to True, clean_email() will try to correct invalid domains.

[5]:
clean_email(df, "email", fix_domain=True)
email Cleaning Report:
        2 values cleaned (25.0%)
        3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[5]:
               email   email_clean
0       yi@gmali.com  yi@gmail.com
1          yi@sfu.ca     yi@sfu.ca
2         y i@sfu.ca           NaN
3       Yi@gmail.com  yi@gmail.com
4  H ELLO@hotmal.COM           NaN
5              hello           NaN
6                NaN           NaN
7               NULL           NaN

5. errors parameter

When errors="ignore", invalid emails will be left unchanged in the output.

[6]:
clean_email(df, "email", errors="ignore")
email Cleaning Report:
        1 values cleaned (12.5%)
        3 values unable to be parsed (37.5%), left unchanged
Result contains 3 (37.5%) values in the correct format and 2 null values (25.0%)
[6]:
               email        email_clean
0       yi@gmali.com       yi@gmali.com
1          yi@sfu.ca          yi@sfu.ca
2         y i@sfu.ca         y i@sfu.ca
3       Yi@gmail.com       yi@gmail.com
4  H ELLO@hotmal.COM  H ELLO@hotmal.COM
5              hello              hello
6                NaN                NaN
7               NULL               NULL

6. validate_email()

The function validate_email() returns True if an email address is valid and False otherwise. It can be applied to a single string or to a column of email addresses.

[7]:
from dataprep.clean import validate_email
print(validate_email('Abc.example.com'))
print(validate_email('prettyandsimple@example.com'))
print(validate_email('disposable.style.email.with+symbol@example.com'))
print(validate_email('this is"not\allowed@example.com'))
False
True
True
False
[8]:
validate_email(df["email"])
[8]:
0     True
1     True
2    False
3     True
4    False
5    False
6    False
7    False
Name: email, dtype: bool

Note that validate_email() performs a strict semantic check by default.