Email Addresses

Introduction

The function clean_email() cleans and standardizes a DataFrame column containing email addresses. The function validate_email() validates either a single email address or a column containing email addresses, returning True if the email address is valid and False otherwise.

To remove all whitespace from the input value before cleaning it, you can set the parameter remove_whitespace to True. This will clean an invalid email address like “hello @example.org” to “hello@example.org”.
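
As a rough illustration of what "remove all whitespace" means here, the sketch below strips every whitespace character from a string using only the standard library. The helper name strip_all_whitespace is hypothetical and is not part of DataPrep; it only mirrors the behavior described above.

```python
def strip_all_whitespace(value: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines) from value."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops interior whitespace as well as
    # leading and trailing whitespace.
    return "".join(value.split())


print(strip_all_whitespace("hello @example.org"))  # hello@example.org
print(strip_all_whitespace("y i@sfu.ca"))          # yi@sfu.ca
```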

When set to True, the parameter fix_domain tries to correct typos in the domain of an email address and returns the first valid domain it finds. It applies four strategies to fix a domain:

Swap neighboring characters. This will fix “gmali.com” to “gmail.com”.
Add a single character. This will fix “gmal.com” to “gmail.com”.
Remove a single character. This will fix “gmails.com” to “gmail.com”.
Replace a character with a nearby key on the QWERTY keyboard. This will fix “gmqil.com” to “gmail.com”.
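
The four strategies can be sketched as a generator of single-edit candidates that are checked against a set of known domains. This is a minimal sketch and not DataPrep's actual implementation: the names KNOWN_DOMAINS, nearby_keys, candidates, and suggest_domain are hypothetical, the list of known domains is a made-up stand-in, and the keyboard-adjacency rule is an approximation.

```python
import string

# Assumption: a tiny allow-list standing in for whatever set of valid
# domains a real implementation would consult.
KNOWN_DOMAINS = {"gmail.com", "hotmail.com", "yahoo.com"}

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def nearby_keys(ch):
    """Return letters roughly adjacent to ch on a QWERTY layout."""
    for r, row in enumerate(QWERTY_ROWS):
        c = row.find(ch)
        if c == -1:
            continue
        keys = []
        # Look at the same row and the rows directly above and below.
        for rr in (r - 1, r, r + 1):
            if 0 <= rr < len(QWERTY_ROWS):
                for cc in (c - 1, c, c + 1):
                    in_row = 0 <= cc < len(QWERTY_ROWS[rr])
                    if in_row and (rr, cc) != (r, c):
                        keys.append(QWERTY_ROWS[rr][cc])
        return keys
    return []  # not a letter key, e.g. "."


def candidates(domain):
    """Yield single-edit variants of domain, one strategy at a time."""
    n = len(domain)
    # 1. Swap neighboring characters.
    for i in range(n - 1):
        yield domain[:i] + domain[i + 1] + domain[i] + domain[i + 2:]
    # 2. Add a single character.
    for i in range(n + 1):
        for ch in string.ascii_lowercase:
            yield domain[:i] + ch + domain[i:]
    # 3. Remove a single character.
    for i in range(n):
        yield domain[:i] + domain[i + 1:]
    # 4. Replace a character with a nearby key.
    for i, ch in enumerate(domain):
        for key in nearby_keys(ch):
            yield domain[:i] + key + domain[i + 1:]


def suggest_domain(domain):
    """Return domain if known, else the first known variant, else None."""
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return domain
    for cand in candidates(domain):
        if cand in KNOWN_DOMAINS:
            return cand
    return None


for typo in ["gmali.com", "gmal.com", "gmails.com", "gmqil.com"]:
    print(typo, "->", suggest_domain(typo))  # each prints "... -> gmail.com"
```

Because the candidates are generated strategy by strategy, the first match wins, which mirrors the "first valid domain found" behavior described above.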

You can split the column of email addresses into one column for the usernames and another for the domains by setting the parameter split to True.
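
Conceptually, splitting an address amounts to cutting it at the "@" separator. The sketch below assumes the address has already been validated; split_email is a hypothetical helper and not a DataPrep function.

```python
from typing import Tuple


def split_email(address: str) -> Tuple[str, str]:
    """Split an already validated address into (username, domain)."""
    # Cut at the last "@": a domain never contains "@", whereas a quoted
    # username may, so the last occurrence is the real separator.
    username, _, domain = address.rpartition("@")
    return username, domain


print(split_email("yi@sfu.ca"))     # ('yi', 'sfu.ca')
print(split_email("yi@gmail.com"))  # ('yi', 'gmail.com')
```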

The errors parameter controls how values that cannot be parsed are handled:

“coerce” (default): values that cannot be parsed are set to NaN
“ignore”: values that cannot be parsed are returned unchanged
“raise”: a value that cannot be parsed raises an exception
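
The three modes can be pictured with a small per-value helper. This is only a sketch of the semantics listed above: clean_one is a hypothetical name, the regular expression is a deliberately loose stand-in for the library's stricter validity check, and the choice of ValueError as the exception type is an assumption.

```python
import math
import re

# Assumption: a loose "something@something.tld" pattern, used here only
# so that the example has some notion of a valid address.
_LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def clean_one(value, errors="coerce"):
    """Lowercase a valid address; otherwise apply the chosen errors mode."""
    if isinstance(value, str) and _LOOSE_EMAIL.fullmatch(value):
        return value.lower()
    if errors == "coerce":
        return math.nan  # invalid value becomes NaN
    if errors == "ignore":
        return value     # invalid value is passed through unchanged
    if errors == "raise":
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors option {errors!r}")


print(clean_one("Yi@gmail.com"))              # yi@gmail.com
print(clean_one("hello"))                     # nan
print(clean_one("hello", errors="ignore"))    # hello
```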

After cleaning, a report is printed that provides the following information:

How many values were cleaned (a value counts as cleaned only if it was transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.

The following sections demonstrate the functionality of clean_email() and validate_email().

An example dataset with email addresses

[1]:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    "email": [
        "yi@gmali.com", "yi@sfu.ca", "y i@sfu.ca", "Yi@gmail.com",
        "H ELLO@hotmal.COM", "hello", np.nan, "NULL"
    ]
})
df

[1]:

	email
0	yi@gmali.com
1	yi@sfu.ca
2	y i@sfu.ca
3	Yi@gmail.com
4	H ELLO@hotmal.COM
5	hello
6	NaN
7	NULL

1. Default clean_email()

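By default, clean_email() performs a strict check to determine whether each email address is in the correct format and sets invalid values to NaN. A mistyped but well-formed domain such as “gmali.com” passes this check and is left as is.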

[2]:

from dataprep.clean import clean_email
clean_email(df, "email")

email Cleaning Report:
        1 values cleaned (12.5%)
        3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[2]:

	email	email_clean
0	yi@gmali.com	yi@gmali.com
1	yi@sfu.ca	yi@sfu.ca
2	y i@sfu.ca	NaN
3	Yi@gmail.com	yi@gmail.com
4	H ELLO@hotmal.COM	NaN
5	hello	NaN
6	NaN	NaN
7	NULL	NaN

2. `split` parameter

By setting the split parameter to True, the returned table will contain separate columns for the domain and username of valid emails.

[3]:

clean_email(df, "email", split=True)

email Cleaning Report:
        1 values cleaned (12.5%)
        3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[3]:

	email	username	domain
0	yi@gmali.com	yi	gmali.com
1	yi@sfu.ca	yi	sfu.ca
2	y i@sfu.ca	NaN	NaN
3	Yi@gmail.com	yi	gmail.com
4	H ELLO@hotmal.COM	NaN	NaN
5	hello	NaN	NaN
6	NaN	NaN	NaN
7	NULL	NaN	NaN

3. `remove_whitespace` parameter

When the remove_whitespace parameter is set to True, whitespace will be removed before checking if an email is valid.

[4]:

clean_email(df, "email", remove_whitespace=True)

email Cleaning Report:
        2 values cleaned (25.0%)
        1 values unable to be parsed (12.5%), set to NaN
Result contains 5 (62.5%) values in the correct format and 3 null values (37.5%)

[4]:

	email	email_clean
0	yi@gmali.com	yi@gmali.com
1	yi@sfu.ca	yi@sfu.ca
2	y i@sfu.ca	yi@sfu.ca
3	Yi@gmail.com	yi@gmail.com
4	H ELLO@hotmal.COM	hello@hotmal.com
5	hello	NaN
6	NaN	NaN
7	NULL	NaN

4. `fix_domain` parameter

When the fix_domain parameter is set to True, clean_email() will try to correct invalid domains.

[5]:

clean_email(df, "email", fix_domain=True)

email Cleaning Report:
        2 values cleaned (25.0%)
        3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[5]:

	email	email_clean
0	yi@gmali.com	yi@gmail.com
1	yi@sfu.ca	yi@sfu.ca
2	y i@sfu.ca	NaN
3	Yi@gmail.com	yi@gmail.com
4	H ELLO@hotmal.COM	NaN
5	hello	NaN
6	NaN	NaN
7	NULL	NaN

5. `errors` parameter

When errors is set to "ignore", invalid emails are left unchanged in the output.

[6]:

clean_email(df, "email", errors="ignore")

email Cleaning Report:
        1 values cleaned (12.5%)
        3 values unable to be parsed (37.5%), left unchanged
Result contains 3 (37.5%) values in the correct format and 2 null values (25.0%)

[6]:

	email	email_clean
0	yi@gmali.com	yi@gmali.com
1	yi@sfu.ca	yi@sfu.ca
2	y i@sfu.ca	y i@sfu.ca
3	Yi@gmail.com	yi@gmail.com
4	H ELLO@hotmal.COM	H ELLO@hotmal.COM
5	hello	hello
6	NaN	NaN
7	NULL	NULL

6. `validate_email()`

The function validate_email() returns True if an email address is valid and False otherwise. It can be applied to a single string or to a column of email addresses.

[7]:

from dataprep.clean import validate_email
print(validate_email('Abc.example.com'))  # no "@" separator
print(validate_email('prettyandsimple@example.com'))
print(validate_email('disposable.style.email.with+symbol@example.com'))
# Raw string so that "\a" stays a backslash followed by "a" rather than a bell character
print(validate_email(r'this is"not\allowed@example.com'))

False
True
True
False

[8]:

validate_email(df["email"])

[8]:

0     True
1     True
2    False
3     True
4    False
5    False
6    False
7    False
Name: email, dtype: bool

Note that validate_email() performs a strict semantic check by default.
