The function clean_email() cleans and standardizes a DataFrame column containing email addresses. The function validate_email() validates either a single email address or a column containing email addresses, returning True if the email address is valid and False otherwise.
To remove all whitespace from the input value before cleaning it, you can set the parameter remove_whitespace to True. This will clean an invalid email address like “hello @example.org” to “hello@example.org”.
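The effect of remove_whitespace can be approximated in plain Python. This is only a sketch of the idea, not DataPrep's implementation, and the helper name strip_all_whitespace is hypothetical.

```python
def strip_all_whitespace(value: str) -> str:
    """Drop every whitespace character, including ones inside the string."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs and newlines alike.
    return "".join(value.split())


print(strip_all_whitespace("hello @example.org"))
```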
When the parameter fix_domain is set to True, clean_email() tries to correct typos in the domain of an email address and returns the first valid domain it finds. It uses four strategies to fix a domain:
Swap neighboring characters. This will fix “gmali.com” to “gmail.com”.
Add a single character. This will fix “gmal.com” to “gmail.com”.
Remove a single character. This will fix “gmails.com” to “gmail.com”.
Replace each character with a neighboring key on the QWERTY keyboard. This will fix “gmqil.com” to “gmail.com”.
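The four strategies above can be sketched as a candidate generator that stops at the first known domain. This is an illustration under stated assumptions, not the library's actual code: the names KNOWN_DOMAINS, QWERTY_NEIGHBORS, candidate_domains and fix_domain_sketch are hypothetical, the list of valid domains and the keyboard map are tiny stand-ins, and the order in which the strategies are tried is assumed.

```python
import string

# Assumption: a tiny whitelist standing in for whatever list of valid
# domains the library consults.
KNOWN_DOMAINS = {"gmail.com", "hotmail.com", "yahoo.com"}

# Assumption: a partial map of adjacent QWERTY keys, enough for the demo.
QWERTY_NEIGHBORS = {
    "q": "wa", "w": "qase", "a": "qwsz", "s": "awedxz",
    "i": "ujko", "o": "iklp", "l": "kop", "m": "njk",
}


def candidate_domains(domain):
    """Yield domains one edit away from `domain`, one strategy at a time."""
    n = len(domain)
    # 1. Swap neighboring characters.
    for i in range(n - 1):
        yield domain[:i] + domain[i + 1] + domain[i] + domain[i + 2:]
    # 2. Add a single character.
    for i in range(n + 1):
        for ch in string.ascii_lowercase:
            yield domain[:i] + ch + domain[i:]
    # 3. Remove a single character.
    for i in range(n):
        yield domain[:i] + domain[i + 1:]
    # 4. Replace a character with a key next to it on the keyboard.
    for i, ch in enumerate(domain):
        for near in QWERTY_NEIGHBORS.get(ch, ""):
            yield domain[:i] + near + domain[i + 1:]


def fix_domain_sketch(domain):
    """Return the first known domain reachable by one edit, else the input."""
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return domain
    for candidate in candidate_domains(domain):
        if candidate in KNOWN_DOMAINS:
            return candidate
    return domain


for typo in ["gmali.com", "gmal.com", "gmails.com", "gmqil.com"]:
    print(typo, "->", fix_domain_sketch(typo))
```

Returning the first match keeps the sketch in line with the rule that the first valid domain found is returned; a domain with no known neighbor, such as "sfu.ca" here, is passed through unchanged.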
You can split the column of email addresses into one column for the usernames and another for the domains by setting the parameter split to True.
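For a single address, the split amounts to separating the text before the last "@" from the text after it. The helper below is a hypothetical illustration of that idea in plain Python, not the code clean_email() runs.

```python
def split_email(address: str):
    """Return (username, domain), splitting on the last "@"."""
    username, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" at all, so there is nothing to split.
        raise ValueError(f"no '@' in {address!r}")
    return username, domain


username, domain = split_email("hello@example.org")
print(username, domain)
```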
Invalid parsing is handled with the errors parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
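The three modes can be pictured with a small per-value function. This is a sketch under assumptions: looks_valid is a deliberately loose, hypothetical stand-in for the real validity check, handle_value is a hypothetical name, and missing values are not treated specially here.

```python
import math


def looks_valid(value) -> bool:
    """Assumption: a loose stand-in for the library's validity check."""
    if not isinstance(value, str):
        return False
    username, sep, domain = value.rpartition("@")
    return bool(sep and username and "." in domain and " " not in value)


def handle_value(value, errors="coerce"):
    """Mimic the three `errors` modes for a single value."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors mode: {errors!r}")
    if looks_valid(value):
        return value
    if errors == "raise":
        raise ValueError(f"unable to parse value {value!r}")
    if errors == "ignore":
        return value  # hand the input back untouched
    return math.nan  # "coerce": replace the invalid value with NaN


print(handle_value("hello"))                   # coerce (default)
print(handle_value("hello", errors="ignore"))  # input returned as-is
try:
    handle_value("hello", errors="raise")
except ValueError as exc:
    print("raised:", exc)
```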
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
The following sections demonstrate the functionality of clean_email() and validate_email().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "email": [
        "yi@gmali.com",
        "yi@sfu.ca",
        "y i@sfu.ca",
        "Yi@gmail.com",
        "H ELLO@hotmal.COM",
        "hello",
        np.nan,
        "NULL"
    ]
})
df
By default, clean_email() will do a strict check to determine if an email address is in the correct format and set invalid values to NaN.
[2]:
from dataprep.clean import clean_email
clean_email(df, "email")

email Cleaning Report:
    1 values cleaned (12.5%)
    3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
By setting the split parameter to True, the returned table will contain separate columns for the domain and username of valid emails.
[3]:
clean_email(df, "email", split=True)
When the remove_whitespace parameter is set to True, whitespace will be removed before checking if an email is valid.
[4]:
clean_email(df, "email", remove_whitespace=True)
email Cleaning Report:
    2 values cleaned (25.0%)
    1 values unable to be parsed (12.5%), set to NaN
Result contains 5 (62.5%) values in the correct format and 3 null values (37.5%)
When the fix_domain parameter is set to True, clean_email() will try to correct invalid domains.
[5]:
clean_email(df, "email", fix_domain=True)
email Cleaning Report:
    2 values cleaned (25.0%)
    3 values unable to be parsed (37.5%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
When errors="ignore", invalid emails are left unchanged in the output.
[6]:
clean_email(df, "email", errors="ignore")
email Cleaning Report:
    1 values cleaned (12.5%)
    3 values unable to be parsed (37.5%), left unchanged
Result contains 3 (37.5%) values in the correct format and 2 null values (25.0%)
The function validate_email() returns True if an email address is valid and False otherwise. It can be applied to a single string or to a column of email addresses.
[7]:
from dataprep.clean import validate_email

print(validate_email('Abc.example.com'))
print(validate_email('prettyandsimple@example.com'))
print(validate_email('disposable.style.email.with+symbol@example.com'))
# Raw string so that "\a" stays a literal backslash followed by "a"
# instead of being read as the bell escape character.
print(validate_email(r'this is"not\allowed@example.com'))

False
True
True
False
[8]:
validate_email(df["email"])
0     True
1     True
2    False
3     True
4    False
5    False
6    False
7    False
Name: email, dtype: bool
Note that validate_email() performs a strict semantic check by default.