The function clean_phone() cleans and standardizes a DataFrame column containing phone numbers. The function validate_phone() validates either a single phone number or a column of phone numbers, returning True if the value is valid, and False otherwise.
Currently, Canadian/US phone numbers having the following format are supported as valid input:
Country code of “1” (optional)
Three-digit area code (optional)
Three-digit central office code
Four-digit station code
Extension number preceded by “#”, “x”, “ext”, or “extension” (optional)
A combination of numbers and uppercase letters is allowed within the central office code and the station code.
Various delimiters between the digits are also allowed, such as spaces, hyphens, periods, brackets, and/or forward slashes.
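The input rules above can be approximated with a small stand-alone sketch. This is not the parser used by clean_phone(); the helper name split_phone and its regular expressions are hypothetical, and real inputs may need stricter checks (for example on which digits may start an area code).

```python
import re

# Hypothetical helper for illustration only; not part of dataprep.
# Extension: "#", "x", "ext", or "extension", an optional period, then digits.
_EXT = re.compile(r"(?:#|x|ext|extension)\.?\s*(\d+)\s*$")
# Delimiters allowed between digits: spaces, hyphens, periods, brackets, slashes, "+".
_DELIMS = re.compile(r"[\s\-.()/+]")


def split_phone(value):
    """Return (country, area, office, station, ext), or None if the shape does not match."""
    text = str(value).strip()
    ext = None
    match = _EXT.search(text)
    if match:
        ext = match.group(1)
        text = text[:match.start()]
    body = _DELIMS.sub("", text)
    # Only digits and uppercase letters may remain.
    if not re.fullmatch(r"[0-9A-Z]+", body):
        return None
    if len(body) == 11 and body[0] == "1":
        return ("1", body[1:4], body[4:7], body[7:], ext)
    if len(body) == 10:
        return (None, body[:3], body[3:6], body[6:], ext)
    if len(body) == 7:
        return (None, None, body[:3], body[3:], ext)
    return None
```

For simplicity, the sketch permits letters anywhere in the number rather than only in the central office and station codes.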
Phone numbers can be converted to the following formats via the output_format parameter:
North American Numbering Plan (nanp): NPA-NXX-XXXX
E.164 (e164): +1NPANXXXXXX
national: (NPA) NXX-XXXX
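As a rough illustration of the three layouts, the sketch below formats an already-parsed ten-digit number. The function format_phone is hypothetical and is not the dataprep implementation; it ignores extensions and numbers without an area code.

```python
def format_phone(area, office, station, output_format="nanp"):
    """Format the parts of a ten-digit number (illustrative helper, not dataprep's)."""
    if output_format == "nanp":
        return f"{area}-{office}-{station}"      # NPA-NXX-XXXX
    if output_format == "e164":
        return f"+1{area}{office}{station}"      # +1NPANXXXXXX
    if output_format == "national":
        return f"({area}) {office}-{station}"    # (NPA) NXX-XXXX
    raise ValueError(f"unknown output_format: {output_format!r}")


for fmt in ("nanp", "e164", "national"):
    print(fmt, format_phone("234", "567", "8901", fmt))
```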
Invalid parsing is handled with the errors parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
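The three error modes can be pictured with the following sketch. Both parse_or_handle and the toy parser ten_digits are hypothetical names used only for illustration; clean_phone() applies its own parsing rules internally.

```python
import math


def parse_or_handle(value, parser, errors="coerce"):
    """Apply parser to value and handle a parsing failure according to errors."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors option: {errors!r}")
    try:
        return parser(value)
    except ValueError:
        if errors == "raise":
            raise                      # propagate the parsing error
        if errors == "coerce":
            return math.nan            # replace the invalid value with NaN
        return value                   # "ignore": return the input unchanged


def ten_digits(value):
    """Toy parser (assumption): accept only strings of exactly ten digits."""
    text = str(value)
    if len(text) == 10 and text.isdigit():
        return text
    raise ValueError(f"unable to parse {value!r}")
```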
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
The following sections demonstrate the functionality of clean_phone() and validate_phone().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "phone": [
        "555-234-5678",
        "(555) 234-5678",
        "555.234.5678",
        "555/234/5678",
        15551234567,
        "(1) 555-234-5678",
        "+1 (234) 567-8901 x. 1234",
        "2345678901 extension 1234",
        "2345678",
        "800-299-JUNK",
        "1-866-4ZIPCAR",
        "123 ABC COMPANY",
        "+66 91 889 8948",
        "hello",
        np.nan,
        "NULL"
    ]
})
df
By default, the output_format parameter is set to “nanp” (NPA-NXX-XXXX) and the errors parameter is set to “coerce” (set to NaN when parsing is invalid).
[2]:
from dataprep.clean import clean_phone
clean_phone(df, "phone")
Phone Number Cleaning Report:
    10 values cleaned (62.5%)
    3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
Note that “555-234-5678” is considered not cleaned in the report since its resulting format is the same as the input. Also, “+66 91 889 8948” is invalid because it is not a Canadian or US phone number.
The letters in “800-299-JUNK” and “1-866-4ZIPCAR” are automatically converted to their number equivalents on a telephone keypad.
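The letter-to-digit conversion follows the standard telephone keypad layout. A minimal sketch, assuming only uppercase letters need converting (the helper name letters_to_digits is hypothetical and not part of dataprep):

```python
# Standard telephone keypad groups.
KEYPAD = {
    "ABC": "2", "DEF": "3", "GHI": "4", "JKL": "5",
    "MNO": "6", "PQRS": "7", "TUV": "8", "WXYZ": "9",
}

# Translation table mapping each uppercase letter to its keypad digit.
LETTER_TO_DIGIT = str.maketrans(
    {letter: digit for letters, digit in KEYPAD.items() for letter in letters}
)


def letters_to_digits(text):
    """Replace uppercase letters with keypad digits; other characters are kept."""
    return text.translate(LETTER_TO_DIGIT)


print(letters_to_digits("800-299-JUNK"))   # 800-299-5865
print(letters_to_digits("1-866-4ZIPCAR"))  # 1-866-4947227
```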
This section demonstrates the supported phone number formats.
[3]:
clean_phone(df, "phone", output_format="e164")
Note that the country code “+1” is not added to “2345678” as this would result in an invalid Canadian or US phone number.
[4]:
clean_phone(df, "phone", output_format="national")
split
The split parameter adds individual columns containing the cleaned phone number values to the given DataFrame.
[5]:
clean_phone(df, "phone", split=True)
Phone Number Cleaning Report:
    11 values cleaned (68.75%)
    3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
fix_missing
By default, the fix_missing parameter is set to “empty” (leave the missing country code as is). If set to “auto”, the country code is set to “1”.
[6]:
clean_phone(df, "phone", split=True, fix_missing="auto")
Again, note that the country code is not set to “1” for “2345678” as this would result in an invalid Canadian or US phone number.
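The rule described here can be summarized in a short sketch: the country code is filled in only when it is missing, fix_missing is “auto”, and an area code is present. The function fill_country_code is a hypothetical illustration, not the dataprep implementation.

```python
def fill_country_code(country, area, fix_missing="empty"):
    """Return the country code to store for a parsed number (illustrative only)."""
    if country is None and area is not None and fix_missing == "auto":
        return "1"
    # "empty": leave a missing country code as is. A seven-digit number
    # (no area code) is also left alone, since adding "1" would be invalid.
    return country


print(fill_country_code(None, "555", fix_missing="auto"))  # 1
print(fill_country_code(None, None, fix_missing="auto"))   # None
```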
inplace
Setting the inplace parameter to True deletes the given column from the returned DataFrame. A new column containing the cleaned phone numbers is added, named in the format "{original title}_clean".
[7]:
clean_phone(df, "phone", inplace=True)
[8]:
clean_phone(df, "phone", split=True, inplace=True)
[9]:
clean_phone(df, "phone", split=True, inplace=True, fix_missing="auto")
validate_phone() returns True when the input is a valid phone number and False otherwise. The formats accepted as valid are the same as for clean_phone().
[10]:
from dataprep.clean import validate_phone

print(validate_phone(1234))
print(validate_phone(2346789))
print(validate_phone("1 800 234 6789"))
print(validate_phone("+44 7700 900077"))
print(validate_phone("555-234-6789 ext 32"))
print(validate_phone("1-866-4ZIPCAR"))
print(validate_phone("123 ABC COMPANY"))
False
True
True
False
True
True
False
[11]:
df = pd.DataFrame({
    "phone": [
        "555-234-5678",
        "(555) 234-5678",
        "555.234.5678",
        "555/234/5678",
        15551234567,
        "(1) 555-234-5678",
        "+1 (234) 567-8901 x. 1234",
        "2345678901 extension 1234",
        "2345678",
        "800-299-JUNK",
        "1-866-4ZIPCAR",
        "123 ABC COMPANY",
        "+66 91 889 8948",
        "hello",
        np.nan,
        "NULL"
    ]
})
df["valid"] = validate_phone(df["phone"])
df