Phone Numbers¶
Introduction¶
The function clean_phone()
cleans and standardizes a DataFrame column containing phone numbers. The function validate_phone()
validates either a single phone number or a column of phone numbers, returning True if the value is valid, and False otherwise.
Currently, Canadian/US phone numbers having the following format are supported as valid input:
Country code of “1” (optional)
Three-digit area code (optional)
Three-digit central office code
Four-digit station code
Extension number preceded by “#”, “x”, “ext”, or “extension” (optional)
A combination of numbers and uppercase letters is allowed within the central office code and the station code.
Various delimiters between the digits are also allowed, such as spaces, hyphens, periods, brackets, and/or forward slashes.
Phone numbers can be converted to the following formats via the output_format
parameter:
North American Numbering Plan (nanp): NPA-NXX-XXXX
E.164 (e164): +1NPANXXXXXX
national: (NPA) NXX-XXXX
Invalid parsing is handled with the errors
parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
The following sections demonstrate the functionality of clean_phone()
and validate_phone()
.
An example dataset containing phone numbers¶
[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"phone": [
"555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234",
"2345678901 extension 1234", "2345678", "800-299-JUNK", "1-866-4ZIPCAR",
"123 ABC COMPANY", "+66 91 889 8948", "hello", np.nan, "NULL"
]
})
df
[1]:
phone | |
---|---|
0 | 555-234-5678 |
1 | (555) 234-5678 |
2 | 555.234.5678 |
3 | 555/234/5678 |
4 | 15551234567 |
5 | (1) 555-234-5678 |
6 | +1 (234) 567-8901 x. 1234 |
7 | 2345678901 extension 1234 |
8 | 2345678 |
9 | 800-299-JUNK |
10 | 1-866-4ZIPCAR |
11 | 123 ABC COMPANY |
12 | +66 91 889 8948 |
13 | hello |
14 | NaN |
15 | NULL |
1. Default clean_phone()
¶
By default, the output_format
parameter is set to “nanp” (NPA-NXX-XXXX) and the errors
parameter is set to “coerce” (set to NaN when parsing is invalid).
[2]:
from dataprep.clean import clean_phone
clean_phone(df, "phone")
Phone Number Cleaning Report:
10 values cleaned (62.5%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[2]:
phone | phone_clean | |
---|---|---|
0 | 555-234-5678 | 555-234-5678 |
1 | (555) 234-5678 | 555-234-5678 |
2 | 555.234.5678 | 555-234-5678 |
3 | 555/234/5678 | 555-234-5678 |
4 | 15551234567 | 555-123-4567 |
5 | (1) 555-234-5678 | 555-234-5678 |
6 | +1 (234) 567-8901 x. 1234 | 234-567-8901 ext. 1234 |
7 | 2345678901 extension 1234 | 234-567-8901 ext. 1234 |
8 | 2345678 | 234-5678 |
9 | 800-299-JUNK | 800-299-5865 |
10 | 1-866-4ZIPCAR | 866-494-7227 |
11 | 123 ABC COMPANY | NaN |
12 | +66 91 889 8948 | NaN |
13 | hello | NaN |
14 | NaN | NaN |
15 | NULL | NaN |
Note that “555-234-5678” is considered not cleaned in the report since its resulting format is the same as the input. Also, “+66 91 889 8948” is invalid because it is not a Canadian or US phone number.
The letters in “800-299-JUNK” and “1-866-4ZIPCAR” are automatically converted to their number equivalents on a telephone keypad.
2. Output formats¶
This section demonstrates the supported phone number formats.
E.164 (e164)¶
[3]:
clean_phone(df, "phone", output_format="e164")
Phone Number Cleaning Report:
10 values cleaned (62.5%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[3]:
phone | phone_clean | |
---|---|---|
0 | 555-234-5678 | +15552345678 |
1 | (555) 234-5678 | +15552345678 |
2 | 555.234.5678 | +15552345678 |
3 | 555/234/5678 | +15552345678 |
4 | 15551234567 | +15551234567 |
5 | (1) 555-234-5678 | +15552345678 |
6 | +1 (234) 567-8901 x. 1234 | +12345678901 ext. 1234 |
7 | 2345678901 extension 1234 | +12345678901 ext. 1234 |
8 | 2345678 | 2345678 |
9 | 800-299-JUNK | +18002995865 |
10 | 1-866-4ZIPCAR | +18664947227 |
11 | 123 ABC COMPANY | NaN |
12 | +66 91 889 8948 | NaN |
13 | hello | NaN |
14 | NaN | NaN |
15 | NULL | NaN |
Note that the country code “+1” is not added to “2345678” as this would result in an invalid Canadian or US phone number.
national¶
[4]:
clean_phone(df, "phone", output_format="national")
Phone Number Cleaning Report:
10 values cleaned (62.5%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[4]:
phone | phone_clean | |
---|---|---|
0 | 555-234-5678 | (555) 234-5678 |
1 | (555) 234-5678 | (555) 234-5678 |
2 | 555.234.5678 | (555) 234-5678 |
3 | 555/234/5678 | (555) 234-5678 |
4 | 15551234567 | (555) 123-4567 |
5 | (1) 555-234-5678 | (555) 234-5678 |
6 | +1 (234) 567-8901 x. 1234 | (234) 567-8901 ext. 1234 |
7 | 2345678901 extension 1234 | (234) 567-8901 ext. 1234 |
8 | 2345678 | 234-5678 |
9 | 800-299-JUNK | (800) 299-5865 |
10 | 1-866-4ZIPCAR | (866) 494-7227 |
11 | 123 ABC COMPANY | NaN |
12 | +66 91 889 8948 | NaN |
13 | hello | NaN |
14 | NaN | NaN |
15 | NULL | NaN |
3. split
parameter¶
The split
parameter adds individual columns containing the cleaned phone number values to the given DataFrame.
[5]:
clean_phone(df, "phone", split=True)
Phone Number Cleaning Report:
11 values cleaned (68.75%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[5]:
phone | country_code | area_code | office_code | station_code | ext_num | |
---|---|---|---|---|---|---|
0 | 555-234-5678 | NaN | 555 | 234 | 5678 | NaN |
1 | (555) 234-5678 | NaN | 555 | 234 | 5678 | NaN |
2 | 555.234.5678 | NaN | 555 | 234 | 5678 | NaN |
3 | 555/234/5678 | NaN | 555 | 234 | 5678 | NaN |
4 | 15551234567 | 1 | 555 | 123 | 4567 | NaN |
5 | (1) 555-234-5678 | 1 | 555 | 234 | 5678 | NaN |
6 | +1 (234) 567-8901 x. 1234 | 1 | 234 | 567 | 8901 | 1234 |
7 | 2345678901 extension 1234 | NaN | 234 | 567 | 8901 | 1234 |
8 | 2345678 | NaN | NaN | 234 | 5678 | NaN |
9 | 800-299-JUNK | NaN | 800 | 299 | 5865 | NaN |
10 | 1-866-4ZIPCAR | 1 | 866 | 494 | 7227 | NaN |
11 | 123 ABC COMPANY | NaN | NaN | NaN | NaN | NaN |
12 | +66 91 889 8948 | NaN | NaN | NaN | NaN | NaN |
13 | hello | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NULL | NaN | NaN | NaN | NaN | NaN |
4. fix_missing
parameter¶
By default, the fix_missing
parameter is set to “empty” (leave the missing country code as is). If set to “auto”, the country code is set to “1”.
split
and fix_missing
¶
[6]:
clean_phone(df, "phone", split=True, fix_missing="auto")
Phone Number Cleaning Report:
11 values cleaned (68.75%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[6]:
phone | country_code | area_code | office_code | station_code | ext_num | |
---|---|---|---|---|---|---|
0 | 555-234-5678 | 1 | 555 | 234 | 5678 | NaN |
1 | (555) 234-5678 | 1 | 555 | 234 | 5678 | NaN |
2 | 555.234.5678 | 1 | 555 | 234 | 5678 | NaN |
3 | 555/234/5678 | 1 | 555 | 234 | 5678 | NaN |
4 | 15551234567 | 1 | 555 | 123 | 4567 | NaN |
5 | (1) 555-234-5678 | 1 | 555 | 234 | 5678 | NaN |
6 | +1 (234) 567-8901 x. 1234 | 1 | 234 | 567 | 8901 | 1234 |
7 | 2345678901 extension 1234 | 1 | 234 | 567 | 8901 | 1234 |
8 | 2345678 | NaN | NaN | 234 | 5678 | NaN |
9 | 800-299-JUNK | 1 | 800 | 299 | 5865 | NaN |
10 | 1-866-4ZIPCAR | 1 | 866 | 494 | 7227 | NaN |
11 | 123 ABC COMPANY | NaN | NaN | NaN | NaN | NaN |
12 | +66 91 889 8948 | NaN | NaN | NaN | NaN | NaN |
13 | hello | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NULL | NaN | NaN | NaN | NaN | NaN |
Again, note that the country code is not set to “1” for “2345678” as this would result in an invalid Canadian or US phone number.
5. inplace
parameter¶
This deletes the given column from the returned DataFrame. A new column containing cleaned phone numbers is added with a title in the format "{original title}_clean"
.
[7]:
clean_phone(df, "phone", inplace=True)
Phone Number Cleaning Report:
10 values cleaned (62.5%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[7]:
phone_clean | |
---|---|
0 | 555-234-5678 |
1 | 555-234-5678 |
2 | 555-234-5678 |
3 | 555-234-5678 |
4 | 555-123-4567 |
5 | 555-234-5678 |
6 | 234-567-8901 ext. 1234 |
7 | 234-567-8901 ext. 1234 |
8 | 234-5678 |
9 | 800-299-5865 |
10 | 866-494-7227 |
11 | NaN |
12 | NaN |
13 | NaN |
14 | NaN |
15 | NaN |
inplace
and split
¶
[8]:
clean_phone(df, "phone", split=True, inplace=True)
Phone Number Cleaning Report:
11 values cleaned (68.75%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[8]:
country_code | area_code | office_code | station_code | ext_num | |
---|---|---|---|---|---|
0 | NaN | 555 | 234 | 5678 | NaN |
1 | NaN | 555 | 234 | 5678 | NaN |
2 | NaN | 555 | 234 | 5678 | NaN |
3 | NaN | 555 | 234 | 5678 | NaN |
4 | 1 | 555 | 123 | 4567 | NaN |
5 | 1 | 555 | 234 | 5678 | NaN |
6 | 1 | 234 | 567 | 8901 | 1234 |
7 | NaN | 234 | 567 | 8901 | 1234 |
8 | NaN | NaN | 234 | 5678 | NaN |
9 | NaN | 800 | 299 | 5865 | NaN |
10 | 1 | 866 | 494 | 7227 | NaN |
11 | NaN | NaN | NaN | NaN | NaN |
12 | NaN | NaN | NaN | NaN | NaN |
13 | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN |
15 | NaN | NaN | NaN | NaN | NaN |
inplace
, split
and fix_missing
¶
[9]:
clean_phone(df, "phone", split=True, inplace=True, fix_missing="auto")
Phone Number Cleaning Report:
11 values cleaned (68.75%)
3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[9]:
country_code | area_code | office_code | station_code | ext_num | |
---|---|---|---|---|---|
0 | 1 | 555 | 234 | 5678 | NaN |
1 | 1 | 555 | 234 | 5678 | NaN |
2 | 1 | 555 | 234 | 5678 | NaN |
3 | 1 | 555 | 234 | 5678 | NaN |
4 | 1 | 555 | 123 | 4567 | NaN |
5 | 1 | 555 | 234 | 5678 | NaN |
6 | 1 | 234 | 567 | 8901 | 1234 |
7 | 1 | 234 | 567 | 8901 | 1234 |
8 | NaN | NaN | 234 | 5678 | NaN |
9 | 1 | 800 | 299 | 5865 | NaN |
10 | 1 | 866 | 494 | 7227 | NaN |
11 | NaN | NaN | NaN | NaN | NaN |
12 | NaN | NaN | NaN | NaN | NaN |
13 | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN |
15 | NaN | NaN | NaN | NaN | NaN |
6. validate_phone()
¶
validate_phone()
returns True when the input is a valid phone number. Otherwise it returns False. Valid types are the same as clean_phone()
.
[10]:
from dataprep.clean import validate_phone
print(validate_phone(1234))
print(validate_phone(2346789))
print(validate_phone("1 800 234 6789"))
print(validate_phone("+44 7700 900077"))
print(validate_phone("555-234-6789 ext 32"))
print(validate_phone("1-866-4ZIPCAR"))
print(validate_phone("123 ABC COMPANY"))
False
True
True
False
True
True
False
[11]:
df = pd.DataFrame({
"phone": [
"555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234",
"2345678901 extension 1234", "2345678", "800-299-JUNK", "1-866-4ZIPCAR",
"123 ABC COMPANY", "+66 91 889 8948", "hello", np.nan, "NULL"
]
})
df["valid"] = validate_phone(df["phone"])
df
[11]:
phone | valid | |
---|---|---|
0 | 555-234-5678 | True |
1 | (555) 234-5678 | True |
2 | 555.234.5678 | True |
3 | 555/234/5678 | True |
4 | 15551234567 | True |
5 | (1) 555-234-5678 | True |
6 | +1 (234) 567-8901 x. 1234 | True |
7 | 2345678901 extension 1234 | True |
8 | 2345678 | True |
9 | 800-299-JUNK | True |
10 | 1-866-4ZIPCAR | True |
11 | 123 ABC COMPANY | False |
12 | +66 91 889 8948 | False |
13 | hello | False |
14 | NaN | False |
15 | NULL | False |