Phone Numbers¶

Introduction¶

The function clean_phone() cleans and standardizes a DataFrame column containing phone numbers. The function validate_phone() validates either a single phone number or a column of phone numbers, returning True if the value is valid, and False otherwise.

Currently, Canadian/US phone numbers having the following format are supported as valid input:

Country code of “1” (optional)
Three-digit area code (optional)
Three-digit central office code
Four-digit station code
Extension number preceded by “#”, “x”, “ext”, or “extension” (optional)

A combination of numbers and uppercase letters is allowed within the central office code and the station code.

Various delimiters between the digits are also allowed, such as spaces, hyphens, periods, brackets, and/or forward slashes.

Phone numbers can be converted to the following formats via the output_format parameter:

North American Numbering Plan (nanp): NPA-NXX-XXXX
E.164 (e164): +1NPANXXXXXX
national: (NPA) NXX-XXXX

Invalid parsing is handled with the errors parameter:

“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception

After cleaning, a report is printed that provides the following information:

How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.

The following sections demonstrate the functionality of clean_phone() and validate_phone().

An example dataset containing phone numbers¶

[1]:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    "phone": [
        "555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
        15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234",
        "2345678901 extension 1234", "2345678", "800-299-JUNK", "1-866-4ZIPCAR",
        "123 ABC COMPANY", "+66 91 889 8948", "hello", np.nan, "NULL"
    ]
})
df

[1]:

	phone
0	555-234-5678
1	(555) 234-5678
2	555.234.5678
3	555/234/5678
4	15551234567
5	(1) 555-234-5678
6	+1 (234) 567-8901 x. 1234
7	2345678901 extension 1234
8	2345678
9	800-299-JUNK
10	1-866-4ZIPCAR
11	123 ABC COMPANY
12	+66 91 889 8948
13	hello
14	NaN
15	NULL

1. Default `clean_phone()`¶

By default, the output_format parameter is set to “nanp” (NPA-NXX-XXXX) and the errors parameter is set to “coerce” (set to NaN when parsing is invalid).

[2]:

from dataprep.clean import clean_phone
clean_phone(df, "phone")

Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[2]:

	phone	phone_clean
0	555-234-5678	555-234-5678
1	(555) 234-5678	555-234-5678
2	555.234.5678	555-234-5678
3	555/234/5678	555-234-5678
4	15551234567	555-123-4567
5	(1) 555-234-5678	555-234-5678
6	+1 (234) 567-8901 x. 1234	234-567-8901 ext. 1234
7	2345678901 extension 1234	234-567-8901 ext. 1234
8	2345678	234-5678
9	800-299-JUNK	800-299-5865
10	1-866-4ZIPCAR	866-494-7227
11	123 ABC COMPANY	NaN
12	+66 91 889 8948	NaN
13	hello	NaN
14	NaN	NaN
15	NULL	NaN

Note that “555-234-5678” is considered not cleaned in the report since its resulting format is the same as the input. Also, “+66 91 889 8948” is invalid because it is not a Canadian or US phone number.

The letters in “800-299-JUNK” and “1-866-4ZIPCAR” are automatically converted to their number equivalents on a telephone keypad.

2. Output formats¶

This section demonstrates the supported phone number formats.

E.164 (e164)¶

[3]:

clean_phone(df, "phone", output_format="e164")

Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[3]:

	phone	phone_clean
0	555-234-5678	+15552345678
1	(555) 234-5678	+15552345678
2	555.234.5678	+15552345678
3	555/234/5678	+15552345678
4	15551234567	+15551234567
5	(1) 555-234-5678	+15552345678
6	+1 (234) 567-8901 x. 1234	+12345678901 ext. 1234
7	2345678901 extension 1234	+12345678901 ext. 1234
8	2345678	2345678
9	800-299-JUNK	+18002995865
10	1-866-4ZIPCAR	+18664947227
11	123 ABC COMPANY	NaN
12	+66 91 889 8948	NaN
13	hello	NaN
14	NaN	NaN
15	NULL	NaN

Note that the country code “+1” is not added to “2345678” as this would result in an invalid Canadian or US phone number.

national¶

[4]:

clean_phone(df, "phone", output_format="national")

Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[4]:

	phone	phone_clean
0	555-234-5678	(555) 234-5678
1	(555) 234-5678	(555) 234-5678
2	555.234.5678	(555) 234-5678
3	555/234/5678	(555) 234-5678
4	15551234567	(555) 123-4567
5	(1) 555-234-5678	(555) 234-5678
6	+1 (234) 567-8901 x. 1234	(234) 567-8901 ext. 1234
7	2345678901 extension 1234	(234) 567-8901 ext. 1234
8	2345678	234-5678
9	800-299-JUNK	(800) 299-5865
10	1-866-4ZIPCAR	(866) 494-7227
11	123 ABC COMPANY	NaN
12	+66 91 889 8948	NaN
13	hello	NaN
14	NaN	NaN
15	NULL	NaN

3. `split` parameter¶

The split parameter adds individual columns containing the cleaned phone number values to the given DataFrame.

[5]:

clean_phone(df, "phone", split=True)

Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[5]:

	phone	country_code	area_code	office_code	station_code	ext_num
0	555-234-5678	NaN	555	234	5678	NaN
1	(555) 234-5678	NaN	555	234	5678	NaN
2	555.234.5678	NaN	555	234	5678	NaN
3	555/234/5678	NaN	555	234	5678	NaN
4	15551234567	1	555	123	4567	NaN
5	(1) 555-234-5678	1	555	234	5678	NaN
6	+1 (234) 567-8901 x. 1234	1	234	567	8901	1234
7	2345678901 extension 1234	NaN	234	567	8901	1234
8	2345678	NaN	NaN	234	5678	NaN
9	800-299-JUNK	NaN	800	299	5865	NaN
10	1-866-4ZIPCAR	1	866	494	7227	NaN
11	123 ABC COMPANY	NaN	NaN	NaN	NaN	NaN
12	+66 91 889 8948	NaN	NaN	NaN	NaN	NaN
13	hello	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN
15	NULL	NaN	NaN	NaN	NaN	NaN

4. `fix_missing` parameter¶

By default, the fix_missing parameter is set to “empty” (leave the missing country code as is). If set to “auto”, the country code is set to “1”.

`split` and `fix_missing`¶

[6]:

clean_phone(df, "phone", split=True, fix_missing="auto")

Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[6]:

	phone	country_code	area_code	office_code	station_code	ext_num
0	555-234-5678	1	555	234	5678	NaN
1	(555) 234-5678	1	555	234	5678	NaN
2	555.234.5678	1	555	234	5678	NaN
3	555/234/5678	1	555	234	5678	NaN
4	15551234567	1	555	123	4567	NaN
5	(1) 555-234-5678	1	555	234	5678	NaN
6	+1 (234) 567-8901 x. 1234	1	234	567	8901	1234
7	2345678901 extension 1234	1	234	567	8901	1234
8	2345678	NaN	NaN	234	5678	NaN
9	800-299-JUNK	1	800	299	5865	NaN
10	1-866-4ZIPCAR	1	866	494	7227	NaN
11	123 ABC COMPANY	NaN	NaN	NaN	NaN	NaN
12	+66 91 889 8948	NaN	NaN	NaN	NaN	NaN
13	hello	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN
15	NULL	NaN	NaN	NaN	NaN	NaN

Again, note that the country code is not set to “1” for “2345678” as this would result in an invalid Canadian or US phone number.

5. `inplace` parameter¶

This deletes the given column from the returned DataFrame. A new column containing cleaned phone numbers is added with a title in the format "{original title}_clean".

[7]:

clean_phone(df, "phone", inplace=True)

Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[7]:

	phone_clean
0	555-234-5678
1	555-234-5678
2	555-234-5678
3	555-234-5678
4	555-123-4567
5	555-234-5678
6	234-567-8901 ext. 1234
7	234-567-8901 ext. 1234
8	234-5678
9	800-299-5865
10	866-494-7227
11	NaN
12	NaN
13	NaN
14	NaN
15	NaN

`inplace` and `split`¶

[8]:

clean_phone(df, "phone", split=True, inplace=True)

Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[8]:

	country_code	area_code	office_code	station_code	ext_num
0	NaN	555	234	5678	NaN
1	NaN	555	234	5678	NaN
2	NaN	555	234	5678	NaN
3	NaN	555	234	5678	NaN
4	1	555	123	4567	NaN
5	1	555	234	5678	NaN
6	1	234	567	8901	1234
7	NaN	234	567	8901	1234
8	NaN	NaN	234	5678	NaN
9	NaN	800	299	5865	NaN
10	1	866	494	7227	NaN
11	NaN	NaN	NaN	NaN	NaN
12	NaN	NaN	NaN	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN

`inplace`, `split` and `fix_missing`¶

[9]:

clean_phone(df, "phone", split=True, inplace=True, fix_missing="auto")

Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)

[9]:

	country_code	area_code	office_code	station_code	ext_num
0	1	555	234	5678	NaN
1	1	555	234	5678	NaN
2	1	555	234	5678	NaN
3	1	555	234	5678	NaN
4	1	555	123	4567	NaN
5	1	555	234	5678	NaN
6	1	234	567	8901	1234
7	1	234	567	8901	1234
8	NaN	NaN	234	5678	NaN
9	1	800	299	5865	NaN
10	1	866	494	7227	NaN
11	NaN	NaN	NaN	NaN	NaN
12	NaN	NaN	NaN	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN

6. `validate_phone()`¶

validate_phone() returns True when the input is a valid phone number. Otherwise it returns False. Valid types are the same as clean_phone().

[10]:

from dataprep.clean import validate_phone
print(validate_phone(1234))
print(validate_phone(2346789))
print(validate_phone("1 800 234 6789"))
print(validate_phone("+44 7700 900077"))
print(validate_phone("555-234-6789 ext 32"))
print(validate_phone("1-866-4ZIPCAR"))
print(validate_phone("123 ABC COMPANY"))

False
True
True
False
True
True
False

[11]:

df = pd.DataFrame({
    "phone": [
        "555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
        15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234",
        "2345678901 extension 1234", "2345678", "800-299-JUNK", "1-866-4ZIPCAR",
        "123 ABC COMPANY", "+66 91 889 8948", "hello", np.nan, "NULL"
    ]
})
df["valid"] = validate_phone(df["phone"])
df

[11]:

	phone	valid
0	555-234-5678	True
1	(555) 234-5678	True
2	555.234.5678	True
3	555/234/5678	True
4	15551234567	True
5	(1) 555-234-5678	True
6	+1 (234) 567-8901 x. 1234	True
7	2345678901 extension 1234	True
8	2345678	True
9	800-299-JUNK	True
10	1-866-4ZIPCAR	True
11	123 ABC COMPANY	False
12	+66 91 889 8948	False
13	hello	False
14	NaN	False
15	NULL	False

Phone Numbers¶

Introduction¶

An example dataset containing phone numbers¶

1. Default clean_phone()¶

2. Output formats¶

E.164 (e164)¶

national¶

3. split parameter¶

4. fix_missing parameter¶

split and fix_missing¶

5. inplace parameter¶

inplace and split¶

inplace, split and fix_missing¶

6. validate_phone()¶