Phone Numbers

Introduction

The function clean_phone() cleans and standardizes a DataFrame column containing phone numbers. The function validate_phone() validates either a single phone number or a column of phone numbers, returning True if the value is valid, and False otherwise.

Currently, Canadian/US phone numbers having the following format are supported as valid input:

  • Country code of “1” (optional)

  • Three-digit area code (optional)

  • Three-digit central office code

  • Four-digit station code

  • Extension number preceded by “#”, “x”, “ext”, or “extension” (optional)

A combination of numbers and uppercase letters is allowed within the central office code and the station code.

Various delimiters between the digits are also allowed, such as spaces, hyphens, periods, brackets, and/or forward slashes.

Phone numbers can be converted to the following formats via the output_format parameter:

  • North American Numbering Plan (nanp): NPA-NXX-XXXX

  • E.164 (e164): +1NPANXXXXXX

  • national: (NPA) NXX-XXXX

Invalid parsing is handled with the errors parameter:

  • “coerce” (default): invalid parsing will be set to NaN

  • “ignore”: invalid parsing will return the input

  • “raise”: invalid parsing will raise an exception

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (the value must have been transformed).

  • How many values could not be parsed.

  • A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.

The following sections demonstrate the functionality of clean_phone() and validate_phone().

An example dataset containing phone numbers

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "phone": [
        "555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
        15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234",
        "2345678901 extension 1234", "2345678", "800-299-JUNK", "1-866-4ZIPCAR",
        "123 ABC COMPANY", "+66 91 889 8948", "hello", np.nan, "NULL"
    ]
})
df
[1]:
phone
0 555-234-5678
1 (555) 234-5678
2 555.234.5678
3 555/234/5678
4 15551234567
5 (1) 555-234-5678
6 +1 (234) 567-8901 x. 1234
7 2345678901 extension 1234
8 2345678
9 800-299-JUNK
10 1-866-4ZIPCAR
11 123 ABC COMPANY
12 +66 91 889 8948
13 hello
14 NaN
15 NULL

1. Default clean_phone()

By default, the output_format parameter is set to “nanp” (NPA-NXX-XXXX) and the errors parameter is set to “coerce” (set to NaN when parsing is invalid).

[2]:
from dataprep.clean import clean_phone
clean_phone(df, "phone")
Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[2]:
phone phone_clean
0 555-234-5678 555-234-5678
1 (555) 234-5678 555-234-5678
2 555.234.5678 555-234-5678
3 555/234/5678 555-234-5678
4 15551234567 555-123-4567
5 (1) 555-234-5678 555-234-5678
6 +1 (234) 567-8901 x. 1234 234-567-8901 ext. 1234
7 2345678901 extension 1234 234-567-8901 ext. 1234
8 2345678 234-5678
9 800-299-JUNK 800-299-5865
10 1-866-4ZIPCAR 866-494-7227
11 123 ABC COMPANY NaN
12 +66 91 889 8948 NaN
13 hello NaN
14 NaN NaN
15 NULL NaN

Note that “555-234-5678” is considered not cleaned in the report since its resulting format is the same as the input. Also, “+66 91 889 8948” is invalid because it is not a Canadian or US phone number.

The letters in “800-299-JUNK” and “1-866-4ZIPCAR” are automatically converted to their number equivalents on a telephone keypad.

2. Output formats

This section demonstrates the supported phone number formats.

E.164 (e164)

[3]:
clean_phone(df, "phone", output_format="e164")
Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[3]:
phone phone_clean
0 555-234-5678 +15552345678
1 (555) 234-5678 +15552345678
2 555.234.5678 +15552345678
3 555/234/5678 +15552345678
4 15551234567 +15551234567
5 (1) 555-234-5678 +15552345678
6 +1 (234) 567-8901 x. 1234 +12345678901 ext. 1234
7 2345678901 extension 1234 +12345678901 ext. 1234
8 2345678 2345678
9 800-299-JUNK +18002995865
10 1-866-4ZIPCAR +18664947227
11 123 ABC COMPANY NaN
12 +66 91 889 8948 NaN
13 hello NaN
14 NaN NaN
15 NULL NaN

Note that the country code “+1” is not added to “2345678” as this would result in an invalid Canadian or US phone number.

national

[4]:
clean_phone(df, "phone", output_format="national")
Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[4]:
phone phone_clean
0 555-234-5678 (555) 234-5678
1 (555) 234-5678 (555) 234-5678
2 555.234.5678 (555) 234-5678
3 555/234/5678 (555) 234-5678
4 15551234567 (555) 123-4567
5 (1) 555-234-5678 (555) 234-5678
6 +1 (234) 567-8901 x. 1234 (234) 567-8901 ext. 1234
7 2345678901 extension 1234 (234) 567-8901 ext. 1234
8 2345678 234-5678
9 800-299-JUNK (800) 299-5865
10 1-866-4ZIPCAR (866) 494-7227
11 123 ABC COMPANY NaN
12 +66 91 889 8948 NaN
13 hello NaN
14 NaN NaN
15 NULL NaN

3. split parameter

The split parameter adds individual columns containing the cleaned phone number values to the given DataFrame.

[5]:
clean_phone(df, "phone", split=True)
Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[5]:
phone country_code area_code office_code station_code ext_num
0 555-234-5678 NaN 555 234 5678 NaN
1 (555) 234-5678 NaN 555 234 5678 NaN
2 555.234.5678 NaN 555 234 5678 NaN
3 555/234/5678 NaN 555 234 5678 NaN
4 15551234567 1 555 123 4567 NaN
5 (1) 555-234-5678 1 555 234 5678 NaN
6 +1 (234) 567-8901 x. 1234 1 234 567 8901 1234
7 2345678901 extension 1234 NaN 234 567 8901 1234
8 2345678 NaN NaN 234 5678 NaN
9 800-299-JUNK NaN 800 299 5865 NaN
10 1-866-4ZIPCAR 1 866 494 7227 NaN
11 123 ABC COMPANY NaN NaN NaN NaN NaN
12 +66 91 889 8948 NaN NaN NaN NaN NaN
13 hello NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN
15 NULL NaN NaN NaN NaN NaN

4. fix_missing parameter

By default, the fix_missing parameter is set to “empty” (leave the missing country code as is). If set to “auto”, the country code is set to “1”.

split and fix_missing

[6]:
clean_phone(df, "phone", split=True, fix_missing="auto")
Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[6]:
phone country_code area_code office_code station_code ext_num
0 555-234-5678 1 555 234 5678 NaN
1 (555) 234-5678 1 555 234 5678 NaN
2 555.234.5678 1 555 234 5678 NaN
3 555/234/5678 1 555 234 5678 NaN
4 15551234567 1 555 123 4567 NaN
5 (1) 555-234-5678 1 555 234 5678 NaN
6 +1 (234) 567-8901 x. 1234 1 234 567 8901 1234
7 2345678901 extension 1234 1 234 567 8901 1234
8 2345678 NaN NaN 234 5678 NaN
9 800-299-JUNK 1 800 299 5865 NaN
10 1-866-4ZIPCAR 1 866 494 7227 NaN
11 123 ABC COMPANY NaN NaN NaN NaN NaN
12 +66 91 889 8948 NaN NaN NaN NaN NaN
13 hello NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN
15 NULL NaN NaN NaN NaN NaN

Again, note that the country code is not set to “1” for “2345678” as this would result in an invalid Canadian or US phone number.

5. inplace parameter

This deletes the given column from the returned DataFrame. A new column containing cleaned phone numbers is added with a title in the format "{original title}_clean".

[7]:
clean_phone(df, "phone", inplace=True)
Phone Number Cleaning Report:
        10 values cleaned (62.5%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[7]:
phone_clean
0 555-234-5678
1 555-234-5678
2 555-234-5678
3 555-234-5678
4 555-123-4567
5 555-234-5678
6 234-567-8901 ext. 1234
7 234-567-8901 ext. 1234
8 234-5678
9 800-299-5865
10 866-494-7227
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN

inplace and split

[8]:
clean_phone(df, "phone", split=True, inplace=True)
Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[8]:
country_code area_code office_code station_code ext_num
0 NaN 555 234 5678 NaN
1 NaN 555 234 5678 NaN
2 NaN 555 234 5678 NaN
3 NaN 555 234 5678 NaN
4 1 555 123 4567 NaN
5 1 555 234 5678 NaN
6 1 234 567 8901 1234
7 NaN 234 567 8901 1234
8 NaN NaN 234 5678 NaN
9 NaN 800 299 5865 NaN
10 1 866 494 7227 NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN

inplace, split and fix_missing

[9]:
clean_phone(df, "phone", split=True, inplace=True, fix_missing="auto")
Phone Number Cleaning Report:
        11 values cleaned (68.75%)
        3 values unable to be parsed (18.75%), set to NaN
Result contains 11 (68.75%) values in the correct format and 5 null values (31.25%)
[9]:
country_code area_code office_code station_code ext_num
0 1 555 234 5678 NaN
1 1 555 234 5678 NaN
2 1 555 234 5678 NaN
3 1 555 234 5678 NaN
4 1 555 123 4567 NaN
5 1 555 234 5678 NaN
6 1 234 567 8901 1234
7 1 234 567 8901 1234
8 NaN NaN 234 5678 NaN
9 1 800 299 5865 NaN
10 1 866 494 7227 NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN

6. validate_phone()

validate_phone() returns True when the input is a valid phone number. Otherwise it returns False. Valid types are the same as clean_phone().

[10]:
from dataprep.clean import validate_phone
print(validate_phone(1234))
print(validate_phone(2346789))
print(validate_phone("1 800 234 6789"))
print(validate_phone("+44 7700 900077"))
print(validate_phone("555-234-6789 ext 32"))
print(validate_phone("1-866-4ZIPCAR"))
print(validate_phone("123 ABC COMPANY"))
False
True
True
False
True
True
False
[11]:
df = pd.DataFrame({
    "phone": [
        "555-234-5678", "(555) 234-5678", "555.234.5678", "555/234/5678",
        15551234567, "(1) 555-234-5678", "+1 (234) 567-8901 x. 1234",
        "2345678901 extension 1234", "2345678", "800-299-JUNK", "1-866-4ZIPCAR",
        "123 ABC COMPANY", "+66 91 889 8948", "hello", np.nan, "NULL"
    ]
})
df["valid"] = validate_phone(df["phone"])
df
[11]:
phone valid
0 555-234-5678 True
1 (555) 234-5678 True
2 555.234.5678 True
3 555/234/5678 True
4 15551234567 True
5 (1) 555-234-5678 True
6 +1 (234) 567-8901 x. 1234 True
7 2345678901 extension 1234 True
8 2345678 True
9 800-299-JUNK True
10 1-866-4ZIPCAR True
11 123 ABC COMPANY False
12 +66 91 889 8948 False
13 hello False
14 NaN False
15 NULL False