US Street Addresses

Introduction

The function clean_address() cleans a column containing United States street addresses and standardizes them in a desired format. The function validate_address() validates either a single address or a column of addresses, returning True if the value is valid, and False otherwise. Address parsing is done using the usaddress library.

Addresses can be converted to a specific format via the output_format parameter, the following keywords are supported. Any missing attributes are omitted.

  • house_number: (‘1234’)

  • street_prefix_abbr: (‘N’, ‘S’, ‘E’, or ‘W’)

  • street_prefix_full: (‘North’, ‘South’, ‘East’, or ‘West’)

  • street_name: (‘Main’)

  • street_suffix_abbr: (‘St’, ‘Ave’)

  • street_suffix_full: (‘Street’, ‘Avenue’)

  • apartment: (‘Apt 1’)

  • building: (‘Staples Center’)

  • city: (‘Los Angeles’)

  • state_abbr: (‘CA’)

  • state_full: (‘California’)

  • zipcode: (‘57903’)

The default output_format is “(building) house_number street_prefix_abbr street_name street_suffix_abbr, apartment, city, state_abbr zipcode”

The must_contain parameter takes a tuple containing parts of the address that must be included for the address to be successfully cleaned, the following keywords are supported.

  • house_number: (‘1234’)

  • street_prefix: (‘N’, ‘North’)

  • street_name: (‘Main’)

  • street_suffix: (‘St’, ‘Avenue’)

  • apartment: (‘Apt 1’)

  • building: (‘Staples Center’)

  • city: (‘Los Angeles’)

  • state: (‘CA’, ‘California’)

  • zipcode: (‘57903’)

The default value for must_contain is ("house_number", "street_name"). Therefore, by default addresses must contain a house number and street name to be successfully cleaned.

Invalid parsing is handled with the errors parameter:

  • “coerce” (default): invalid parsing will be set to NaN

  • “ignore”: invalid parsing will return the input

  • “raise”: invalid parsing will raise an exception

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (the value must have been transformed).

  • How many values could not be parsed.

  • A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.

The following sections demonstrate the functionality of clean_address() and validate_address().

An example dirty dataset

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "address": [
            "123 Pine Ave.",
            "main st",
            "1234 west main heights 57033",
            "apt 1 789 s maple rd manhattan",
            "robie house, 789 north main street",
            "1111 S Figueroa St, Los Angeles, CA 90015",
            "(staples center) 1111 S Figueroa St, Los Angeles",
            "hello",
            np.nan,
            "NULL"
        ]
    }
)
df
[1]:
address
0 123 Pine Ave.
1 main st
2 1234 west main heights 57033
3 apt 1 789 s maple rd manhattan
4 robie house, 789 north main street
5 1111 S Figueroa St, Los Angeles, CA 90015
6 (staples center) 1111 S Figueroa St, Los Angeles
7 hello
8 NaN
9 NULL

1. Default clean_address()

By default the output_format parameter is set to “(building) house_number street_prefix_abbr street_name street_suffix_abbr apartment, city, state_abbr zipcode” and the must_contain parameter is set ("house_number", "street_name"). The errors parameter is set to “coerce” (set NaN when parsing is invalid).

[2]:
from dataprep.clean import clean_address
clean_address(df, "address")
Address Cleaning Report:
        5 values cleaned (50.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[2]:
address address_clean
0 123 Pine Ave. 123 Pine Ave.
1 main st NaN
2 1234 west main heights 57033 1234 W. Main Hts., 57033
3 apt 1 789 s maple rd manhattan 789 S. Maple Rd., Apt 1, Manhattan
4 robie house, 789 north main street (Robie House) 789 N. Main St.
5 1111 S Figueroa St, Los Angeles, CA 90015 1111 S. Figueroa St., Los Angeles, CA 90015
6 (staples center) 1111 S Figueroa St, Los Angeles (Staples Center) 1111 S. Figueroa St., Los Ang...
7 hello NaN
8 NaN NaN
9 NULL NaN

Note that “123 Pine Ave.” is considered not cleaned in the report since its resulting format is the same as the input. Also, “main st” is invalid since it does not contain a house number.

2. Output formats

[3]:
clean_address(
    df,
    "address",
    output_format="(zipcode) street_prefix_full street_name ~state_full~"
)
Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[3]:
address address_clean
0 123 Pine Ave. Pine
1 main st NaN
2 1234 west main heights 57033 (57033) West Main
3 apt 1 789 s maple rd manhattan South Maple
4 robie house, 789 north main street North Main
5 1111 S Figueroa St, Los Angeles, CA 90015 (90015) South Figueroa ~California~
6 (staples center) 1111 S Figueroa St, Los Angeles South Figueroa
7 hello NaN
8 NaN NaN
9 NULL NaN
[4]:
clean_address(
    df,
    "address",
    output_format="house_number street_name street_suffix_full (building)",
)
Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[4]:
address address_clean
0 123 Pine Ave. 123 Pine Avenue
1 main st NaN
2 1234 west main heights 57033 1234 Main Heights
3 apt 1 789 s maple rd manhattan 789 Maple Road
4 robie house, 789 north main street 789 Main Street (Robie House)
5 1111 S Figueroa St, Los Angeles, CA 90015 1111 Figueroa Street
6 (staples center) 1111 S Figueroa St, Los Angeles 1111 Figueroa Street (Staples Center)
7 hello NaN
8 NaN NaN
9 NULL NaN

Splitting The Output

A tab character can be placed between address keywords to split the output into separate columns. The column names are taken from the output format.

[5]:
clean_address(
    df,
    "address",
    output_format="house_number street_name \t state_full"
)
Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[5]:
address house_number street_name state_full
0 123 Pine Ave. 123 Pine NaN
1 main st NaN NaN
2 1234 west main heights 57033 1234 Main NaN
3 apt 1 789 s maple rd manhattan 789 Maple NaN
4 robie house, 789 north main street 789 Main NaN
5 1111 S Figueroa St, Los Angeles, CA 90015 1111 Figueroa California
6 (staples center) 1111 S Figueroa St, Los Angeles 1111 Figueroa NaN
7 hello NaN NaN
8 NaN NaN NaN
9 NULL NaN NaN

3. must_contain parameter

This parameter takes a tuple containing parts of the address that must be included for the address to be successfully cleaned.

[6]:
clean_address(
    df, "address", must_contain=("house_number", "zipcode")
)
Address Cleaning Report:
        2 values cleaned (20.0%)
        6 values unable to be parsed (60.0%), set to NaN
Result contains 2 (20.0%) values in the correct format and 8 null values (80.0%)
[6]:
address address_clean
0 123 Pine Ave. NaN
1 main st NaN
2 1234 west main heights 57033 1234 W. Main Hts., 57033
3 apt 1 789 s maple rd manhattan NaN
4 robie house, 789 north main street NaN
5 1111 S Figueroa St, Los Angeles, CA 90015 1111 S. Figueroa St., Los Angeles, CA 90015
6 (staples center) 1111 S Figueroa St, Los Angeles NaN
7 hello NaN
8 NaN NaN
9 NULL NaN

4. split parameter

The split parameter adds individual columns containing the cleaned address values to the given DataFrame.

[7]:
clean_address(df, "address", split=True)
Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[7]:
address building house_number street_prefix_abbr street_name street_suffix_abbr apartment city state_abbr zipcode
0 123 Pine Ave. NaN 123 NaN Pine Ave. NaN NaN NaN NaN
1 main st NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1234 west main heights 57033 NaN 1234 W. Main Hts. NaN NaN NaN 57033
3 apt 1 789 s maple rd manhattan NaN 789 S. Maple Rd. Apt 1 Manhattan NaN NaN
4 robie house, 789 north main street Robie House 789 N. Main St. NaN NaN NaN NaN
5 1111 S Figueroa St, Los Angeles, CA 90015 NaN 1111 S. Figueroa St. NaN Los Angeles CA 90015
6 (staples center) 1111 S Figueroa St, Los Angeles Staples Center 1111 S. Figueroa St. NaN Los Angeles NaN NaN
7 hello NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NULL NaN NaN NaN NaN NaN NaN NaN NaN NaN

Setting split to True is equivalent to placing tabs between each word in the output_format and removing all characters that are not part of an address keyword (ie. commas). Column names are taken from the address keywords in the output_format.

[8]:
clean_address(
    df,
    "address",
    split=True,
    output_format="house_number, street_name, building"
)
Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[8]:
address house_number street_name building
0 123 Pine Ave. 123 Pine NaN
1 main st NaN NaN NaN
2 1234 west main heights 57033 1234 Main NaN
3 apt 1 789 s maple rd manhattan 789 Maple NaN
4 robie house, 789 north main street 789 Main Robie House
5 1111 S Figueroa St, Los Angeles, CA 90015 1111 Figueroa NaN
6 (staples center) 1111 S Figueroa St, Los Angeles 1111 Figueroa Staples Center
7 hello NaN NaN NaN
8 NaN NaN NaN NaN
9 NULL NaN NaN NaN

5. inplace parameter

This just deletes the given column from the returned dataframe. A new column containing cleaned addresses is added with a title in the format "{original title}_clean".

[9]:
clean_address(df, "address", inplace=True)
Address Cleaning Report:
        5 values cleaned (50.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[9]:
address_clean
0 123 Pine Ave.
1 NaN
2 1234 W. Main Hts., 57033
3 789 S. Maple Rd., Apt 1, Manhattan
4 (Robie House) 789 N. Main St.
5 1111 S. Figueroa St., Los Angeles, CA 90015
6 (Staples Center) 1111 S. Figueroa St., Los Ang...
7 NaN
8 NaN
9 NaN

inplace and split

[10]:
clean_address(df, "address", inplace=True, split=True)
Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[10]:
building house_number street_prefix_abbr street_name street_suffix_abbr apartment city state_abbr zipcode
0 NaN 123 NaN Pine Ave. NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 1234 W. Main Hts. NaN NaN NaN 57033
3 NaN 789 S. Maple Rd. Apt 1 Manhattan NaN NaN
4 Robie House 789 N. Main St. NaN NaN NaN NaN
5 NaN 1111 S. Figueroa St. NaN Los Angeles CA 90015
6 Staples Center 1111 S. Figueroa St. NaN Los Angeles NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN

6. validate_address()

validate_address() returns True when the input is a valid address value otherwise it returns False. Valid types are the same as clean_address().

[11]:
from dataprep.clean import validate_address

print(validate_address("123 main st"))
print(validate_address("main st"))
print(validate_address("apt 1 s maple rd manhattan", must_contain=("apartment",)))
print(validate_address("(staples center) 1111 S Figueroa St, Los Angeles"))
print(validate_address("789 North Maple Way Boston, MA"))
True
False
True
True
True

validate_address() on a pandas series

[12]:
df["valid"] = validate_address(df["address"])
df
[12]:
address valid
0 123 Pine Ave. True
1 main st False
2 1234 west main heights 57033 True
3 apt 1 789 s maple rd manhattan True
4 robie house, 789 north main street True
5 1111 S Figueroa St, Los Angeles, CA 90015 True
6 (staples center) 1111 S Figueroa St, Los Angeles True
7 hello False
8 NaN False
9 NULL False

must_contain

[13]:
df["valid"] = validate_address(df["address"], must_contain=("building", "city"))
df
[13]:
address valid
0 123 Pine Ave. False
1 main st False
2 1234 west main heights 57033 False
3 apt 1 789 s maple rd manhattan False
4 robie house, 789 north main street False
5 1111 S Figueroa St, Los Angeles, CA 90015 False
6 (staples center) 1111 S Figueroa St, Los Angeles True
7 hello False
8 NaN False
9 NULL False