US Street Addresses¶

Introduction¶

The function clean_address() cleans a column containing United States street addresses and standardizes them in a desired format. The function validate_address() validates either a single address or a column of addresses, returning True if the value is valid, and False otherwise. Address parsing is done using the usaddress library.

Addresses can be converted to a specific format via the output_format parameter, the following keywords are supported. Any missing attributes are omitted.

house_number: (‘1234’)
street_prefix_abbr: (‘N’, ‘S’, ‘E’, or ‘W’)
street_prefix_full: (‘North’, ‘South’, ‘East’, or ‘West’)
street_name: (‘Main’)
street_suffix_abbr: (‘St’, ‘Ave’)
street_suffix_full: (‘Street’, ‘Avenue’)
apartment: (‘Apt 1’)
building: (‘Staples Center’)
city: (‘Los Angeles’)
state_abbr: (‘CA’)
state_full: (‘California’)
zipcode: (‘57903’)

The default output_format is “(building) house_number street_prefix_abbr street_name street_suffix_abbr, apartment, city, state_abbr zipcode”

The must_contain parameter takes a tuple containing parts of the address that must be included for the address to be successfully cleaned, the following keywords are supported.

house_number: (‘1234’)
street_prefix: (‘N’, ‘North’)
street_name: (‘Main’)
street_suffix: (‘St’, ‘Avenue’)
apartment: (‘Apt 1’)
building: (‘Staples Center’)
city: (‘Los Angeles’)
state: (‘CA’, ‘California’)
zipcode: (‘57903’)

The default value for must_contain is ("house_number", "street_name"). Therefore, by default addresses must contain a house number and street name to be successfully cleaned.

Invalid parsing is handled with the errors parameter:

“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception

After cleaning, a report is printed that provides the following information:

How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.

The following sections demonstrate the functionality of clean_address() and validate_address().

An example dirty dataset¶

[1]:

import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "address": [
            "123 Pine Ave.",
            "main st",
            "1234 west main heights 57033",
            "apt 1 789 s maple rd manhattan",
            "robie house, 789 north main street",
            "1111 S Figueroa St, Los Angeles, CA 90015",
            "(staples center) 1111 S Figueroa St, Los Angeles",
            "hello",
            np.nan,
            "NULL"
        ]
    }
)
df

[1]:

	address
0	123 Pine Ave.
1	main st
2	1234 west main heights 57033
3	apt 1 789 s maple rd manhattan
4	robie house, 789 north main street
5	1111 S Figueroa St, Los Angeles, CA 90015
6	(staples center) 1111 S Figueroa St, Los Angeles
7	hello
8	NaN
9	NULL

1. Default `clean_address()`¶

By default the output_format parameter is set to “(building) house_number street_prefix_abbr street_name street_suffix_abbr apartment, city, state_abbr zipcode” and the must_contain parameter is set ("house_number", "street_name"). The errors parameter is set to “coerce” (set NaN when parsing is invalid).

[2]:

from dataprep.clean import clean_address
clean_address(df, "address")

Address Cleaning Report:
        5 values cleaned (50.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[2]:

	address	address_clean
0	123 Pine Ave.	123 Pine Ave.
1	main st	NaN
2	1234 west main heights 57033	1234 W. Main Hts., 57033
3	apt 1 789 s maple rd manhattan	789 S. Maple Rd., Apt 1, Manhattan
4	robie house, 789 north main street	(Robie House) 789 N. Main St.
5	1111 S Figueroa St, Los Angeles, CA 90015	1111 S. Figueroa St., Los Angeles, CA 90015
6	(staples center) 1111 S Figueroa St, Los Angeles	(Staples Center) 1111 S. Figueroa St., Los Ang...
7	hello	NaN
8	NaN	NaN
9	NULL	NaN

Note that “123 Pine Ave.” is considered not cleaned in the report since its resulting format is the same as the input. Also, “main st” is invalid since it does not contain a house number.

2. Output formats¶

[3]:

clean_address(
    df,
    "address",
    output_format="(zipcode) street_prefix_full street_name ~state_full~"
)

Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[3]:

	address	address_clean
0	123 Pine Ave.	Pine
1	main st	NaN
2	1234 west main heights 57033	(57033) West Main
3	apt 1 789 s maple rd manhattan	South Maple
4	robie house, 789 north main street	North Main
5	1111 S Figueroa St, Los Angeles, CA 90015	(90015) South Figueroa ~California~
6	(staples center) 1111 S Figueroa St, Los Angeles	South Figueroa
7	hello	NaN
8	NaN	NaN
9	NULL	NaN

[4]:

clean_address(
    df,
    "address",
    output_format="house_number street_name street_suffix_full (building)",
)

Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[4]:

	address	address_clean
0	123 Pine Ave.	123 Pine Avenue
1	main st	NaN
2	1234 west main heights 57033	1234 Main Heights
3	apt 1 789 s maple rd manhattan	789 Maple Road
4	robie house, 789 north main street	789 Main Street (Robie House)
5	1111 S Figueroa St, Los Angeles, CA 90015	1111 Figueroa Street
6	(staples center) 1111 S Figueroa St, Los Angeles	1111 Figueroa Street (Staples Center)
7	hello	NaN
8	NaN	NaN
9	NULL	NaN

Splitting The Output¶

A tab character can be placed between address keywords to split the output into separate columns. The column names are taken from the output format.

[5]:

clean_address(
    df,
    "address",
    output_format="house_number street_name \t state_full"
)

Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[5]:

	address	house_number street_name	state_full
0	123 Pine Ave.	123 Pine	NaN
1	main st	NaN	NaN
2	1234 west main heights 57033	1234 Main	NaN
3	apt 1 789 s maple rd manhattan	789 Maple	NaN
4	robie house, 789 north main street	789 Main	NaN
5	1111 S Figueroa St, Los Angeles, CA 90015	1111 Figueroa	California
6	(staples center) 1111 S Figueroa St, Los Angeles	1111 Figueroa	NaN
7	hello	NaN	NaN
8	NaN	NaN	NaN
9	NULL	NaN	NaN

3. `must_contain` parameter¶

This parameter takes a tuple containing parts of the address that must be included for the address to be successfully cleaned.

[6]:

clean_address(
    df, "address", must_contain=("house_number", "zipcode")
)

Address Cleaning Report:
        2 values cleaned (20.0%)
        6 values unable to be parsed (60.0%), set to NaN
Result contains 2 (20.0%) values in the correct format and 8 null values (80.0%)

[6]:

	address	address_clean
0	123 Pine Ave.	NaN
1	main st	NaN
2	1234 west main heights 57033	1234 W. Main Hts., 57033
3	apt 1 789 s maple rd manhattan	NaN
4	robie house, 789 north main street	NaN
5	1111 S Figueroa St, Los Angeles, CA 90015	1111 S. Figueroa St., Los Angeles, CA 90015
6	(staples center) 1111 S Figueroa St, Los Angeles	NaN
7	hello	NaN
8	NaN	NaN
9	NULL	NaN

4. `split` parameter¶

The split parameter adds individual columns containing the cleaned address values to the given DataFrame.

[7]:

clean_address(df, "address", split=True)

Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[7]:

	address	building	house_number	street_prefix_abbr	street_name	street_suffix_abbr	apartment	city	state_abbr	zipcode
0	123 Pine Ave.	NaN	123	NaN	Pine	Ave.	NaN	NaN	NaN	NaN
1	main st	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1234 west main heights 57033	NaN	1234	W.	Main	Hts.	NaN	NaN	NaN	57033
3	apt 1 789 s maple rd manhattan	NaN	789	S.	Maple	Rd.	Apt 1	Manhattan	NaN	NaN
4	robie house, 789 north main street	Robie House	789	N.	Main	St.	NaN	NaN	NaN	NaN
5	1111 S Figueroa St, Los Angeles, CA 90015	NaN	1111	S.	Figueroa	St.	NaN	Los Angeles	CA	90015
6	(staples center) 1111 S Figueroa St, Los Angeles	Staples Center	1111	S.	Figueroa	St.	NaN	Los Angeles	NaN	NaN
7	hello	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NULL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Setting split to True is equivalent to placing tabs between each word in the output_format and removing all characters that are not part of an address keyword (ie. commas). Column names are taken from the address keywords in the output_format.

[8]:

clean_address(
    df,
    "address",
    split=True,
    output_format="house_number, street_name, building"
)

Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[8]:

	address	house_number	street_name	building
0	123 Pine Ave.	123	Pine	NaN
1	main st	NaN	NaN	NaN
2	1234 west main heights 57033	1234	Main	NaN
3	apt 1 789 s maple rd manhattan	789	Maple	NaN
4	robie house, 789 north main street	789	Main	Robie House
5	1111 S Figueroa St, Los Angeles, CA 90015	1111	Figueroa	NaN
6	(staples center) 1111 S Figueroa St, Los Angeles	1111	Figueroa	Staples Center
7	hello	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN
9	NULL	NaN	NaN	NaN

5. `inplace` parameter¶

This just deletes the given column from the returned dataframe. A new column containing cleaned addresses is added with a title in the format "{original title}_clean".

[9]:

clean_address(df, "address", inplace=True)

Address Cleaning Report:
        5 values cleaned (50.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[9]:

	address_clean
0	123 Pine Ave.
1	NaN
2	1234 W. Main Hts., 57033
3	789 S. Maple Rd., Apt 1, Manhattan
4	(Robie House) 789 N. Main St.
5	1111 S. Figueroa St., Los Angeles, CA 90015
6	(Staples Center) 1111 S. Figueroa St., Los Ang...
7	NaN
8	NaN
9	NaN

`inplace` and `split`¶

[10]:

clean_address(df, "address", inplace=True, split=True)

Address Cleaning Report:
        6 values cleaned (60.0%)
        2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)

[10]:

	building	house_number	street_prefix_abbr	street_name	street_suffix_abbr	apartment	city	state_abbr	zipcode
0	NaN	123	NaN	Pine	Ave.	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	1234	W.	Main	Hts.	NaN	NaN	NaN	57033
3	NaN	789	S.	Maple	Rd.	Apt 1	Manhattan	NaN	NaN
4	Robie House	789	N.	Main	St.	NaN	NaN	NaN	NaN
5	NaN	1111	S.	Figueroa	St.	NaN	Los Angeles	CA	90015
6	Staples Center	1111	S.	Figueroa	St.	NaN	Los Angeles	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

6. `validate_address()`¶

validate_address() returns True when the input is a valid address value otherwise it returns False. Valid types are the same as clean_address().

[11]:

from dataprep.clean import validate_address

print(validate_address("123 main st"))
print(validate_address("main st"))
print(validate_address("apt 1 s maple rd manhattan", must_contain=("apartment",)))
print(validate_address("(staples center) 1111 S Figueroa St, Los Angeles"))
print(validate_address("789 North Maple Way Boston, MA"))

True
False
True
True
True

`validate_address()` on a pandas series¶

[12]:

df["valid"] = validate_address(df["address"])
df

[12]:

	address	valid
0	123 Pine Ave.	True
1	main st	False
2	1234 west main heights 57033	True
3	apt 1 789 s maple rd manhattan	True
4	robie house, 789 north main street	True
5	1111 S Figueroa St, Los Angeles, CA 90015	True
6	(staples center) 1111 S Figueroa St, Los Angeles	True
7	hello	False
8	NaN	False
9	NULL	False

`must_contain`¶

[13]:

df["valid"] = validate_address(df["address"], must_contain=("building", "city"))
df

[13]:

	address	valid
0	123 Pine Ave.	False
1	main st	False
2	1234 west main heights 57033	False
3	apt 1 789 s maple rd manhattan	False
4	robie house, 789 north main street	False
5	1111 S Figueroa St, Los Angeles, CA 90015	False
6	(staples center) 1111 S Figueroa St, Los Angeles	True
7	hello	False
8	NaN	False
9	NULL	False

US Street Addresses¶

Introduction¶

An example dirty dataset¶

1. Default clean_address()¶

2. Output formats¶

Splitting The Output¶

3. must_contain parameter¶

4. split parameter¶

5. inplace parameter¶

inplace and split¶

6. validate_address()¶

validate_address() on a pandas series¶

must_contain¶