The function clean_address() cleans a column containing United States street addresses and standardizes them in a desired format. The function validate_address() validates either a single address or a column of addresses, returning True if the value is valid, and False otherwise. Address parsing is done using the usaddress library.
Addresses can be converted to a specific format via the output_format parameter. The following keywords are supported; any missing attributes are omitted from the output.
house_number: (‘1234’)
street_prefix_abbr: (‘N’, ‘S’, ‘E’, or ‘W’)
street_prefix_full: (‘North’, ‘South’, ‘East’, or ‘West’)
street_name: (‘Main’)
street_suffix_abbr: (‘St’, ‘Ave’)
street_suffix_full: (‘Street’, ‘Avenue’)
apartment: (‘Apt 1’)
building: (‘Staples Center’)
city: (‘Los Angeles’)
state_abbr: (‘CA’)
state_full: (‘California’)
zipcode: (‘57903’)
The default output_format is “(building) house_number street_prefix_abbr street_name street_suffix_abbr, apartment, city, state_abbr zipcode”
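To make the template idea concrete, here is a minimal sketch of how a format string of this kind could be filled in from already-parsed address parts. The render_address helper and the parts dictionary are hypothetical and not part of DataPrep; the library's own formatting logic may differ, particularly in how it tidies punctuation around omitted attributes.

```python
import re

# Keywords recognised in the template (taken from the list above).
KEYWORDS = (
    "house_number", "street_prefix_abbr", "street_prefix_full",
    "street_name", "street_suffix_abbr", "street_suffix_full",
    "apartment", "building", "city", "state_abbr", "state_full", "zipcode",
)
KEYWORD_PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")


def render_address(parts, output_format):
    """Fill a template with parsed parts, omitting missing attributes.

    Hypothetical helper for illustration only; not DataPrep's implementation.
    """
    # Replace each keyword with its value, or with nothing if it is missing.
    text = KEYWORD_PATTERN.sub(lambda m: parts.get(m.group(1), ""), output_format)
    text = re.sub(r"\(\s*\)", "", text)    # drop parentheses left empty
    text = re.sub(r"\s+,", ",", text)      # no whitespace before a comma
    text = re.sub(r",(\s*,)+", ",", text)  # collapse runs of commas
    text = re.sub(r"\s{2,}", " ", text)    # collapse runs of whitespace
    return text.strip(" ,")


# Hand-written parts for illustration; a real parser would produce these.
parts = {
    "house_number": "1111",
    "street_prefix_abbr": "S",
    "street_name": "Figueroa",
    "street_suffix_abbr": "St",
    "city": "Los Angeles",
    "state_abbr": "CA",
    "zipcode": "90015",
}
default_format = (
    "(building) house_number street_prefix_abbr street_name "
    "street_suffix_abbr, apartment, city, state_abbr zipcode"
)
print(render_address(parts, default_format))
# 1111 S Figueroa St, Los Angeles, CA 90015
```

Because the building and apartment are absent from parts, their slots and the punctuation around them disappear from the result.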
The must_contain parameter takes a tuple of address parts that must be present for the address to be cleaned successfully. In addition to house_number, street_name, apartment, building, city, and zipcode from the list above, the following keywords are supported.
street_prefix: (‘N’, ‘North’)
street_suffix: (‘St’, ‘Avenue’)
state: (‘CA’, ‘California’)
The default value for must_contain is ("house_number", "street_name"), so by default an address must contain both a house number and a street name to be cleaned successfully.
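The must_contain check can be pictured as a simple membership test over the parsed parts. The sketch below assumes the address has already been parsed into a dictionary keyed by the generic keywords; contains_required and the sample dictionaries are hypothetical and only illustrate the rule, not DataPrep's internals.

```python
def contains_required(parts, must_contain=("house_number", "street_name")):
    """Return True if every required address part is present and non-empty.

    Hypothetical helper for illustration; not DataPrep's implementation.
    """
    return all(parts.get(key) for key in must_contain)


# Hand-written stand-ins for what a parser might return.
parsed_main_st = {"street_name": "main", "street_suffix": "st"}
parsed_pine = {"house_number": "123", "street_name": "Pine", "street_suffix": "Ave."}

print(contains_required(parsed_main_st))  # False: no house number
print(contains_required(parsed_pine))     # True
print(contains_required(parsed_pine, must_contain=("house_number", "zipcode")))  # False
```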
Invalid parsing is handled with the errors parameter:
“coerce” (default): values that cannot be parsed are set to NaN
“ignore”: values that cannot be parsed are returned unchanged
“raise”: a value that cannot be parsed raises an exception
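The three modes amount to a small dispatch on the errors value. The following sketch shows one way such a policy could be written; handle_invalid is a hypothetical helper, and the use of ValueError is an assumption, since the text above does not say which exception type the library raises.

```python
import math


def handle_invalid(value, errors="coerce"):
    """Decide what to return for a value that could not be parsed.

    Hypothetical helper mirroring the three documented modes.
    """
    if errors == "coerce":
        return math.nan  # the invalid value becomes NaN
    if errors == "ignore":
        return value     # the invalid value is passed through unchanged
    if errors == "raise":
        # The exception type is an assumption made for this sketch.
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors mode: {errors!r}")


print(handle_invalid("hello", errors="ignore"))  # hello
```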
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
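The counts in the report can be derived by comparing each input with its output. Below is a rough sketch of that bookkeeping, assuming the caller already knows how many values failed to parse; summarize and the toy values are hypothetical and not taken from DataPrep.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def summarize(original, cleaned, unparsable):
    """Compute report-style counts as (count, percentage) pairs.

    Hypothetical helper for illustration; not DataPrep's code.
    """
    total = len(original)

    def pct(n):
        return round(100 * n / total, 1)

    # A value counts as cleaned only if it was actually transformed.
    changed = sum(
        1
        for before, after in zip(original, cleaned)
        if not is_null(after) and before != after
    )
    nulls = sum(1 for after in cleaned if is_null(after))
    return {
        "cleaned": (changed, pct(changed)),
        "unparsable": (unparsable, pct(unparsable)),
        "correct_format": (total - nulls, pct(total - nulls)),
        "null": (nulls, pct(nulls)),
    }


# Toy values written by hand for this sketch.
original = ["123 Pine Ave.", "main st", "1 elm street", None]
cleaned = ["123 Pine Ave.", math.nan, "1 Elm St", math.nan]
print(summarize(original, cleaned, unparsable=1))
```

In this toy run the first value is in the correct format but does not count as cleaned, because its output equals its input.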
The following sections demonstrate the functionality of clean_address() and validate_address().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "address": [
            "123 Pine Ave.",
            "main st",
            "1234 west main heights 57033",
            "apt 1 789 s maple rd manhattan",
            "robie house, 789 north main street",
            "1111 S Figueroa St, Los Angeles, CA 90015",
            "(staples center) 1111 S Figueroa St, Los Angeles",
            "hello",
            np.nan,
            "NULL",
        ]
    }
)
df
By default, the output_format parameter is set to “(building) house_number street_prefix_abbr street_name street_suffix_abbr, apartment, city, state_abbr zipcode”, the must_contain parameter is set to ("house_number", "street_name"), and the errors parameter is set to “coerce” (values that cannot be parsed are set to NaN).
[2]:
from dataprep.clean import clean_address

clean_address(df, "address")
Address Cleaning Report:
    5 values cleaned (50.0%)
    2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
Note that “123 Pine Ave.” is not counted as cleaned in the report because its output is identical to the input. Also, “main st” is invalid because it does not contain a house number.
[3]:
clean_address(
    df,
    "address",
    output_format="(zipcode) street_prefix_full street_name ~state_full~",
)
Address Cleaning Report:
    6 values cleaned (60.0%)
    2 values unable to be parsed (20.0%), set to NaN
Result contains 6 (60.0%) values in the correct format and 4 null values (40.0%)
[4]:
clean_address(
    df,
    "address",
    output_format="house_number street_name street_suffix_full (building)",
)
A tab character can be placed between address keywords to split the output into separate columns. The column names are taken from the output format.
[5]:
clean_address(
    df,
    "address",
    output_format="house_number street_name \t state_full",
)
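The tab-splitting behaviour can be sketched as cutting the format string at each tab, which yields one template per output column. The split_format_columns helper is hypothetical; that each resulting chunk doubles as the column name is an assumption based on the statement that column names are taken from the output format.

```python
def split_format_columns(output_format):
    """Split a tab-separated output_format into one template per column.

    Hypothetical helper for illustration; not DataPrep's implementation.
    """
    chunks = (chunk.strip() for chunk in output_format.split("\t"))
    return [chunk for chunk in chunks if chunk]


print(split_format_columns("house_number street_name \t state_full"))
# ['house_number street_name', 'state_full']
```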
The must_contain parameter takes a tuple of address parts that must be present for the address to be cleaned successfully.
[6]:
clean_address(df, "address", must_contain=("house_number", "zipcode"))
Address Cleaning Report:
    2 values cleaned (20.0%)
    6 values unable to be parsed (60.0%), set to NaN
Result contains 2 (20.0%) values in the correct format and 8 null values (80.0%)
split
The split parameter adds individual columns containing the cleaned address values to the given DataFrame.
[7]:
clean_address(df, "address", split=True)
Setting split to True is equivalent to placing a tab between the keywords in the output_format and removing all characters that are not part of an address keyword (e.g., commas). Column names are taken from the address keywords in the output_format.
[8]:
clean_address(
    df,
    "address",
    split=True,
    output_format="house_number, street_name, building",
)
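The effect of split=True on column names can be sketched as keeping only the address keywords found in the format string, in order, and discarding everything else. The split_columns helper below is hypothetical and only illustrates the described behaviour.

```python
import re

# Keywords accepted by output_format, as listed earlier in this document.
KEYWORDS = {
    "house_number", "street_prefix_abbr", "street_prefix_full",
    "street_name", "street_suffix_abbr", "street_suffix_full",
    "apartment", "building", "city", "state_abbr", "state_full", "zipcode",
}


def split_columns(output_format):
    """Return the address keywords in a format string, in order of appearance.

    Hypothetical helper for illustration; not DataPrep's implementation.
    """
    tokens = re.findall(r"[A-Za-z_]+", output_format)
    return [token for token in tokens if token in KEYWORDS]


print(split_columns("house_number, street_name, building"))
# ['house_number', 'street_name', 'building']
```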
inplace
The inplace parameter deletes the given column from the returned DataFrame. A new column containing the cleaned addresses is added, named in the format "{original title}_clean".
[9]:
clean_address(df, "address", inplace=True)
[10]:
clean_address(df, "address", inplace=True, split=True)
validate_address() returns True when the input is a valid address and False otherwise. The values considered valid are the same as for clean_address().
[11]:
from dataprep.clean import validate_address

print(validate_address("123 main st"))
print(validate_address("main st"))
print(validate_address("apt 1 s maple rd manhattan", must_contain=("apartment",)))
print(validate_address("(staples center) 1111 S Figueroa St, Los Angeles"))
print(validate_address("789 North Maple Way Boston, MA"))
True
False
True
True
True
[12]:
df["valid"] = validate_address(df["address"])
df
[13]:
df["valid"] = validate_address(df["address"], must_contain=("building", "city"))
df