The function clean_country() cleans a column containing country names and/or ISO 3166 country codes, and standardizes them in a desired format. The function validate_country() validates either a single country or a column of countries, returning True if the value is valid, and False otherwise. The countries/regions supported and the regular expressions used can be found on GitHub.
Countries can be converted to and from the following formats via the input_format and output_format parameters:
Short country name (name): “United States”
Official state name (official): “United States of America”
ISO 3166-1 alpha-2 (alpha-2): “US”
ISO 3166-1 alpha-3 (alpha-3): “USA”
ISO 3166-1 numeric (numeric): “840”
input_format can be set to “auto” which automatically infers the input format. A tuple of input formats may also be used to indicate that the input may be any of the given input formats.
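The conversion between formats can be pictured as a lookup in a table with one column per format. The following is a minimal sketch of that idea, not DataPrep's implementation: the table COUNTRIES and the helper convert() are hypothetical names, the table holds only two countries, and "auto" inference is not modelled.

```python
# Hypothetical miniature lookup table; DataPrep's real country data is far larger.
COUNTRIES = [
    {"name": "United States", "official": "United States of America",
     "alpha-2": "US", "alpha-3": "USA", "numeric": "840"},
    {"name": "Canada", "official": "Canada",
     "alpha-2": "CA", "alpha-3": "CAN", "numeric": "124"},
]


def convert(value, input_format, output_format="name"):
    """Convert value to output_format; input_format is a string or a tuple of strings."""
    # A single format is treated as a one-element tuple.
    formats = (input_format,) if isinstance(input_format, str) else tuple(input_format)
    key = str(value).strip().lower()
    # Try each allowed input format in turn and return the first direct match.
    for fmt in formats:
        for row in COUNTRIES:
            if row[fmt].lower() == key:
                return row[output_format]
    return None  # no match in any of the allowed formats


print(convert("US", "alpha-2", "official"))    # United States of America
print(convert("can", ("alpha-2", "alpha-3")))  # Canada
print(convert("can", "alpha-2"))               # None
```

The tuple form simply widens the set of columns that are searched, which is why "can" is found only when "alpha-3" is among the allowed formats.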
The strict parameter allows for control over the type of matching used for the “name” and “official” input formats.
False (default for clean_country()), search the input for a regex match
True (default for validate_country()), look for a direct match with a country value in the same format
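The difference between the two matching modes can be sketched with Python's re module. This is an illustration under stated assumptions: CANADA_REGEX is a made-up pattern rather than one of DataPrep's regexes, matches() is a hypothetical helper, and stripping whitespace before a strict comparison is an assumption.

```python
import re

# Hypothetical pattern for illustration; not one of DataPrep's actual regexes.
CANADA_REGEX = re.compile(r"canada", re.IGNORECASE)
CANADA_NAME = "Canada"


def matches(value, strict):
    """Return True if value matches Canada under the chosen matching mode."""
    text = value.strip()  # assumption: surrounding whitespace is ignored
    if strict:
        # Direct match: the whole input must equal the country name.
        return text.lower() == CANADA_NAME.lower()
    # Regex search: the pattern may occur anywhere in the input.
    return CANADA_REGEX.search(text) is not None


print(matches("foo canada bar", strict=False))  # True
print(matches("foo canada bar", strict=True))   # False
print(matches(" Canada ", strict=True))         # True
```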
The fuzzy_dist parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex.
0 (default), countries at most 0 edits from matching a regex are successfully cleaned
1, countries at most 1 edit from matching a regex are successfully cleaned
n, countries at most n edits from matching a regex are successfully cleaned
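The edit distance described above is the Levenshtein distance. A minimal sketch of how it can be computed between two plain strings follows; edit_distance() is a hypothetical helper, and it compares against a fixed country name, whereas clean_country() measures the distance to a regex.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single character
    insertions, deletions or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("canada", "canada"))  # 0
print(edit_distance("cnada", "canada"))   # 1
print(edit_distance("cxnda", "canada"))   # 2
```

Under this definition "cnada" would be accepted with fuzzy_dist=1, while "cxnda" would need fuzzy_dist=2.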
Invalid parsing is handled with the errors parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
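The three error-handling modes can be sketched for a single value as follows. The lookup KNOWN and the helper clean_one() are hypothetical, and the use of ValueError is an assumption; the exception type raised by clean_country() is not stated here.

```python
import math

# Hypothetical lookup standing in for the real matching step.
KNOWN = {"canada": "Canada", "ireland": "Ireland"}


def clean_one(value, errors="coerce"):
    """Clean one value, handling a failed parse according to errors."""
    key = str(value).strip().lower()
    if key in KNOWN:
        return KNOWN[key]
    if errors == "coerce":
        return math.nan   # invalid input becomes NaN
    if errors == "ignore":
        return value      # invalid input is returned unchanged
    if errors == "raise":
        # Assumption: ValueError is used for illustration only.
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors option {errors!r}")


print(clean_one(" ireland "))                # Ireland
print(clean_one("hello"))                    # nan
print(clean_one("hello", errors="ignore"))   # hello
```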
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
The following sections demonstrate the functionality of clean_country() and validate_country().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland",
        " ireland ", "congo, kinshasa", "congo, brazzaville", 304,
        "233", " tr ", "ARG", "hello", np.nan, "NULL"
    ]
})
df
By default, the input_format parameter is set to “auto” (the input format is inferred automatically) and the output_format parameter is set to “name”. The fuzzy_dist parameter is set to 0 and strict is False. The errors parameter is set to “coerce” (invalid values are set to NaN).
[2]:
from dataprep.clean import clean_country

clean_country(df, "country")
Country Cleaning Report:
    8 values cleaned (57.14%)
    3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)
Note that “Canada” is not counted as cleaned in the report because its cleaned value is the same as the input. Also, “northern ireland” is invalid because it is part of the United Kingdom. Kinshasa and Brazzaville are the capital cities of their respective countries.
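The percentages in the report are all relative to the 14 rows of the input column. The arithmetic can be checked with the short sketch below. It assumes one reading of the counts: the 9 values in the correct format are the 8 cleaned values plus the unchanged “Canada”, and the 5 null values are the 3 unparsed values plus the np.nan and "NULL" entries. The helper pct() is hypothetical.

```python
total = 14     # rows in the example DataFrame
cleaned = 8    # values whose cleaned output differs from the input
unparsed = 3   # values that could not be parsed and were set to NaN
unchanged = 1  # "Canada", already in the output format
nulls_in = 2   # assumption: np.nan and "NULL" count as null input values


def pct(count, total):
    """Percentage of total, rounded to two decimals as in the report."""
    return round(100 * count / total, 2)


correct = cleaned + unchanged      # 9 values in the correct format
null_total = unparsed + nulls_in   # 5 null values in the result

print(pct(cleaned, total))     # 57.14
print(pct(unparsed, total))    # 21.43
print(pct(correct, total))     # 64.29
print(pct(null_total, total))  # 35.71
```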
This section demonstrates the supported country input formats.
If the input contains a match with one of the country regexes then it is successfully converted.
[3]:
clean_country(df, "country", input_format="name")
Country Cleaning Report:
    4 values cleaned (28.57%)
    7 values unable to be parsed (50.0%), set to NaN
Result contains 5 (35.71%) values in the correct format and 9 null values (64.29%)
input_format="official" does the same thing as input_format="name".
[4]:
clean_country(df, "country", input_format="official")
Looks for a direct match with an ISO 3166-1 alpha-2 country code, case insensitive and ignoring leading and trailing whitespace.
[5]:
clean_country(df, "country", input_format="alpha-2")
Country Cleaning Report:
    1 values cleaned (7.14%)
    11 values unable to be parsed (78.57%), set to NaN
Result contains 1 (7.14%) values in the correct format and 13 null values (92.86%)
Looks for a direct match with an ISO 3166-1 alpha-3 country code, case insensitive and ignoring leading and trailing whitespace.
[6]:
clean_country(df, "country", input_format="alpha-3")
Looks for a direct match with an ISO 3166-1 numeric country code, ignoring leading and trailing whitespace. Works on integers and strings.
[7]:
clean_country(df, "country", input_format="numeric")
Country Cleaning Report:
    2 values cleaned (14.29%)
    10 values unable to be parsed (71.43%), set to NaN
Result contains 2 (14.29%) values in the correct format and 12 null values (85.71%)
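Direct matching on codes amounts to normalizing the input and testing membership in a set of valid codes. The sketch below illustrates this with the codes that appear in the example column. VALID_CODES and is_valid_code() are hypothetical names, the sets are a tiny subset of ISO 3166-1, and zero-padding integers to three digits is an assumption about how numeric input could be handled.

```python
# Hypothetical subset of ISO 3166-1 codes taken from the example column.
VALID_CODES = {
    "alpha-2": {"TR"},
    "alpha-3": {"ARG"},
    "numeric": {"304", "233"},
}


def is_valid_code(value, fmt):
    """Direct, case-insensitive match ignoring surrounding whitespace."""
    key = str(value).strip().upper()  # str() lets integers and strings share one path
    if fmt == "numeric":
        key = key.zfill(3)  # assumption: numeric codes are compared as 3-digit strings
    return key in VALID_CODES[fmt]


print(is_valid_code(" tr ", "alpha-2"))  # True
print(is_valid_code("ARG", "alpha-3"))   # True
print(is_valid_code(304, "numeric"))     # True
print(is_valid_code("233", "numeric"))   # True
print(is_valid_code("ARG", "alpha-2"))   # False
```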
A tuple containing any combination of input formats may be used to clean any of the given input formats.
[8]:
clean_country(df, "country", input_format=("name", "alpha-2"))
Country Cleaning Report:
    5 values cleaned (35.71%)
    6 values unable to be parsed (42.86%), set to NaN
Result contains 6 (42.86%) values in the correct format and 8 null values (57.14%)
This section demonstrates the supported output country formats.
[9]:
clean_country(df, "country", output_format="official")
[10]:
clean_country(df, "country", output_format="alpha-2")
Country Cleaning Report:
    9 values cleaned (64.29%)
    3 values unable to be parsed (21.43%), set to NaN
Result contains 9 (64.29%) values in the correct format and 5 null values (35.71%)
[11]:
clean_country(df, "country", output_format="alpha-3")
[12]:
clean_country(df, "country", output_format="numeric")
[13]:
clean_country(df, "country", input_format="alpha-2", output_format="official")
The strict parameter controls the type of matching used for the “name” and “official” input formats. When False, the input is searched for a regex match. When True, matching is done by looking for a direct match with a country in the same format.
[14]:
clean_country(df, "country", strict=True)
“foo canada bar”, “congo, kinshasa” and “congo, brazzaville” are now invalid because they are not a direct match with a country in the “name” or “official” formats.
The fuzzy_dist parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex. If an input is successfully cleaned by clean_country() with fuzzy_dist=0 then that input with one character inserted, deleted or substituted will match with fuzzy_dist=1. This parameter only applies to the “name” and “official” input formats.
fuzzy_dist=1
Countries at most one edit away from matching a regex are successfully cleaned.
[15]:
df = pd.DataFrame({
    "country": [
        "canada", "cnada", "australa", "xntarctica", "koreea",
        "cxnda", "afghnitan", "country: cnada", "foo indnesia bar"
    ]
})
clean_country(df, "country", fuzzy_dist=1)
Country Cleaning Report:
    7 values cleaned (77.78%)
    2 values unable to be parsed (22.22%), set to NaN
Result contains 7 (77.78%) values in the correct format and 2 null values (22.22%)
fuzzy_dist=2
Countries at most two edits away from matching a regex are successfully cleaned.
[16]:
clean_country(df, "country", fuzzy_dist=2)
Country Cleaning Report:
    9 values cleaned (100.0%)
Result contains 9 (100.0%) values in the correct format and 0 null values (0.0%)
inplace
Setting inplace=True deletes the given column from the returned DataFrame. A new column containing the cleaned countries is added with a title in the format "{original title}_clean".
[17]:
clean_country(df, "country", fuzzy_dist=2, inplace=True)
validate_country() returns True when the input is a valid country value; otherwise it returns False. The accepted input types are the same as for clean_country(). By default strict=True, as opposed to clean_country(), where strict defaults to False. The default input_format is “auto”.
[18]:
from dataprep.clean import validate_country

print(validate_country("switzerland"))
print(validate_country("country = united states"))
print(validate_country("country = united states", strict=False))
print(validate_country("ca"))
print(validate_country(800))
True
False
True
True
True
Since strict=True by default, the inputs “foo canada bar”, “congo, kinshasa” and “congo, brazzaville” are invalid since they don’t directly match a country in the “name” or “official” formats.
[19]:
df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland",
        " ireland ", "congo, kinshasa", "congo, brazzaville", 304,
        "233", " tr ", "ARG", "hello", np.nan, "NULL"
    ]
})
df["valid"] = validate_country(df["country"])
df
strict=False
For the “name” and “official” input formats, the input is searched for a regex match.
[20]:
df["valid"] = validate_country(df["country"], strict=False)
df
[21]:
df["valid"] = validate_country(df["country"], input_format="numeric")
df
The country data and regular expressions used are based on the country_converter project.