API reference for the DataPrep.Clean subpackage.
Clean and standardize column headers for a DataFrame.
dataprep.clean.clean_headers.
clean_headers
Function to clean column headers (column names).
Read more in the User Guide.
df (Union[DataFrame, DataFrame]) – Dataframe from which column names are to be cleaned.
case (str) –
’snake’: ‘column_name’
’kebab’: ‘column-name’
’camel’: ‘columnName’
’pascal’: ‘ColumnName’
’const’: ‘COLUMN_NAME’
’sentence’: ‘Column name’
’title’: ‘Column Name’
’lower’: ‘column name’
’upper’: ‘COLUMN NAME’
(default: ‘snake’)
replace (Optional[Dict[str, str]]) –
{‘old_value’: ‘new_value’}
(default: None)
remove_accents (bool) –
If True, strip accents from the column names.
(default: True)
report (bool) –
If True, output the summary report. Otherwise, no report is output.
Examples
Clean column names by converting the names to camel case style, removing accents, and correcting a misspelling.
>>> df = pd.DataFrame({'FirstNom': ['Philip', 'Turanga'], 'lastName': ['Fry', 'Leela'], 'Téléphone': ['555-234-5678', '(604) 111-2335']})
>>> clean_headers(df, case='camel', replace={'Nom': 'Name'})
Column Headers Cleaning Report:
    2 values cleaned (66.67%)
  firstName lastName       telephone
0    Philip      Fry    555-234-5678
1   Turanga    Leela  (604) 111-2335
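The case styles, the replace mapping, and accent removal described above can be sketched with the standard library alone. This is an illustrative approximation, not the library's implementation: the helper names `strip_accents`, `split_words`, and `convert_header` are hypothetical, and the word-splitting rule (separators plus lower-to-upper boundaries) is an assumption.

```python
import re
import unicodedata
from typing import Dict, List, Optional


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_words(header: str) -> List[str]:
    """Split a header into lowercase words (assumed splitting rule)."""
    # Insert a space at lower-to-upper boundaries: 'firstName' -> 'first Name'.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", header)
    # Any run of characters other than ASCII letters and digits is a separator.
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def convert_header(header: str, case: str = "snake",
                   replace: Optional[Dict[str, str]] = None,
                   remove_accents: bool = True) -> str:
    """Hypothetical helper: apply replacements, strip accents, change case."""
    for old, new in (replace or {}).items():
        header = header.replace(old, new)
    if remove_accents:
        header = strip_accents(header)
    words = split_words(header)
    if case == "snake":
        return "_".join(words)
    if case == "kebab":
        return "-".join(words)
    if case == "camel":
        return "".join(words[:1] + [w.capitalize() for w in words[1:]])
    if case == "pascal":
        return "".join(w.capitalize() for w in words)
    if case == "const":
        return "_".join(words).upper()
    if case == "sentence":
        return " ".join(words).capitalize()
    if case == "title":
        return " ".join(w.capitalize() for w in words)
    if case == "lower":
        return " ".join(words)
    if case == "upper":
        return " ".join(words).upper()
    raise ValueError(f"unknown case style: {case!r}")


for name in ("FirstNom", "lastName", "Téléphone"):
    print(convert_header(name, case="camel", replace={"Nom": "Name"}))
```

With accent removal switched off, this sketch would split 'Téléphone' at the accented letters, because the separator pattern only recognizes ASCII letters and digits.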
Clean and validate a DataFrame column containing country names.
dataprep.clean.clean_country.
clean_country
Clean and standardize country names.
df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.
column (str) – The name of the column containing country names.
input_format (Union[str, Tuple[str, …]]) –
’auto’: infer the input format
’name’: country name (‘United States’)
’official’: official state name (‘United States of America’)
’alpha-2’: alpha-2 code (‘US’)
’alpha-3’: alpha-3 code (‘USA’)
’numeric’: numeric code (840)
Can also be a tuple containing any combination of input formats, for example to clean a column containing alpha-2 and numeric codes set input_format to (‘alpha-2’, ‘numeric’).
(default: ‘auto’)
output_format (str) –
(default: ‘name’)
fuzzy_dist (int) –
The maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) between a country value and input that will count as a match. Only applies to ‘auto’, ‘name’ and ‘official’ input formats.
(default: 0)
strict (bool) –
If True, matching for input formats ‘name’ and ‘official’ is done by looking for a direct match. If False, matching is done by searching the input for a regex match.
(default: False)
inplace (bool) –
If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.
errors (str) –
‘coerce’: invalid parsing will be set to NaN.
‘ignore’: invalid parsing will return the input.
‘raise’: invalid parsing will raise an exception.
(default: ‘coerce’)
progress (bool) –
If True, display a progress bar.
>>> df = pd.DataFrame({'country': [' Canada ', 'US']})
>>> clean_country(df, 'country')
Country Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
    country  country_clean
0   Canada          Canada
1        US  United States
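The fuzzy_dist parameter is defined in terms of edit distance. A minimal sketch of that measure follows, assuming the classic Levenshtein dynamic program; `edit_distance` and `fuzzy_match` are hypothetical helpers, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Count single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_match(value: str, names: Iterable[str],
                fuzzy_dist: int = 0) -> Optional[str]:
    """Return the first name within fuzzy_dist edits of value, else None."""
    for name in names:
        if edit_distance(value.lower(), name.lower()) <= fuzzy_dist:
            return name
    return None


print(fuzzy_match("Canda", ["Canada", "Chile"], fuzzy_dist=1))
```

With fuzzy_dist=0 the same call finds no match, which mirrors the default of requiring an exact match.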
validate_country
Validate country names.
x (Union[str, int, Series]) – pandas Series of countries or str/int country value.
If True, matching for input formats ‘name’ and ‘official’ is done by looking for a direct match. If False, matching is done by searching the input for a regex match.
>>> validate_country('United States')
True
>>> df = pd.DataFrame({'country': ['Canada', 'NaN']})
>>> validate_country(df['country'])
0     True
1    False
Name: country, dtype: bool
Union[bool, Series]
Clean and validate a DataFrame column containing dates and times.
dataprep.clean.clean_date.
clean_date
Clean and standardize dates and times.
column (str) – The name of the column containing dates.
The desired format of the date.
(default: ‘YYYY-MM-DD hh:mm:ss’)
input_timezone (str) –
Time zone of the input date.
(default: ‘UTC’)
output_timezone (str) –
The desired time zone of the date.
(default: ‘’)
fix_missing (str) –
’minimum’: fill hours, minutes, seconds with zeros, and month, day, year with January 1st, 2000.
’current’: fill with the current date and time.
’empty’: don’t fill missing components.
(default: ‘minimum’)
infer_day_first (bool) – If True, an ambiguous date such as ‘09-10-03’ is parsed day-first when another value, such as ‘25-09-03’, shows that the day comes first. The results are ‘2003-10-09’ and ‘2003-09-25’. If False, no inference is performed, and the results are ‘2003-09-10’ and ‘2003-09-25’. (default: False)
>>> df = pd.DataFrame({'date': ['Thu Sep 25 2003', 'Thu 10:36:28', '2003 09 25']})
>>> clean_date(df, 'date')
Dates Cleaning Report:
    3 values cleaned (100.0%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
              date           date_clean
0  Thu Sep 25 2003  2003-09-25 00:00:00
1     Thu 10:36:28  2000-01-01 10:36:28
2       2003 09 25  2003-09-25 00:00:00
validate_date
Validate dates and times.
date (Union[str, Series]) – pandas Series of dates or a date string
>>> validate_date('3rd of May 2001')
True
>>> df = pd.DataFrame({'date': ['2003/09/25', 'This is Sep.']})
>>> validate_date(df['date'])
0     True
1    False
Name: date, dtype: bool
Clean a DataFrame column containing duplicate values.
dataprep.clean.clean_duplication.
UserInterface
Bases: object
A user interface used by the clean_duplication function.
display
Display the UI.
Box
clean_duplication
Cleans and standardizes duplicate values in a DataFrame.
column (str) – The name of the column containing duplicate values.
df_var_name (str) –
Optional parameter containing the variable name of the DataFrame being cleaned. This is only needed for legacy compatibility with the original version of this function, which needed it to produce correct exported code.
(default: ‘default’)
page_size (int) –
The number of clusters to display on each page.
(default: 5)
After running clean_duplication(df, ‘city’) below in a notebook, a GUI will appear. Select the merge checkbox, press merge and re-cluster, then press finish.
>>> df = pd.DataFrame({'city': ['New York', 'new york']})
>>> clean_duplication(df, 'city')
       city
0  New York
1  New York
Clean and validate a DataFrame column containing email addresses.
dataprep.clean.clean_email.
clean_email
Clean and standardize email addresses.
column (str) – The name of the column containing email addresses.
remove_whitespace (bool) –
If True, remove all whitespace from the input value before verifying and cleaning it.
fix_domain (bool) –
If True, try to fix an invalid domain by applying the following strategies:
Swap neighboring characters.
Add a single character.
Remove a single character.
Swap each character with its nearby keys on the qwerty keyboard.
The first valid domain found will be returned.
split (bool) –
If True, split a column containing email addresses into one column for the usernames and another column for the domains.
If True, output the summary report. Otherwise, no report is output.
>>> df = pd.DataFrame({'email': ['Abc.example.com', 'Abc@example.com', 'H ELLO@hotmal.COM']})
>>> clean_email(df, 'email')
Email Cleaning Report:
    2 values with bad format (66.67%)
Result contains 1 (33.33%) values in the correct format and 2 null values (66.67%)
               email      email_clean
0    Abc.example.com              NaN
1    Abc@example.com  abc@example.com
2  H ELLO@hotmal.COM              NaN
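The domain-repair strategies listed for fix_domain can be approximated as a candidate generator. The sketch below is an assumption-laden illustration: `candidate_domains` and `fix_domain_sketch` are hypothetical names, the set of valid domains is supplied by the caller, and the qwerty-neighbor strategy is omitted for brevity.

```python
import string
from typing import Iterator, Optional, Set


def candidate_domains(domain: str) -> Iterator[str]:
    """Yield domains one edit away, in the order the strategies are listed."""
    alphabet = string.ascii_lowercase + string.digits + "-."
    # Strategy 1: swap neighboring characters.
    for i in range(len(domain) - 1):
        yield domain[:i] + domain[i + 1] + domain[i] + domain[i + 2:]
    # Strategy 2: add a single character at every position.
    for i in range(len(domain) + 1):
        for ch in alphabet:
            yield domain[:i] + ch + domain[i:]
    # Strategy 3: remove a single character.
    for i in range(len(domain)):
        yield domain[:i] + domain[i + 1:]


def fix_domain_sketch(domain: str, valid_domains: Set[str]) -> Optional[str]:
    """Return the domain itself if valid, else the first valid candidate."""
    domain = domain.lower()
    if domain in valid_domains:
        return domain
    for candidate in candidate_domains(domain):
        if candidate in valid_domains:
            return candidate
    return None


known = {"hotmail.com", "gmail.com"}
print(fix_domain_sketch("hotmal.COM", known))
```

Because candidates are tried in a fixed order, the first valid domain found wins, matching the behavior described in the parameter text.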
validate_email
Validate email addresses.
x (Union[str, Series]) – pandas Series of emails or a string containing an email.
>>> validate_email('Abc.example@com')
False
>>> df = pd.DataFrame({'email': ['abc.example.com', 'HELLO@HOTMAIL.COM']})
>>> validate_email(df['email'])
0    False
1     True
Name: email, dtype: bool
Clean and validate a DataFrame column containing geographic coordinates.
dataprep.clean.clean_lat_long.
clean_lat_long
Clean and standardize latitude and longitude coordinates.
lat_long (Optional[str]) – The name of the column containing latitude and longitude coordinates.
lat_col (Optional[str]) –
The name of the column containing latitude coordinates.
If specified, the parameter lat_long must be None.
long_col (Optional[str]) –
The name of the column containing longitude coordinates.
’dd’: decimal degrees (51.4934, 0.0098)
’ddh’: decimal degrees with hemisphere (‘51.4934° N, 0.0098° E’)
’dm’: degrees minutes (‘51° 29.604′ N, 0° 0.588′ E’)
’dms’: degrees minutes seconds (‘51° 29′ 36.24″ N, 0° 0′ 35.28″ E’)
(default: ‘dd’)
If True, split the latitude and longitude coordinates into one column for latitude and a separate column for longitude. Otherwise, merge the latitude and longitude coordinates into one column.
If True, delete the column(s) containing the data that was cleaned. Otherwise, keep the original column(s).
Split a column containing latitude and longitude strings into separate columns in decimal degrees format.
>>> df = pd.DataFrame({'coord': ['51° 29′ 36.24″ N, 0° 0′ 35.28″ E', '51.4934° N, 0.0098° E']})
>>> clean_lat_long(df, 'coord', split=True)
Latitude and Longitude Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
                              coord  latitude  longitude
0  51° 29′ 36.24″ N, 0° 0′ 35.28″ E   51.4934     0.0098
1             51.4934° N, 0.0098° E   51.4934     0.0098
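The arithmetic that relates the ‘dms’, ‘dm’, and ‘dd’ formats is short enough to sketch. `dms_to_decimal` is a hypothetical helper, and rounding to four decimal places is an assumption chosen to match the figures in the example above.

```python
def dms_to_decimal(degrees: float, minutes: float = 0.0,
                   seconds: float = 0.0, hemisphere: str = "N") -> float:
    """Convert degrees/minutes/seconds plus hemisphere to decimal degrees."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal degrees.
    if hemisphere.upper() in ("S", "W"):
        value = -value
    return round(value, 4)


print(dms_to_decimal(51, 29, 36.24, "N"))  # latitude from the 'dms' example
print(dms_to_decimal(0, 0, 35.28, "E"))    # longitude from the 'dms' example
print(dms_to_decimal(51, 29.604, 0, "N"))  # same latitude in 'dm' form
```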
validate_lat_long
Validate latitude and longitude coordinates.
x (Union[Series, str, float, Tuple[float, float]]) – A pandas Series, string, float, or tuple of floats, containing the latitude and/or longitude coordinates to be validated.
lat_long (bool) –
If True, valid values contain latitude and longitude coordinates. Parameters lat and lon must be False if lat_long is True.
lat (bool) –
If True, valid values contain only latitude coordinates. Parameters lat_long and lon must be False if lat is True.
lon (bool) –
If True, valid values contain only longitude coordinates. Parameters lat_long and lat must be False if lon is True.
Validate a coordinate string or series of coordinates.
>>> validate_lat_long('51° 29′ 36.24″ N, 0° 0′ 35.28″ E')
True
>>> df = pd.DataFrame({'coordinates': ['51° 29′ 36.24″ N, 0° 0′ 35.28″ E', 'NaN']})
>>> validate_lat_long(df['coordinates'])
0     True
1    False
Name: coordinates, dtype: bool
Clean and validate a DataFrame column containing IP addresses.
dataprep.clean.clean_ip.
clean_ip
Clean and standardize IP addresses.
column (str) – The name of the column containing IP addresses.
input_format (str) –
’auto’: parse both ipv4 and ipv6 addresses.
’ipv4’: only parse ipv4 addresses.
’ipv6’: only parse ipv6 addresses.
’compressed’: compressed representation (‘12.3.4.5’)
’full’: full representation (‘0012.0003.0004.0005’)
’binary’: binary representation (‘00001100000000110000010000000101’)
’hexa’: hexadecimal representation (‘0xc030405’)
’integer’: integer representation (201524229)
’packed’: packed binary representation (big-endian, a bytes object)
(default: ‘compressed’)
>>> df = pd.DataFrame({'ip': ['2001:0db8:85a3:0000:0000:8a2e:0370:7334', '233.5.6.000']})
>>> clean_ip(df, 'ip')
IP Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
                                        ip                      ip_clean
0  2001:0db8:85a3:0000:0000:8a2e:0370:7334  2001:db8:85a3::8a2e:370:7334
1                              233.5.6.000                     233.5.6.0
Union[DataFrame, DataFrame]
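The output formats listed for clean_ip map closely onto Python's standard ipaddress module. The sketch below reproduces the documented representations of ‘12.3.4.5’; `ipv4_representations` is a hypothetical helper, and the zero-padded ‘full’ form is built by hand because the module has no equivalent for IPv4.

```python
import ipaddress
from typing import Any, Dict


def ipv4_representations(text: str) -> Dict[str, Any]:
    """Return one IPv4 address in each documented output format."""
    addr = ipaddress.IPv4Address(text)
    number = int(addr)
    return {
        "compressed": str(addr),
        # Pad every octet to four digits, as in '0012.0003.0004.0005'.
        "full": ".".join(f"{int(octet):04d}" for octet in str(addr).split(".")),
        "binary": f"{number:032b}",
        "hexa": hex(number),
        "integer": number,
        "packed": addr.packed,  # 4 bytes, big-endian
    }


for key, value in ipv4_representations("12.3.4.5").items():
    print(f"{key}: {value!r}")
```

Note that recent Python versions reject octets with leading zeros, so an input such as '233.5.6.000' would need to be normalized before it reaches `ipaddress.IPv4Address`.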
validate_ip
Validate IP addresses.
x (Union[str, Series]) – pandas Series of IP addresses or a string containing an IP address.
’auto’: validate both ipv4 and ipv6 addresses.
’ipv4’: only validate ipv4 addresses.
’ipv6’: only validate ipv6 addresses.
>>> validate_ip('fdf8:f53b:82e4::53')
True
>>> df = pd.DataFrame({'ip': ['fdf8:f53b:82e4::53', None]})
>>> validate_ip(df['ip'])
0     True
1    False
Name: ip, dtype: bool
Clean and validate a DataFrame column containing phone numbers.
dataprep.clean.clean_phone.
clean_phone
Clean and standardize phone numbers.
column (str) – The name of the column containing phone numbers.
’nanp’: ‘NPA-NXX-XXXX’
’e164’: ‘+1NPANXXXXXX’
’national’: ‘(NPA) NXX-XXXX’
(default: ‘nanp’)
’empty’: leave the missing component as is.
’auto’: set the country code to a default value (1).
(default: ‘empty’)
If True, split a column containing a phone number into different columns containing individual components.
If True, enable the progress bar.
>>> df = pd.DataFrame({'phone': ['555-234-5678', '(555) 234-5678', '555.234.5678']})
>>> clean_phone(df, 'phone')
Phone Number Cleaning Report:
    2 values cleaned (66.67%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
            phone   phone_clean
0    555-234-5678  555-234-5678
1  (555) 234-5678  555-234-5678
2    555.234.5678  555-234-5678
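The three output formats for phone numbers can be illustrated with a small formatter. This is a rough sketch rather than the library's parser: `format_nanp` is a hypothetical helper that accepts only 10-digit numbers (or 11 digits with a leading 1), and it always writes the country code in the ‘e164’ form, which is an assumption.

```python
import re
from typing import Optional


def format_nanp(raw: str, output_format: str = "nanp") -> Optional[str]:
    """Format a North American number, or return None if it cannot be parsed."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop the country code
    if len(digits) != 10:
        return None
    npa, nxx, line = digits[:3], digits[3:6], digits[6:]
    if output_format == "nanp":
        return f"{npa}-{nxx}-{line}"
    if output_format == "e164":
        return f"+1{npa}{nxx}{line}"
    if output_format == "national":
        return f"({npa}) {nxx}-{line}"
    raise ValueError(f"unknown output format: {output_format!r}")


for fmt in ("nanp", "e164", "national"):
    print(format_nanp("555.234.5678", fmt))
```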
validate_phone
Validate phone numbers.
x (Union[str, Series]) – pandas Series of phone numbers or a string/int containing a phone number.
>>> validate_phone('1 800 234 6789')
True
>>> df = pd.DataFrame({'phone': [1234567, '1234']})
>>> validate_phone(df['phone'])
0     True
1    False
Name: phone, dtype: bool
Clean a DataFrame column containing text data.
dataprep.clean.clean_text.
clean_text
Clean text data in a DataFrame column.
column (str) – The name of the column containing text data.
pipeline (Optional[List[Dict[str, Any]]]) –
A list of cleaning functions to be applied to the column. If None, use the default pipeline. See the User Guide for more information on customizing the pipeline.
stopwords (Optional[Set[str]]) –
A set of words to be removed from the column. If None, use NLTK’s stopwords.
Clean a column of text data using the default pipeline.
>>> df = pd.DataFrame({"text": ["This show was an amazing, fresh & innovative idea in the 70's when it first aired."]})
>>> clean_text(df, 'text')
                                             text
0  show amazing fresh innovative idea first aired
default_text_pipeline
Return a list of dictionaries representing the functions in the default pipeline. Use as a template for creating a custom pipeline.
>>> default_text_pipeline()
[{'operator': 'fillna'},
 {'operator': 'lowercase'},
 {'operator': 'remove_digits'},
 {'operator': 'remove_html'},
 {'operator': 'remove_urls'},
 {'operator': 'remove_punctuation'},
 {'operator': 'remove_accents'},
 {'operator': 'remove_stopwords', 'parameters': {'stopwords': None}},
 {'operator': 'remove_whitespace'}]
List[Dict[str, Any]]
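The pipeline is a list of dictionaries, each naming an operator and, optionally, its parameters. To show how such a structure can drive text cleaning, here is a toy interpreter. It is not the library's code: the `OPERATORS` registry and `run_pipeline` are hypothetical, and only a handful of the operator names shown above are implemented.

```python
import re
import string
from typing import Any, Callable, Dict, List, Set


def remove_stopwords(text: str, stopwords: Set[str]) -> str:
    """Drop every whitespace-delimited word found in stopwords."""
    return " ".join(w for w in text.split() if w not in stopwords)


# Hypothetical registry mapping operator names to plain functions.
OPERATORS: Dict[str, Callable[..., str]] = {
    "lowercase": lambda text: text.lower(),
    "remove_digits": lambda text: re.sub(r"\d+", "", text),
    "remove_punctuation": lambda text: text.translate(
        str.maketrans("", "", string.punctuation)),
    "remove_stopwords": remove_stopwords,
    "remove_whitespace": lambda text: " ".join(text.split()),
}


def run_pipeline(text: str, pipeline: List[Dict[str, Any]]) -> str:
    """Apply each step in order, passing 'parameters' as keyword arguments."""
    for step in pipeline:
        func = OPERATORS[step["operator"]]
        text = func(text, **step.get("parameters", {}))
    return text


custom_pipeline = [
    {"operator": "lowercase"},
    {"operator": "remove_digits"},
    {"operator": "remove_punctuation"},
    {"operator": "remove_stopwords", "parameters": {"stopwords": {"was", "an"}}},
    {"operator": "remove_whitespace"},
]
print(run_pipeline("This show was an AMAZING idea in 1970!", custom_pipeline))
```

Keeping steps as data rather than code is what makes the default pipeline usable as a template: steps can be reordered, removed, or given different parameters without touching the functions themselves.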
Clean and validate a DataFrame column containing URLs.
dataprep.clean.clean_url.
clean_url
Clean and standardize URLs.
column (str) – The name of the column containing URL addresses.
remove_auth (Union[bool, List[str]]) –
Can be a boolean value or list of strings representing the names of Auth queries to be removed. If True, remove default Auth values. If False, do not remove Auth values.
If True, split the URL into the scheme, hostname, queries, cleaned_url columns. If False, return a column of dictionaries with the relevant information (e.g., scheme, hostname, etc.) as key-value pairs.
Split a URL into its components.
>>> df = pd.DataFrame({'url': ['https://github.com/sfu-db/dataprep','https://www.google.com/']})
>>> clean_url(df, 'url')
URL Cleaning Report:
    2 values parsed (100.0%)
Result contains 2 (100.0%) parsed key-value pairs and 0 null values (0.0%)
                                  url                                        url_details
0  https://github.com/sfu-db/dataprep  {'scheme': 'https', 'host': 'github.com', 'url...
1             https://www.google.com/  {'scheme': 'https', 'host': 'www.google.com', ...
validate_url
Validate URLs.
x (Union[str, Series]) – pandas Series of URLs or string URL.
>>> validate_url('https://github.com/sfu-db/dataprep')
True
>>> df = pd.DataFrame({'url': ['https://www.google.com/', 'NaN']})
>>> validate_url(df['url'])
0     True
1    False
Name: url, dtype: bool
Clean and validate a DataFrame column containing US street addresses.
dataprep.clean.clean_address.
clean_address
Clean and standardize US street addresses.
column (str) – The name of the column containing addresses.
’house_number’: ‘1234’
’street_prefix_abbr’: ‘N’, ‘S’, ‘E’, or ‘W’
’street_prefix_full’: ‘North’, ‘South’, ‘East’, or ‘West’
’street_name’: ‘Main’
’street_suffix_abbr’: ‘St’, ‘Ave’
’street_suffix_full’: ‘Street’, ‘Avenue’
’apartment’: ‘Apt 1’
’building’: ‘Staples Center’
’city’: ‘Los Angeles’
’state_abbr’: ‘CA’
’state_full’: ‘California’
’zipcode’: ‘57903’
The output_format can contain ‘\t’ characters to specify how to split the output into columns.
(default: ‘(building) house_number street_prefix_abbr street_name street_suffix_abbr, apartment, city, state_abbr zipcode’)
must_contain (Tuple[str, …]) –
A tuple containing parts of the address that must be included for the address to be successfully cleaned.
’house_number’: ‘1234’
’street_prefix’: ‘N’, ‘North’
’street_name’: ‘Main’
’street_suffix’: ‘St’, ‘Avenue’
’apartment’: ‘Apt 1’
’building’: ‘Staples Center’
’city’: ‘Los Angeles’
’state’: ‘CA’, ‘California’
’zipcode’: ‘57903’
(default: (‘house_number’, ‘street_name’))
If True, each component of the address specified by the output_format parameter will be put into its own column.
For example if output_format = “house_number street_name” and split = True, then there will be one column for house_number and another for street_name.
Clean addresses and add the house number and street name to separate columns.
>>> df = pd.DataFrame({'address': ['123 pine avenue', '1234 w main st 57033']})
>>> clean_address(df, 'address', output_format='house_number \t street_name')
Address Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
                address house_number street_name
0       123 pine avenue          123        Pine
1  1234 w main st 57033         1234        Main
validate_address
Validate US street addresses.
x (Union[str, Series]) – pandas Series of addresses or a string containing an address.
>>> df = pd.DataFrame({'address': ['123 pine avenue', 'NULL']})
>>> validate_address(df['address'])
0     True
1    False
Name: address, dtype: bool
Clean and validate a DataFrame column containing ISBN numbers.
dataprep.clean.clean_isbn.
clean_isbn
Clean ISBN type data in a DataFrame column.
column (str) – The name of the column containing data of ISBN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators. If output_format = ‘standard’, return string with proper separators. If output_format = ‘isbn13’, return ISBN string with 13 digits. If output_format = ‘isbn10’, return ISBN string with 10 digits.
(default: “standard”)
If True, each component derived from the number string will be put into its own column.
How to handle parsing errors.
‘coerce’: invalid parsing will be set to NaN.
‘ignore’: invalid parsing will return the input.
‘raise’: invalid parsing will raise an exception.
Clean a column of ISBN data.
>>> df = pd.DataFrame({"isbn": ["978-9024538270", "978-9024538271"]})
>>> clean_isbn(df, 'isbn', inplace=True)
          isbn_clean
0  978-90-245-3827-0
1                NaN
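The ‘isbn10’ and ‘isbn13’ output formats rest on two standard check-digit schemes. The sketch below shows both, together with the conversion between them; every function name is hypothetical, hyphenation into the ‘standard’ form is not attempted, and only the 978 prefix is treated as convertible to ISBN-10.

```python
def compact_isbn(isbn: str) -> str:
    """Remove hyphens and spaces, as in the 'compact' output format."""
    return isbn.replace("-", "").replace(" ", "").upper()


def isbn10_check_digit(first9: str) -> str:
    """Weights 10 down to 2; the weighted sum plus the check digit is 0 mod 11."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn13_check_digit(first12: str) -> str:
    """Weights alternate 1, 3; the total including the check digit is 0 mod 10."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def is_valid_isbn(isbn: str) -> bool:
    number = compact_isbn(isbn)
    if len(number) == 10 and number[:9].isdigit():
        return number[9] == isbn10_check_digit(number[:9])
    if len(number) == 13 and number.isdigit():
        return number[12] == isbn13_check_digit(number[:12])
    return False


def to_isbn13(isbn10: str) -> str:
    body = "978" + compact_isbn(isbn10)[:9]
    return body + isbn13_check_digit(body)


def to_isbn10(isbn13: str) -> str:
    number = compact_isbn(isbn13)
    if not number.startswith("978"):
        raise ValueError("only 978-prefixed ISBN-13 numbers have an ISBN-10 form")
    body = number[3:12]
    return body + isbn10_check_digit(body)


print(is_valid_isbn("978-9024538270"), is_valid_isbn("978-9024538271"))
print(to_isbn10("978-9024538270"), to_isbn13("9024538270"))
```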
validate_isbn
Validate if a data cell is ISBN in a DataFrame column. For each cell, return True or False.
df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.
column (str) – The name of the column to be validated.
Union[bool, Series, DataFrame]
Implement clean_ml function
dataprep.clean.clean_ml.
clean_ml
This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application.
training_df (Union[DataFrame, DataFrame]) – Training dataframe. Pandas or Dask DataFrame.
test_df (Union[DataFrame, DataFrame]) – Test dataframe. Pandas or Dask DataFrame.
target (str) – Name of target column. String.
cat_imputation (str) –
The mode of imputation for categorical columns. If it equals “constant”, all missing values are filled with fill_val.
Another mode fills all missing values with the most frequent value.
A further mode drops all categorical columns that contain missing values.
cat_null_value (Optional[List[Any]]) – Specified categorical null values which should be recognized.
fill_val (str) –
num_imputation (str) –
The mode of imputation for numerical columns. If it equals “mean”, all missing values are filled with the mean value.
Another mode fills all missing values with the median value.
A further mode drops all numerical columns that contain missing values.
num_null_value (Optional[List[Any]]) – Specified numerical null values which should be recognized.
cat_encoding (str) – The mode of encoding categorical columns. If it equals “one_hot”, one-hot encoding is applied. If it equals “no_encoding”, nothing is done.
variance_threshold (bool) –
If True, drop numerical columns whose variance is less than variance.
variance (float) – Variance value when variance_threshold = True.
num_scaling (str) – The mode of scaling for numerical columns. If it equals “standardize”, all numerical columns are standardized. If it equals “minmax”, min-max scaling is applied to all numerical columns. If it equals “maxabs”, max-abs scaling is applied to all numerical columns. If it equals “no_scaling”, nothing is done.
include_operators (Optional[List[str]]) – Components included for clean_ml, like “one_hot”, “standardize”, etc.
exclude_operators (Optional[List[str]]) – Components excluded for clean_ml, like “one_hot”, “standardize”, etc.
Tuple[DataFrame, DataFrame]
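The three num_scaling modes correspond to well-known formulas. The following sketch applies them to a plain list of numbers; it assumes the population standard deviation for “standardize” (a common convention, not confirmed by this reference), and none of the function names belong to the library.

```python
import statistics
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def minmax(values: Sequence[float]) -> List[float]:
    """Map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def maxabs(values: Sequence[float]) -> List[float]:
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]


column = [1.0, 2.0, 3.0]
print(standardize(column))
print(minmax(column))
print(maxabs([-4.0, 2.0, 1.0]))
```

A constant column would cause a division by zero in all three functions, which is one reason a variance threshold is useful before scaling.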
format_data_with_customized_cat
This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. A customized categorical pipeline and its related parameters should be provided by the user.
training_row (Series) – One column of training dataset. Dask Series.
test_row (Series) – One column of test dataset. Dask Series.
variance_threshold (bool) – If True, drop numerical columns whose variance is less than variance.
customized_cat_pipeline (Optional[List[Dict[str, Any]]]) – User-specified pipeline managing categorical columns.
Tuple[Series, Series]
format_data_with_customized_cat_and_num
This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. Both customized pipeline managing categorical columns and numerical columns should be provided.
customized_num_pipeline (Optional[List[Dict[str, Any]]]) – User-specified pipeline managing numerical columns.
format_data_with_customized_num
This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. A customized numerical pipeline and its related parameters should be provided by the user.
fill_val (str) – When cat_imputation = “constant”, all missing values are filled with fill_val.
format_data_with_default
This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. No customized pipeline should be provided. Use default pipeline.
Clean and validate a DataFrame column containing Australian Business Numbers (ABNs).
dataprep.clean.clean_au_abn.
clean_au_abn
Clean Australian Business Numbers (ABNs) type data in a DataFrame column.
col – The name of the column containing data of ABN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.
Clean a column of ABN data.
>>> df = pd.DataFrame({"abn": ["51824753556", "99999999999"]})
>>> clean_au_abn(df, 'abn')
           abn       abn_clean
0  51824753556  51 824 753 556
1  99999999999             NaN
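Why the second value above is rejected becomes clearer with the ABN checksum written out. This sketch uses the published ABN rule (subtract 1 from the first digit, apply fixed weights, require a sum divisible by 89); `is_valid_abn` and `format_abn` are hypothetical names, not the library's functions.

```python
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(value: str) -> bool:
    """Check length, digits, and the modulus-89 weighted checksum."""
    digits = "".join(value.split())  # drop any whitespace
    if len(digits) != 11 or not digits.isdigit():
        return False
    numbers = [int(d) for d in digits]
    numbers[0] -= 1  # the rule subtracts 1 from the leading digit
    return sum(w * n for w, n in zip(ABN_WEIGHTS, numbers)) % 89 == 0


def format_abn(value: str, output_format: str = "standard") -> str:
    """'compact' removes whitespace; 'standard' groups digits as 2-3-3-3."""
    digits = "".join(value.split())
    if output_format == "compact":
        return digits
    return " ".join((digits[:2], digits[2:5], digits[5:8], digits[8:]))


print(is_valid_abn("51824753556"), format_abn("51824753556"))
print(is_valid_abn("99999999999"))
```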
validate_au_abn
Validate if a data cell is ABN in a DataFrame column. For each cell, return True or False.
col – The name of the column to be validated.
Clean and validate a DataFrame column containing Australian Company Numbers (ACNs).
dataprep.clean.clean_au_acn.
clean_au_acn
Clean Australian Company Numbers (ACNs) type data in a DataFrame column.
col – The name of the column containing data of ACN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘abn’, convert the number to an Australian Business Number (ABN).
Clean a column of ACN data.
>>> df = pd.DataFrame({"acn": ["004085616", "999 999 999"]})
>>> clean_au_acn(df, 'acn')
           acn    acn_clean
0    004085616  004 085 616
1  999 999 999          NaN
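An ACN carries a single check digit computed from its first eight digits. A minimal sketch of that rule, using hypothetical helper names and the published weights 8 down to 1:

```python
def acn_check_digit(first8: str) -> str:
    """Weight the digits 8..1, then take the 10's complement of the sum mod 10."""
    total = sum((8 - i) * int(d) for i, d in enumerate(first8))
    return str((10 - total % 10) % 10)


def is_valid_acn(value: str) -> bool:
    digits = "".join(value.split())  # drop any whitespace
    return (len(digits) == 9 and digits.isdigit()
            and digits[8] == acn_check_digit(digits[:8]))


def format_acn(value: str) -> str:
    """Group the nine digits as 3-3-3, as in the 'standard' output format."""
    digits = "".join(value.split())
    return " ".join((digits[:3], digits[3:6], digits[6:]))


print(is_valid_acn("004085616"), format_acn("004085616"))
print(is_valid_acn("999 999 999"))
```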
validate_au_acn
Validate if a data cell is ACN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Australian Tax File Numbers (TFNs).
dataprep.clean.clean_au_tfn.
clean_au_tfn
Clean Australian Tax File Numbers (TFNs) type data in a DataFrame column.
col – The name of the column containing data of TFN type.
Clean a column of TFN data.
>>> df = pd.DataFrame({"tfn": ["123456782", "999 999 999"]})
>>> clean_au_tfn(df, 'tfn')
           tfn    tfn_clean
0    123456782  123 456 782
1  999 999 999          NaN
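The TFN check is a weighted sum that must be divisible by 11. The sketch below covers only nine-digit TFNs, which is a simplifying assumption; `is_valid_tfn` is a hypothetical helper.

```python
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def is_valid_tfn(value: str) -> bool:
    """Validate a nine-digit TFN with the modulus-11 weighted checksum."""
    digits = "".join(value.split())  # drop any whitespace
    if len(digits) != 9 or not digits.isdigit():
        return False
    return sum(w * int(d) for w, d in zip(TFN_WEIGHTS, digits)) % 11 == 0


print(is_valid_tfn("123456782"))    # weighted sum is 253 = 23 * 11
print(is_valid_tfn("999 999 999"))  # weighted sum is 477, not a multiple of 11
```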
validate_au_tfn
Validate if a data cell is TFN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Belgian IBANs.
dataprep.clean.clean_be_iban.
clean_be_iban
Clean Belgian IBAN (International Bank Account Number) type data in a DataFrame column.
col – The name of the column containing data of Belgian IBAN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘bic’, return the BIC for the bank that this number refers to.
Clean a column of Belgian IBAN data.
>>> df = pd.DataFrame({"be_iban": ["BE32 123-4567890-02", "BE41091811735141"]})
>>> clean_be_iban(df, 'be_iban')
               be_iban     be_iban_clean
0  BE32 123-4567890-02  BE32123456789002
1     BE41091811735141               NaN
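Validating a Belgian IBAN involves two separate checks, sketched below with hypothetical helper names. The first is the generic IBAN modulus-97 test. The second assumes the Belgian account layout, in which the last two digits equal the first ten digits modulo 97 (with 97 standing in for a remainder of 0).

```python
def iban_mod97_ok(iban: str) -> bool:
    """Generic IBAN test: move 4 chars to the end, map A-Z to 10-35, mod 97 == 1."""
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def is_valid_be_iban(value: str) -> bool:
    # Keep letters and digits only, e.g. 'BE32 123-4567890-02' -> 16 characters.
    iban = "".join(ch for ch in value if ch.isalnum()).upper()
    if len(iban) != 16 or not iban.startswith("BE") or not iban[2:].isdigit():
        return False
    if not iban_mod97_ok(iban):
        return False
    # Assumed national check: last two digits vs. first ten digits mod 97.
    account, check = int(iban[4:14]), int(iban[14:16])
    return check == (account % 97 or 97)


print(is_valid_be_iban("BE32 123-4567890-02"))
print(is_valid_be_iban("BE41091811735141"))
```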
validate_be_iban
Validate if a data cell is Belgian IBAN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Belgian VAT numbers (VATs).
dataprep.clean.clean_be_vat.
clean_be_vat
Clean Belgian VAT numbers (VATs) type data in a DataFrame column.
col – The name of the column containing data of VAT type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.
Clean a column of VAT data.
>>> df = pd.DataFrame({"vat": ['BE403019261', 'BE431150351']})
>>> clean_be_vat(df, 'vat')
           vat   vat_clean
0  BE403019261  0403019261
1  BE431150351         NaN
validate_be_vat
Validate if a data cell is VAT in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Bulgarian national identification numbers (EGNs).
dataprep.clean.clean_bg_egn.
clean_bg_egn
Clean Bulgarian national identification numbers (EGNs) type data in a DataFrame column.
col – The name of the column containing data of EGN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birth date attained from the number. Note: in the case of EGN, the compact format is the same as the standard one.
Clean a column of EGN data.
>>> df = pd.DataFrame({"egn": ['752316 926 3', '7552A10004']})
>>> clean_bg_egn(df, 'egn')
            egn   egn_clean
0  752316 926 3  7523169263
1    7552A10004         NaN
validate_bg_egn
Validate if a data cell is EGN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Bulgarian VAT numbers (VATs).
dataprep.clean.clean_bg_vat.
clean_bg_vat
Clean Bulgarian VAT numbers (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({"vat": ['BG 175 074 752', '175074752', '175074751']})
>>> clean_bg_vat(df, 'vat')
              vat  vat_clean
0  BG 175 074 752  175074752
1       175074752  175074752
2       175074751        NaN
validate_bg_vat
Clean and validate a DataFrame column containing Belarusian UNP numbers (UNPs).
dataprep.clean.clean_by_unp.
clean_by_unp
Clean Belarusian UNP numbers (UNPs) type data in a DataFrame column.
col – The name of the column containing data of UNP type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of UNP, the compact format is the same as the standard one.
Clean a column of UNP data.
>>> df = pd.DataFrame({"unp": ['200988541', 'УНП MA1953684', '200988542']})
>>> clean_by_unp(df, 'unp')
             unp  unp_clean
0      200988541  200988541
1  УНП MA1953684  MA1953684
2      200988542        NaN
validate_by_unp
Validate if a data cell is UNP in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Canadian Business Numbers (BNs).
dataprep.clean.clean_ca_bn.
clean_ca_bn
Clean Canadian Business Numbers (BNs) type data in a DataFrame column.
col – The name of the column containing data of BN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of BN, the compact format is the same as the standard one.
Clean a column of BN data.
>>> df = pd.DataFrame({"bn": ['12302 6635', '12302 6635 RC 0001', '12345678Z']})
>>> clean_ca_bn(df, 'bn')
                   bn         bn_clean
0          12302 6635        123026635
1  12302 6635 RC 0001  123026635RC0001
2           12345678Z              NaN
validate_ca_bn
Validate if a data cell is BN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Swiss EinzahlungsSchein mit Referenznummer (ESRs).
dataprep.clean.clean_ch_esr.
clean_ch_esr
Clean Swiss EinzahlungsSchein mit Referenznummer (ESRs) type data in a DataFrame column.
col – The name of the column containing data of ESR type.
Clean a column of ESR data.
>>> df = pd.DataFrame({"esr": ["18 78583", "210000000003139471430009016"]})
>>> clean_ch_esr(df, 'esr')
                           esr                         esr_clean
0                     18 78583  00 00000 00000 00000 00018 78583
1  210000000003139471430009016                               NaN
validate_ch_esr
Validate if a data cell is ESR in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Swiss social security numbers (SSNs).
dataprep.clean.clean_ch_ssn.
clean_ch_ssn
Clean Swiss social security numbers (SSNs) type data in a DataFrame column.
col – The name of the column containing data of SSN type.
Clean a column of SSN data.
>>> df = pd.DataFrame({"ssn": ['7569217076985', '756.9217.0769.84']})
>>> clean_ch_ssn(df, 'ssn')
                ssn         ssn_clean
0     7569217076985  756.9217.0769.85
1  756.9217.0769.84               NaN
validate_ch_ssn
Validate if a data cell is SSN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Swiss business identifiers (UIDs).
dataprep.clean.clean_ch_uid.
clean_ch_uid
Clean Swiss business identifiers (UIDs) type data in a DataFrame column.
col – The name of the column containing data of UID type.
Clean a column of UID data.
>>> df = pd.DataFrame({"uid": ['CHE100155212', 'CHE-100.155.213']})
>>> clean_ch_uid(df, 'uid')
               uid        uid_clean
0     CHE100155212  CHE-100.155.212
1  CHE-100.155.213              NaN
validate_ch_uid
Validate if a data cell is UID in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Swiss VAT numbers (VATs).
dataprep.clean.clean_ch_vat.
clean_ch_vat
Clean Swiss VAT numbers (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({"vat": ['CHE107787577IVA', 'CHE-107.787.578 IVA']})
>>> clean_ch_vat(df, 'vat')
                   vat            vat_clean
0      CHE107787577IVA  CHE-107.787.577 IVA
1  CHE-107.787.578 IVA                  NaN
validate_ch_vat
Clean and validate a DataFrame column containing Chile RUT/RUN numbers (RUTs).
dataprep.clean.clean_cl_rut.
clean_cl_rut
Clean Chile RUT/RUN numbers (RUTs) type data in a DataFrame column.
col – The name of the column containing data of RUT type.
Clean a column of RUT data.
>>> df = pd.DataFrame({"rut": ["125319092", "76086A28-5"]})
>>> clean_cl_rut(df, 'rut')
          rut     rut_clean
0   125319092  12.531.909-2
1  76086A28-5           NaN
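The final character of a RUT is a modulus-11 check digit, with K standing for 10. The sketch below derives it and formats the number with dots and a hyphen as in the example above. All helper names are hypothetical, and accepting only bodies of seven or eight digits is an assumption.

```python
from itertools import cycle


def compact_rut(value: str) -> str:
    """Remove dots, hyphens, and spaces, and uppercase a trailing 'k'."""
    return value.replace(".", "").replace("-", "").replace(" ", "").upper()


def rut_check_digit(body: str) -> str:
    """Weight digits right to left with 2, 3, ..., 7 repeating, then mod 11."""
    total = sum(int(d) * w for d, w in zip(reversed(body), cycle(range(2, 8))))
    return "0123456789K"[(11 - total % 11) % 11]


def is_valid_rut(value: str) -> bool:
    number = compact_rut(value)
    body, check = number[:-1], number[-1:]
    return (len(number) in (8, 9) and body.isdigit()
            and check == rut_check_digit(body))


def format_rut(value: str) -> str:
    """Render as '12.531.909-2': dotted thousands, hyphen, check digit."""
    number = compact_rut(value)
    body, check = number[:-1], number[-1]
    return f"{int(body):,}".replace(",", ".") + "-" + check


print(is_valid_rut("125319092"), format_rut("125319092"))
print(is_valid_rut("76086A28-5"))
```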
validate_cl_rut
Validate if a data cell is RUT in a DataFrame column. For each cell, return True or False.
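The RUT example can be reproduced with a small standalone sketch. It assumes the usual modulo-11 scheme in which the digits are weighted 2 to 7 (cycling) from the right and a result of 10 is written as K. The helper names are hypothetical and are not part of dataprep.

```python
def rut_check_digit(body: str) -> str:
    """Return the RUT check character for the numeric body."""
    # Weights 2, 3, 4, 5, 6, 7 repeat, starting from the rightmost digit.
    total = sum(int(d) * (2 + i % 6) for i, d in enumerate(reversed(body)))
    return "0123456789K"[(11 - total % 11) % 11]


def clean_rut_sketch(value: str):
    """Hypothetical helper: return the dotted form, or None if invalid."""
    compact = value.replace(".", "").replace("-", "").replace(" ", "").upper()
    body, check = compact[:-1], compact[-1:]
    if not body.isdigit() or rut_check_digit(body) != check:
        return None
    # Group the body in thousands with dots, then append the check character.
    return f"{int(body):,}".replace(",", ".") + "-" + check
```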
Clean and validate a DataFrame column containing Chinese Resident Identity Card Number (RICs).
dataprep.clean.clean_cn_ric.
clean_cn_ric
Clean Chinese Resident Identity Card Number (RICs) type data in a DataFrame column.
col – The name of the column containing data of RIC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birth date of the person. If output_format = ‘birthplace’, return the place of birth of the person. Note: in the case of RIC, the compact format is the same as the standard one.
Clean a column of RIC data.
>>> df = pd.DataFrame({"ric": ["360426199101010071", "99999999999"]})
>>> clean_cn_ric(df, 'ric')
                  ric           ric_clean
0  360426199101010071  360426199101010071
1         99999999999                 NaN
validate_cn_ric
Validate if a data cell is RIC in a DataFrame column. For each cell, return True or False.
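To illustrate the ‘birthdate’ output format described above, here is a minimal sketch that checks the final character of an 18-character RIC and then reads the date of birth from positions 7 to 14. It assumes the widely documented ISO 7064 MOD 11-2 weights, and the function names are hypothetical rather than dataprep's own.

```python
import datetime

# Weights for the first 17 digits and the lookup for the check character.
RIC_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
RIC_CHECK_CHARS = "10X98765432"


def ric_check_char(first17: str) -> str:
    total = sum(w * int(d) for w, d in zip(RIC_WEIGHTS, first17))
    return RIC_CHECK_CHARS[total % 11]


def ric_birthdate(number: str):
    """Hypothetical helper: return the birth date, or None if invalid."""
    if len(number) != 18 or not number[:17].isdigit():
        return None
    if ric_check_char(number[:17]) != number[17].upper():
        return None
    try:
        # Characters 7-14 hold the birth date as YYYYMMDD.
        return datetime.date(int(number[6:10]), int(number[10:12]), int(number[12:14]))
    except ValueError:
        return None
```

For the first sample value this yields 1 January 1991; the second sample is too short and is rejected.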
Clean and validate a DataFrame column containing Colombian identity codes (NITs).
dataprep.clean.clean_co_nit.
clean_co_nit
Clean Colombian identity codes (NITs) type data in a DataFrame column.
col – The name of the column containing data of NIT type.
Clean a column of NIT data.
>>> df = pd.DataFrame({"nit": ["2131234321", "2131234325"]})
>>> clean_co_nit(df, 'nit')
          nit      nit_clean
0  2131234321  213.123.432-1
1  2131234325            NaN
validate_co_nit
Validate if a data cell is NIT in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Costa Rica physical person ID number (CPFs).
dataprep.clean.clean_cr_cpf.
clean_cr_cpf
Clean Costa Rica physical person ID number (CPFs) type data in a DataFrame column.
col – The name of the column containing data of CPF type.
Clean a column of CPF data.
>>> df = pd.DataFrame({"cpf": ["1-613-584", "30-1234-1234"]})
>>> clean_cr_cpf(df, 'cpf')
            cpf     cpf_clean
0     1-613-584  01-0613-0584
1  30-1234-1234           NaN
validate_cr_cpf
Validate if a data cell is CPF in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Costa Rica tax number (CPJs).
dataprep.clean.clean_cr_cpj.
clean_cr_cpj
Clean Costa Rica tax number (CPJs) type data in a DataFrame column.
col – The name of the column containing data of CPJ type.
Clean a column of CPJ data.
>>> df = pd.DataFrame({"cpj": ["4 000 042138", "3-534-123559"]})
>>> clean_cr_cpj(df, 'cpj')
            cpj     cpj_clean
0  4 000 042138  4-000-042138
1  3-534-123559           NaN
validate_cr_cpj
Validate if a data cell is CPJ in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Costa Rica foreigners ID number (CRs).
dataprep.clean.clean_cr_cr.
clean_cr_cr
Clean Costa Rica foreigners ID number (CRs) type data in a DataFrame column.
col – The name of the column containing data of CR type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CR, the compact format is the same as the standard one.
Clean a column of CR data.
>>> df = pd.DataFrame({"cr": ['122200569906', '12345678']})
>>> clean_cr_cr(df, 'cr')
             cr      cr_clean
0  122200569906  122200569906
1      12345678           NaN
validate_cr_cr
Validate if a data cell is CR in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Cuban identity card numbers (NIs).
dataprep.clean.clean_cu_ni.
clean_cu_ni
Clean Cuban identity card numbers (NIs) type data in a DataFrame column.
col – The name of the column containing data of NI type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the date of birth. If output_format = ‘gender’, return the gender (M/F) from the person’s NI. Note: in the case of NI, the compact format is the same as the standard one.
Clean a column of NI data.
>>> df = pd.DataFrame({"ni": ['91021027775', '9102102777A']})
>>> clean_cu_ni(df, 'ni')
            ni     ni_clean
0  91021027775  91021027775
1  9102102777A          NaN
validate_cu_ni
Validate if a data cell is NI in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Cypriot VAT number (VATs).
dataprep.clean.clean_cy_vat.
clean_cy_vat
Clean Cypriot VAT number (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({"vat": ['CY-10259033P', 'CY-10259033Z']})
>>> clean_cy_vat(df, 'vat')
            vat  vat_clean
0  CY-10259033P  10259033P
1  CY-10259033Z        NaN
validate_cy_vat
Clean and validate a DataFrame column containing Czech VAT number (DICs).
dataprep.clean.clean_cz_dic.
clean_cz_dic
Clean Czech VAT number (DICs) type data in a DataFrame column.
col – The name of the column containing data of DIC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of DIC, the compact format is the same as the standard one.
Clean a column of DIC data.
>>> df = pd.DataFrame({"dic": ['CZ 25123891', '25123890']})
>>> clean_cz_dic(df, 'dic')
           dic  dic_clean
0  CZ 25123891   25123891
1     25123890        NaN
validate_cz_dic
Validate if a data cell is DIC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Czech birth numbers (RCs).
dataprep.clean.clean_cz_rc.
clean_cz_rc
Clean Czech birth numbers (RCs) type data in a DataFrame column.
col – The name of the column containing data of RC type.
Clean a column of RC data.
>>> df = pd.DataFrame({"rc": ["7103192745", "7103192746"]})
>>> clean_cz_rc(df, 'rc')
           rc     rc_clean
0  7103192745  710319/2745
1  7103192746          NaN
validate_cz_rc
Validate if a data cell is RC in a DataFrame column. For each cell, return True or False.
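A rough sketch of how a ten-digit Czech birth number could be checked and put into the slash form shown above. It assumes the commonly described rules: the first nine digits modulo 11 (then modulo 10) give the last digit, the month may carry an offset of 20 or 50, and two-digit years below 54 belong to the 2000s. Older nine-digit numbers are not handled, and the function name is hypothetical.

```python
import datetime


def clean_cz_rc_sketch(value: str):
    """Hypothetical helper: return 'YYMMDD/XXXX', or None if invalid."""
    digits = value.replace("/", "").replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        return None
    # Check digit: remainder of the first nine digits mod 11, where 10 maps to 0.
    if int(digits[:9]) % 11 % 10 != int(digits[9]):
        return None
    year = 1900 + int(digits[0:2])
    if year < 1954:
        year += 100
    month = int(digits[2:4]) % 50 % 20  # strip the assumed month offsets
    day = int(digits[4:6])
    try:
        datetime.date(year, month, day)  # reject impossible dates
    except ValueError:
        return None
    return digits[:6] + "/" + digits[6:]
```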
Clean and validate a DataFrame column containing German company registry id (handelsregisternummer).
dataprep.clean.clean_de_handelsregisternummer.
clean_de_handelsregisternummer
Clean German company registry id type data in a DataFrame column.
col – The name of the column containing data of handelsregisternummer type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of handelsregisternummer, the compact format is the same as the standard one.
Clean a column of handelsregisternummer data.
>>> df = pd.DataFrame({"handelsregisternummer": ['Aachen HRA 11223', 'Aachen HRC 44123']})
>>> clean_de_handelsregisternummer(df, 'handelsregisternummer')
   handelsregisternummer  handelsregisternummer_clean
0       Aachen HRA 11223             Aachen HRA 11223
1       Aachen HRC 44123                          NaN
validate_de_handelsregisternummer
Validate if a data cell is handelsregisternummer in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing German personal tax number (IDNRs).
dataprep.clean.clean_de_idnr.
clean_de_idnr
Clean German personal tax number (IDNRs) type data in a DataFrame column.
col – The name of the column containing data of IDNR type.
Clean a column of IDNR data.
>>> df = pd.DataFrame({"idnr": ["36574261809", "36554266806"]})
>>> clean_de_idnr(df, 'idnr')
          idnr      idnr_clean
0  36574261809  36 574 261 809
1  36554266806             NaN
validate_de_idnr
Validate if a data cell is IDNR in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing German tax numbers (STNRs).
dataprep.clean.clean_de_stnr.
clean_de_stnr
Clean German tax numbers (STNRs) type data in a DataFrame column.
col – The name of the column containing data of STNR type.
Clean a column of STNR data.
>>> df = pd.DataFrame({"stnr": ["181/815/0815 5", "136695978"]})
>>> clean_de_stnr(df, 'stnr')
             stnr     stnr_clean
0  181/815/0815 5  181/815/08155
1       136695978            NaN
validate_de_stnr
Validate if a data cell is STNR in a DataFrame column. For each cell, return True or False. The region can be supplied to verify that the number is assigned in that region.
region (Optional[str]) –
Specify the region that the number belongs to.
Clean and validate a DataFrame column containing German VAT numbers (VATs).
dataprep.clean.clean_de_vat.
clean_de_vat
Clean German VAT numbers (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({"vat": ['DE 136,695 976', '136695978']})
>>> clean_de_vat(df, 'vat')
              vat  vat_clean
0  DE 136,695 976  136695976
1       136695978        NaN
validate_de_vat
Clean and validate a DataFrame column containing German Securities Identification Codes (WKNs).
dataprep.clean.clean_de_wkn.
clean_de_wkn
Clean Wertpapierkennnummer (WKNs) type data in a DataFrame column.
col – The name of the column containing data of WKN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘isin’, convert the number to an ISIN. Note: in the case of WKN, the compact format is the same as the standard one.
Clean a column of WKN data.
>>> df = pd.DataFrame({"wkn": ['A0MNRK', 'AOMNRK']})
>>> clean_de_wkn(df, 'wkn')
      wkn  wkn_clean
0  A0MNRK     A0MNRK
1  AOMNRK        NaN
validate_de_wkn
Validate if a data cell is WKN in a DataFrame column. For each cell, return True or False.
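The ‘isin’ output format converts a WKN to an ISIN. The sketch below shows one way such a conversion can work, assuming the ISIN is built as DE, three zeros, the six-character WKN, and a Luhn-style check digit computed after letters are expanded to numbers (A=10 ... Z=35). The function names are hypothetical.

```python
def isin_check_digit(body: str) -> str:
    """Return the ISIN check digit for the 11-character body."""
    # Letters become two-digit numbers (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in body.upper())
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 0:  # double every second digit, starting with the rightmost
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return str((10 - total % 10) % 10)


def wkn_to_isin(wkn: str) -> str:
    """Hypothetical helper: prefix the WKN with 'DE000' and add the check digit."""
    body = "DE000" + wkn.upper()
    return body + isin_check_digit(body)
```

The same check-digit routine applies to ISINs from other countries, which makes it easy to sanity-check against any ISIN already known to be valid.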
Clean and validate a DataFrame column containing Danish citizen number (CPRs).
dataprep.clean.clean_dk_cpr.
clean_dk_cpr
Clean Danish citizen number (CPRs) type data in a DataFrame column.
col – The name of the column containing data of CPR type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, split the number and return the birth date.
Clean a column of CPR data.
>>> df = pd.DataFrame({"cpr": ["2110625629", "511062-5629"]})
>>> clean_dk_cpr(df, 'cpr')
           cpr    cpr_clean
0   2110625629  211062-5629
1  511062-5629          NaN
validate_dk_cpr
Validate if a data cell is CPR in a DataFrame column. For each cell, return True or False.
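For the ‘birthdate’ output format, the number has to be split into day, month and year, and the century has to be inferred. A minimal sketch, assuming the commonly published rule that the seventh digit together with the two-digit year selects the century. The function name is hypothetical, and dataprep's own behaviour may differ in edge cases.

```python
import datetime


def cpr_birth_date(value: str):
    """Hypothetical helper: return the birth date encoded in a CPR, or None."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        return None
    day, month, year = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])
    seventh = digits[6]
    # Assumed century rule based on the seventh digit and the two-digit year.
    if seventh in "5678" and year >= 58:
        year += 1800
    elif seventh in "0123" or (seventh in "49" and year >= 37):
        year += 1900
    else:
        year += 2000
    try:
        return datetime.date(year, month, day)
    except ValueError:
        return None  # e.g. day 51 in the second sample value
```

This also explains the example above: the second value starts with day 51, which is not a real date, so it cannot be a valid CPR.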
Clean and validate a DataFrame column containing Danish CVR number (CVRs).
dataprep.clean.clean_dk_cvr.
clean_dk_cvr
Clean Danish CVR number (CVRs) type data in a DataFrame column.
col – The name of the column containing data of CVR type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CVR, the compact format is the same as the standard one.
Clean a column of CVR data.
>>> df = pd.DataFrame({"cvr": ['DK 13585628', 'DK 13585627']})
>>> clean_dk_cvr(df, 'cvr')
           cvr  cvr_clean
0  DK 13585628   13585628
1  DK 13585627        NaN
validate_dk_cvr
Validate if a data cell is CVR in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Dominican Republic national identifier (Cedulas).
dataprep.clean.clean_do_cedula.
clean_do_cedula
Clean Dominican Republic national identifier (Cedulas) type data in a DataFrame column.
col – The name of the column containing data of Cedula type.
Clean a column of Cedula data.
>>> df = pd.DataFrame({"cedula": ["22400022111", "0011391820A"]})
>>> clean_do_cedula(df, 'cedula')
        cedula   cedula_clean
0  22400022111  224-0002211-1
1  0011391820A            NaN
validate_do_cedula
Validate if a data cell is Cedula in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Dominican Republic invoice numbers (NCFs).
dataprep.clean.clean_do_ncf.
clean_do_ncf
Clean Dominican Republic invoice numbers (NCFs) type data in a DataFrame column.
col – The name of the column containing data of NCF type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of NCF, the compact format is the same as the standard one.
Clean a column of NCF data.
>>> df = pd.DataFrame({"ncf": ['E310000000005', 'Z0100000005']})
>>> clean_do_ncf(df, 'ncf')
             ncf      ncf_clean
0  E310000000005  E310000000005
1    Z0100000005            NaN
validate_do_ncf
Validate if a data cell is NCF in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Dominican Republic tax registration (RNCs).
dataprep.clean.clean_do_rnc.
clean_do_rnc
Clean Dominican Republic tax registration (RNCs) type data in a DataFrame column.
col – The name of the column containing data of RNC type.
Clean a column of RNC data.
>>> df = pd.DataFrame({"rnc": ["131246796", "1018A0043"]})
>>> clean_do_rnc(df, 'rnc')
         rnc     rnc_clean
0  131246796  1-31-24679-6
1  1018A0043           NaN
validate_do_rnc
Validate if a data cell is RNC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Ecuadorian personal identity codes (CIs).
dataprep.clean.clean_ec_ci.
clean_ec_ci
Clean Ecuadorian personal identity codes (CIs) type data in a DataFrame column.
col – The name of the column containing data of CI type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CI, the compact format is the same as the standard one.
Clean a column of CI data.
>>> df = pd.DataFrame({"ci": ['171430710-3', 'BE431150351']})
>>> clean_ec_ci(df, 'ci')
            ci    ci_clean
0  171430710-3  1714307103
1  BE431150351         NaN
validate_ec_ci
Validate if a data cell is CI in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Ecuadorian company tax number (RUCs).
dataprep.clean.clean_ec_ruc.
clean_ec_ruc
Clean Ecuadorian company tax number (RUCs) type data in a DataFrame column.
col – The name of the column containing data of RUC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of RUC, the compact format is the same as the standard one.
Clean a column of RUC data.
>>> df = pd.DataFrame({"ruc": ['1792060346-001', '1763154690001']})
>>> clean_ec_ruc(df, 'ruc')
              ruc      ruc_clean
0  1792060346-001  1792060346001
1   1763154690001            NaN
validate_ec_ruc
Validate if a data cell is RUC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Estonian Personal ID numbers (IKs).
dataprep.clean.clean_ee_ik.
clean_ee_ik
Clean Estonian Personal ID numbers (IKs) type data in a DataFrame column.
col – The name of the column containing data of IK type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of IK, the compact format is the same as the standard one.
Clean a column of IK data.
>>> df = pd.DataFrame({"ik": ['36805280109', '36805280108']})
>>> clean_ee_ik(df, 'ik')
            ik     ik_clean
0  36805280109  36805280109
1  36805280108          NaN
validate_ee_ik
Validate if a data cell is IK in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Estonian KMKR numbers (KMKRs).
dataprep.clean.clean_ee_kmkr.
clean_ee_kmkr
Clean Estonian KMKR numbers (KMKRs) type data in a DataFrame column.
col – The name of the column containing data of KMKR type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of KMKR, the compact format is the same as the standard one.
Clean a column of KMKR data.
>>> df = pd.DataFrame({"kmkr": ['EE 100 931 558', '100594103']})
>>> clean_ee_kmkr(df, 'kmkr')
             kmkr  kmkr_clean
0  EE 100 931 558   100931558
1       100594103         NaN
validate_ee_kmkr
Validate if a data cell is KMKR in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Spanish Bank Account Codes (CCCs).
dataprep.clean.clean_es_ccc.
clean_es_ccc
Clean Spanish Bank Account Codes (CCCs) type data in a DataFrame column.
col – The name of the column containing data of CCC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘iban’, convert the number to an IBAN.
Clean a column of CCC data.
>>> df = pd.DataFrame({"ccc": ["12341234161234567890", "134-1234-16 1234567890"]})
>>> clean_es_ccc(df, 'ccc')
                      ccc                 ccc_clean
0    12341234161234567890  1234 1234 16 12345 67890
1  134-1234-16 1234567890                       NaN
validate_es_ccc
Validate if a data cell is CCC in a DataFrame column. For each cell, return True or False.
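The ‘iban’ output format turns a CCC into an IBAN. A minimal sketch of that conversion, assuming the standard IBAN mod-97 procedure with the country code ES. The function name is hypothetical, and the CCC's own internal check digits are not verified here.

```python
def ccc_to_iban(ccc: str) -> str:
    """Hypothetical helper: build a Spanish IBAN from a 20-digit CCC."""
    digits = "".join(ch for ch in ccc if ch.isdigit())
    if len(digits) != 20:
        raise ValueError("a CCC must contain exactly 20 digits")
    # Append the country code as numbers (E=14, S=28) plus '00', then take mod 97.
    remainder = int(digits + "142800") % 97
    check = 98 - remainder
    return f"ES{check:02d}{digits}"
```

Applied to the first sample value, this produces the same IBAN that appears in the clean_es_iban example in this reference (ES77 followed by the 20 CCC digits).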
Clean and validate a DataFrame column containing Spanish fiscal numbers (CIFs).
dataprep.clean.clean_es_cif.
clean_es_cif
Clean Spanish fiscal numbers (CIFs) type data in a DataFrame column.
col – The name of the column containing data of CIF type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CIF, the compact format is the same as the standard one.
Clean a column of CIF data.
>>> df = pd.DataFrame({"cif": ['A13 585 625', 'M-1234567-L']})
>>> clean_es_cif(df, 'cif')
           cif  cif_clean
0  A13 585 625  A13585625
1  M-1234567-L        NaN
validate_es_cif
Validate if a data cell is CIF in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Spanish meter point numbers (CUPSs).
dataprep.clean.clean_es_cups.
clean_es_cups
Clean Spanish meter point numbers (CUPSs) type data in a DataFrame column.
col – The name of the column containing data of CUPS type.
Clean a column of CUPS data.
>>> df = pd.DataFrame({"cups": ["ES1234123456789012JY1F", "ES 1234-123456789012-XY 1F"]})
>>> clean_es_cups(df, 'cups')
                         cups                    cups_clean
0      ES1234123456789012JY1F  ES 1234 1234 5678 9012 JY 1F
1  ES 1234-123456789012-XY 1F                           NaN
validate_es_cups
Validate if a data cell is CUPS in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Spanish personal identity codes (DNIs).
dataprep.clean.clean_es_dni.
clean_es_dni
Clean Spanish personal identity codes (DNIs) type data in a DataFrame column.
col – The name of the column containing data of DNI type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of DNI, the compact format is the same as the standard one.
Clean a column of DNI data.
>>> df = pd.DataFrame({"dni": ['54362315-K', '54362315']})
>>> clean_es_dni(df, 'dni')
          dni  dni_clean
0  54362315-K  54362315K
1    54362315        NaN
validate_es_dni
Validate if a data cell is DNI in a DataFrame column. For each cell, return True or False.
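The DNI example hinges on the trailing letter. A minimal sketch, assuming the well-known rule that the letter is selected by the eight-digit number modulo 23 from a fixed 23-letter table. The helper names are hypothetical.

```python
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def dni_check_letter(number: str) -> str:
    """Return the control letter for an eight-digit DNI number."""
    return DNI_LETTERS[int(number) % 23]


def clean_dni_sketch(value: str):
    """Hypothetical helper: return the compact DNI, or None if invalid."""
    compact = value.replace("-", "").replace(" ", "").upper()
    number, letter = compact[:-1], compact[-1:]
    if len(number) != 8 or not number.isdigit():
        return None  # e.g. the letter is missing, as in the second sample
    if dni_check_letter(number) != letter:
        return None
    return compact
```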
Clean and validate a DataFrame column containing Spanish IBANs (IBANs).
dataprep.clean.clean_es_iban.
clean_es_iban
Clean Spanish IBANs (IBANs) type data in a DataFrame column.
col – The name of the column containing data of IBAN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘ccc’, return the CCC (Código Cuenta Corriente) part of the number.
Clean a column of IBAN data.
>>> df = pd.DataFrame({"iban": ["ES771234-1234-16 1234567890", "R1601101050000010547023795"]})
>>> clean_es_iban(df, 'iban')
                          iban                     iban_clean
0  ES771234-1234-16 1234567890  ES77 1234 1234 1612 3456 7890
1   R1601101050000010547023795                            NaN
validate_es_iban
Validate if a data cell is IBAN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Spanish foreigner identity codes (NIEs).
dataprep.clean.clean_es_nie.
clean_es_nie
Clean Spanish foreigner identity codes (NIEs) type data in a DataFrame column.
col – The name of the column containing data of NIE type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of NIE, the compact format is the same as the standard one.
Clean a column of NIE data.
>>> df = pd.DataFrame({"nie": ['x-2482300w', 'x-2482300a']})
>>> clean_es_nie(df, 'nie')
          nie  nie_clean
0  x-2482300w  X2482300W
1  x-2482300a        NaN
validate_es_nie
Validate if a data cell is NIE in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Spanish NIF numbers (NIFs).
dataprep.clean.clean_es_nif.
clean_es_nif
Clean Spanish NIF numbers (NIFs) type data in a DataFrame column.
col – The name of the column containing data of NIF type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of NIF, the compact format is the same as the standard one.
Clean a column of NIF data.
>>> df = pd.DataFrame({"nif": ['ES B-58378431', 'B64717839']})
>>> clean_es_nif(df, 'nif')
             nif  nif_clean
0  ES B-58378431  B58378431
1      B64717839        NaN
validate_es_nif
Validate if a data cell is NIF in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing classification for businesses in the European Union (NACE).
dataprep.clean.clean_eu_nace.
clean_eu_nace
Clean classification for businesses in the European Union type data in a DataFrame column.
col – The name of the column containing data of NACE type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘label’, return the category label for the number.
Clean a column of NACE data.
>>> df = pd.DataFrame({"nace": ["6201", "99999999999"]})
>>> clean_eu_nace(df, 'nace')
          nace  nace_clean
0         6201       62.01
1  99999999999         NaN
validate_eu_nace
Validate if a data cell is NACE in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing European VAT numbers (VATs).
dataprep.clean.clean_eu_vat.
clean_eu_vat
Clean European VAT numbers (VATs) type data in a DataFrame column.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘country’, guess the country code based on the number and return the list of valid country codes in lower case. Note: in the case of VAT, the compact format is the same as the standard one.
>>> df = pd.DataFrame({"vat": ['ATU 57194903', 'FR 61 954 506 077']})
>>> clean_eu_vat(df, 'vat')
                 vat      vat_clean
0       ATU 57194903    ATU57194903
1  FR 61 954 506 077  FR61954506077
validate_eu_vat
Clean and validate a DataFrame column containing Finnish ALV numbers (ALVs).
dataprep.clean.clean_fi_alv.
clean_fi_alv
Clean Finnish ALV numbers (ALVs) type data in a DataFrame column.
col – The name of the column containing data of ALV type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of ALV, the compact format is the same as the standard one.
Clean a column of ALV data.
>>> df = pd.DataFrame({"alv": ['FI 20774740', 'FI 20774741']})
>>> clean_fi_alv(df, 'alv')
           alv  alv_clean
0  FI 20774740   20774740
1  FI 20774741        NaN
validate_fi_alv
Validate if a data cell is ALV in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Finnish personal identity codes (HETUs).
dataprep.clean.clean_fi_hetu.
clean_fi_hetu
Clean Finnish personal identity codes (HETUs) type data in a DataFrame column.
col – The name of the column containing data of HETU type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of HETU, the compact format is the same as the standard one.
Clean a column of HETU data.
>>> df = pd.DataFrame({"hetu": ['131052a308t', '131052-308U']})
>>> clean_fi_hetu(df, 'hetu')
          hetu   hetu_clean
0  131052a308t  131052A308T
1  131052-308U          NaN
validate_fi_hetu
Validate if a data cell is HETU in a DataFrame column. For each cell, return True or False.
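A small sketch of the HETU check character, assuming the documented rule that the nine digits (birth date plus individual number) modulo 31 index into a 31-character table. Only the classic century markers '+', '-' and 'A' are accepted here, which is a simplification, and the function name is hypothetical.

```python
import re

HETU_RE = re.compile(r"^(\d{6})([-+A])(\d{3})([0-9A-Z])$")
HETU_CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"


def clean_hetu_sketch(value: str):
    """Hypothetical helper: return the upper-cased HETU, or None if invalid."""
    match = HETU_RE.match(value.strip().upper())
    if match is None:
        return None
    birth, _century, individual, check = match.groups()
    # The check character depends on the date digits and the individual number.
    if HETU_CHECK_CHARS[int(birth + individual) % 31] != check:
        return None
    return match.group(0)
```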
Clean and validate a DataFrame column containing Finnish business identifiers (y-tunnus).
dataprep.clean.clean_fi_ytunnus.
clean_fi_ytunnus
Clean Finnish business identifiers (y-tunnus) type data in a DataFrame column.
col – The name of the column containing data of y-tunnus type.
Clean a column of y-tunnus data.
>>> df = pd.DataFrame({"ytunnus": ["20774740", "2077474-1"]})
>>> clean_fi_ytunnus(df, 'ytunnus')
     ytunnus  ytunnus_clean
0   20774740      2077474-0
1  2077474-1            NaN
validate_fi_ytunnus
Validate if a data cell is y-tunnus in a DataFrame column. For each cell, return True or False.
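The y-tunnus example can be sketched in a few lines, assuming a weighted modulo-11 checksum with weights 7, 9, 10, 5, 8, 4, 2 for the first seven digits and 1 for the check digit. The function name is hypothetical.

```python
def clean_ytunnus_sketch(value: str):
    """Hypothetical helper: return 'NNNNNNN-C', or None if invalid."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 8 or not digits.isdigit():
        return None
    weights = (7, 9, 10, 5, 8, 4, 2, 1)
    # The weighted sum over all eight digits must be divisible by 11.
    if sum(w * int(d) for w, d in zip(weights, digits)) % 11 != 0:
        return None
    return digits[:7] + "-" + digits[7]
```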
Clean and validate a DataFrame column containing French tax identification numbers (NIFs).
dataprep.clean.clean_fr_nif.
clean_fr_nif
Clean French tax identification numbers (NIFs) type data in a DataFrame column.
>>> df = pd.DataFrame({"nif": ["0701987765432", "070198776543"]})
>>> clean_fr_nif(df, 'nif')
             nif          nif_clean
0  0701987765432  07 01 987 765 432
1   070198776543                NaN
validate_fr_nif
Clean and validate a DataFrame column containing French personal identification numbers (NIRs).
dataprep.clean.clean_fr_nir.
clean_fr_nir
Clean French personal identification numbers (NIRs) type data in a DataFrame column.
col – The name of the column containing data of NIR type.
Clean a column of NIR data.
>>> df = pd.DataFrame({"nir": ["295109912611193", "253072C07300443"]})
>>> clean_fr_nir(df, 'nir')
               nir              nir_clean
0  295109912611193  2 95 10 99 126 111 93
1  253072C07300443                    NaN
validate_fr_nir
Validate if a data cell is NIR in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing French company identification numbers (SIRENs).
dataprep.clean.clean_fr_siren.
clean_fr_siren
Clean French company identification numbers (SIRENs) type data in a DataFrame column.
col – The name of the column containing data of SIREN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘tva’, return a TVA that prepends two extra check digits to the data. Note: in the case of SIREN, the compact format is the same as the standard one.
Clean a column of SIREN data.
>>> df = pd.DataFrame({"siren": ['552 008 443', '404833047']})
>>> clean_fr_siren(df, 'siren')
         siren  siren_clean
0  552 008 443    552008443
1    404833047          NaN
validate_fr_siren
Validate if a data cell is SIREN in a DataFrame column. For each cell, return True or False.
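A sketch of the two ideas in this entry: validating a SIREN and deriving a TVA from it. It assumes that a SIREN is nine digits passing the Luhn check and that the two TVA check digits are (12 + 3 × (SIREN mod 97)) mod 97. The function names are hypothetical, and no FR country prefix is added.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 1:  # double every second digit, counting from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0


def siren_to_tva(siren: str) -> str:
    """Hypothetical helper: put the two TVA check digits in front of the SIREN."""
    digits = siren.replace(" ", "")
    if len(digits) != 9 or not digits.isdigit() or not luhn_valid(digits):
        raise ValueError("not a valid SIREN")
    key = (12 + 3 * (int(digits) % 97)) % 97
    return f"{key:02d}{digits}"
```

For instance, the SIREN 303 265 045 maps to 40303265045, which is the cleaned value shown in the clean_fr_tva example of this reference.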
Clean and validate a DataFrame column containing French TVA numbers (TVAs).
dataprep.clean.clean_fr_tva.
clean_fr_tva
Clean French TVA numbers (TVAs) type data in a DataFrame column.
col – The name of the column containing data of TVA type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of TVA, the compact format is the same as the standard one.
Clean a column of TVA data.
>>> df = pd.DataFrame({"tva": ['Fr 40 303 265 045', '84 323 140 391']})
>>> clean_fr_tva(df, 'tva')
                 tva    tva_clean
0  Fr 40 303 265 045  40303265045
1     84 323 140 391          NaN
validate_fr_tva
Validate if a data cell is TVA in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Stock Exchange Daily Official List numbers (SEDOLs).
dataprep.clean.clean_gb_sedol.
clean_gb_sedol
Clean Stock Exchange Daily Official List number in a DataFrame column.
col – The name of the column containing data of SEDOL type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘isin’, convert the number to an ISIN. Note: in the case of SEDOL, the compact format is the same as the standard one.
Clean a column of SEDOL data.
>>> df = pd.DataFrame({"sedol": ['B15KXQ8', 'B15KXQ7']})
>>> clean_gb_sedol(df, 'sedol')
     sedol  sedol_clean
0  B15KXQ8      B15KXQ8
1  B15KXQ7          NaN
validate_gb_sedol
Validate if a data cell is SEDOL in a DataFrame column. For each cell, return True or False.
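The two SEDOL samples above differ only in the final check digit. A minimal sketch of that check, assuming the published weights 1, 3, 1, 7, 3, 9 for the first six characters and letter values A=10 ... Z=35 with vowels excluded. The helper names are hypothetical.

```python
SEDOL_ALPHABET = "0123456789BCDFGHJKLMNPQRSTVWXYZ"  # vowels are not used


def sedol_check_digit(first6: str) -> str:
    """Return the check digit for the first six SEDOL characters."""
    weights = (1, 3, 1, 7, 3, 9)
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    total = sum(w * int(ch, 36) for w, ch in zip(weights, first6.upper()))
    return str((10 - total % 10) % 10)


def is_valid_sedol(value: str) -> bool:
    """Hypothetical helper: check the alphabet, length and check digit."""
    code = value.strip().upper()
    if len(code) != 7 or any(ch not in SEDOL_ALPHABET for ch in code):
        return False
    return sedol_check_digit(code[:6]) == code[6]
```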
Clean and validate a DataFrame column containing English Unique Pupil Numbers (UPNs).
dataprep.clean.clean_gb_upn.
clean_gb_upn
Clean English Unique Pupil Numbers (UPNs) type data in a DataFrame column.
col – The name of the column containing data of UPN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of UPN, the compact format is the same as the standard one.
Clean a column of UPN data.
>>> df = pd.DataFrame({"upn": ['B801200005001', 'A801200005001']})
>>> clean_gb_upn(df, 'upn')
             upn      upn_clean
0  B801200005001  B801200005001
1  A801200005001            NaN
validate_gb_upn
Validate if a data cell is UPN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing United Kingdom Unique Taxpayer Reference (UTRs).
dataprep.clean.clean_gb_utr.
clean_gb_utr
Clean United Kingdom Unique Taxpayer Reference (UTRs) in a DataFrame column.
col – The name of the column containing data of UTR type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of UTR, the compact format is the same as the standard one.
Clean a column of UTR data.
>>> df = pd.DataFrame({"utr": ['1955839661', '2955839661']})
>>> clean_gb_utr(df, 'utr')
          utr   utr_clean
0  1955839661  1955839661
1  2955839661         NaN
validate_gb_utr
Validate if a data cell is UTR in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing United Kingdom VAT numbers (VATs).
dataprep.clean.clean_gb_vat.
clean_gb_vat
Clean United Kingdom VAT numbers (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "vat": [ "980780684", "802311781"] }) >>> clean_gb_vat(df, 'vat') vat vat_clean 0 980780684 980 7806 84 1 802311781 NaN
validate_gb_vat
Clean and validate a DataFrame column containing Greek social security numbers (AMKAs).
dataprep.clean.clean_gr_amka.
clean_gr_amka
Clean Greek social security numbers (AMKAs) type data in a DataFrame column.
col – The name of the column containing data of AMKA type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of AMKA, the compact format is the same as the standard one.
Clean a column of AMKA data.
>>> df = pd.DataFrame({ "amka": [ '01013099997', '01013099999'] }) >>> clean_gr_amka(df, 'amka') amka amka_clean 0 01013099997 01013099997 1 01013099999 NaN
validate_gr_amka
Validate if a data cell is AMKA in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Greek VAT numbers (VATs).
dataprep.clean.clean_gr_vat.
clean_gr_vat
Clean Greek VAT numbers (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "vat": [ 'EL 094259216', 'EL 123456781'] }) >>> clean_gr_vat(df, 'vat') vat vat_clean 0 EL 094259216 094259216 1 EL 123456781 NaN
validate_gr_vat
Clean and validate a DataFrame column containing Guatemala tax numbers (NITs).
dataprep.clean.clean_gt_nit.
clean_gt_nit
Clean Guatemala tax numbers (NITs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "nit": [ "39525503", "8977112-0",] }) >>> clean_gt_nit(df, 'nit') nit nit_clean 0 39525503 3952550-3 1 8977112-0 NaN
validate_gt_nit
Clean and validate a DataFrame column containing Croatian identification numbers (OIBs).
dataprep.clean.clean_hr_oib.
clean_hr_oib
Clean Croatian identification numbers (OIBs) type data in a DataFrame column.
col – The name of the column containing data of OIB type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of OIB, the compact format is the same as the standard one.
Clean a column of OIB data.
>>> df = pd.DataFrame({ "oib": [ 'HR 33392005961', '33392005962',] }) >>> clean_hr_oib(df, 'oib') oib oib_clean 0 HR 33392005961 33392005961 1 33392005962 NaN
validate_hr_oib
Validate if a data cell is OIB in a DataFrame column. For each cell, return True or False.
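The OIB example can be checked by hand with the ISO 7064 Mod 11,10 scheme, which is the check commonly documented for OIB. The sketch below is an independent illustration under that assumption; oib_is_valid is a hypothetical name, not part of DataPrep.

```python
def oib_is_valid(number: str) -> bool:
    """Hypothetical helper: ISO 7064 Mod 11,10 check over an 11-digit OIB."""
    number = number.upper().replace(" ", "")
    # Drop an optional country prefix, as in 'HR 33392005961'.
    if number.startswith("HR"):
        number = number[2:]
    if len(number) != 11 or not (number.isascii() and number.isdigit()):
        return False
    check = 10
    for digit in number[:-1]:
        # Add the digit modulo 10, treating a result of 0 as 10.
        check = (check + int(digit)) % 10 or 10
        # Double the intermediate value modulo 11.
        check = (check * 2) % 11
    return (11 - check) % 10 == int(number[-1])


print(oib_is_valid("HR 33392005961"))
print(oib_is_valid("33392005962"))
```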
Clean and validate a DataFrame column containing Hungarian ANUM numbers (ANUMs).
dataprep.clean.clean_hu_anum.
clean_hu_anum
Clean Hungarian ANUM numbers (ANUMs) type data in a DataFrame column.
col – The name of the column containing data of ANUM type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of ANUM, the compact format is the same as the standard one.
Clean a column of ANUM data.
>>> df = pd.DataFrame({ "anum": [ 'HU-12892312', 'HU-12892313',] }) >>> clean_hu_anum(df, 'anum') anum anum_clean 0 HU-12892312 12892312 1 HU-12892313 NaN
validate_hu_anum
Validate if a data cell is ANUM in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Indonesian VAT Numbers (NPWPs).
dataprep.clean.clean_id_npwp.
clean_id_npwp
Clean Indonesian VAT Numbers (NPWPs) type data in a DataFrame column.
col – The name of the column containing data of NPWP type.
Clean a column of NPWP data.
>>> df = pd.DataFrame({ "npwp": [ "013000666091000", "123456789",] }) >>> clean_id_npwp(df, 'npwp') npwp npwp_clean 0 013000666091000 01.300.066.6-091.000 1 123456789 NaN
validate_id_npwp
Validate if a data cell is NPWP in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Irish personal numbers (PPSs).
dataprep.clean.clean_ie_pps.
clean_ie_pps
Clean Irish personal numbers (PPSs) type data in a DataFrame column.
col – The name of the column containing data of PPS type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of PPS, the compact format is the same as the standard one.
Clean a column of PPS data.
>>> df = pd.DataFrame({ "pps": [ '6433435OA', '6433435VH',] }) >>> clean_ie_pps(df, 'pps') pps pps_clean 0 6433435OA 6433435OA 1 6433435VH NaN
validate_ie_pps
Validate if a data cell is PPS in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Irish VAT numbers (VATs).
dataprep.clean.clean_ie_vat.
clean_ie_vat
Clean Irish VAT numbers (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "vat": [ 'IE 6433435OA', '6433435E',] }) >>> clean_ie_vat(df, 'vat') vat vat_clean 0 IE 6433435OA 6433435OA 1 6433435E NaN
validate_ie_vat
Clean and validate a DataFrame column containing Israeli company numbers (HPs).
dataprep.clean.clean_il_hp.
clean_il_hp
Clean Israeli company numbers (HPs) type data in a DataFrame column.
col – The name of the column containing data of HP type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of HP, the compact format is the same as the standard one.
Clean a column of HP data.
>>> df = pd.DataFrame({ "hp": [ ' 5161 79157 ', '516179150',] }) >>> clean_il_hp(df, 'hp') hp hp_clean 0 5161 79157 516179157 1 516179150 NaN
validate_il_hp
Validate if a data cell is HP in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Israeli personal numbers (IDNRs).
dataprep.clean.clean_il_idnr.
clean_il_idnr
Clean Israeli personal numbers (IDNRs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "idnr": [ "39337423", "3933742-2",] }) >>> clean_il_idnr(df, 'idnr') idnr idnr_clean 0 39337423 03933742-3 1 3933742-2 NaN
validate_il_idnr
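The IDNR example shows two steps: the number is left-padded to nine digits, and the last digit is a check digit. A minimal sketch of that behaviour follows, assuming the check is the Luhn algorithm; luhn_is_valid and format_idnr are hypothetical names and the code is not DataPrep's implementation.

```python
from typing import Optional


def luhn_is_valid(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def format_idnr(number: str) -> Optional[str]:
    """Hypothetical helper: return the 'standard' form, or None if invalid."""
    compact = number.replace("-", "").replace(" ", "")
    if not (compact.isascii() and compact.isdigit()) or len(compact) > 9:
        return None
    compact = compact.zfill(9)  # pad with leading zeros to nine digits
    if not luhn_is_valid(compact):
        return None
    return f"{compact[:-1]}-{compact[-1]}"


print(format_idnr("39337423"))
print(format_idnr("3933742-2"))
```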
Clean and validate a DataFrame column containing Indian digital resident personal identity numbers (Aadhaars).
dataprep.clean.clean_in_aadhaar.
clean_in_aadhaar
Clean Indian digital resident personal identity numbers (Aadhaars) in a DataFrame column.
col – The name of the column containing data of Aadhaar type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘mask’, mask the first 8 digits as per MeitY guidelines for securing identity information and sensitive personal data.
Clean a column of Aadhaar data.
>>> df = pd.DataFrame({ "aadhaar": [ "234123412346", "643343121",] }) >>> clean_in_aadhaar(df, 'aadhaar') aadhaar aadhaar_clean 0 234123412346 2341 2341 2346 1 643343121 NaN
validate_in_aadhaar
Validate if a data cell is Aadhaar in a DataFrame column. For each cell, return True or False.
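To make the difference between the ‘standard’ and ‘mask’ formats concrete, the sketch below groups a 12-digit Aadhaar into blocks of four and hides the first eight digits. It is only an illustration: the helper names are hypothetical, the placeholder character 'X' is an assumption, and no checksum validation is performed.

```python
def _group_by_four(text: str) -> str:
    """Split a string into space-separated blocks of four characters."""
    return " ".join(text[i:i + 4] for i in range(0, len(text), 4))


def aadhaar_standard(number: str) -> str:
    """Hypothetical helper: digits only, grouped as 'dddd dddd dddd'."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return _group_by_four(digits)


def aadhaar_mask(number: str, placeholder: str = "X") -> str:
    """Hypothetical helper: hide the first 8 digits, keep the last 4."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return _group_by_four(placeholder * 8 + digits[8:])


print(aadhaar_standard("234123412346"))
print(aadhaar_mask("234123412346"))
```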
Clean and validate a DataFrame column containing Indian Permanent Account numbers (PANs).
dataprep.clean.clean_in_pan.
clean_in_pan
Clean Indian Permanent Account numbers (PANs) type data in a DataFrame column.
col – The name of the column containing data of PAN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘info’, return a dictionary containing information that can be decoded from the PAN. If output_format = ‘mask’, mask the PAN as per CBDT masking standard. Note: in the case of PAN, the compact format is the same as the standard one.
Clean a column of PAN data.
>>> df = pd.DataFrame({ "pan": [ 'ACUPA7085R', '234123412347',] }) >>> clean_in_pan(df, 'pan') pan pan_clean 0 ACUPA7085R ACUPA7085R 1 234123412347 NaN
validate_in_pan
Validate if a data cell is PAN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Icelandic identity codes (Kennitalas).
dataprep.clean.clean_is_kennitala.
clean_is_kennitala
Clean Icelandic identity codes (Kennitalas) type data in a DataFrame column.
col – The name of the column containing data of Kennitala type.
Clean a column of Kennitala data.
>>> df = pd.DataFrame({ "kennitala": [ "1201743399", "320174-3399",] }) >>> clean_is_kennitala(df, 'kennitala') kennitala kennitala_clean 0 1201743399 120174-3399 1 320174-3399 NaN
validate_is_kennitala
Validate if a data cell is Kennitala in a DataFrame column. For each cell, return True or False.
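The Kennitala example can be followed with a short weighted mod-11 sketch in which the ninth digit is the check digit. The weights are an assumption based on the commonly documented Kennitala check, format_kennitala is a hypothetical name, and the sketch does not verify that the first six digits form a real date.

```python
from typing import Optional

# Assumed weights for the first eight digits; the ninth digit is the check digit.
KENNITALA_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)


def format_kennitala(number: str) -> Optional[str]:
    """Hypothetical helper: return 'ddmmyy-nnnn', or None if the check fails."""
    compact = number.replace("-", "").replace(" ", "")
    if len(compact) != 10 or not (compact.isascii() and compact.isdigit()):
        return None
    total = sum(int(d) * w for d, w in zip(compact, KENNITALA_WEIGHTS))
    check = (11 - total % 11) % 11
    # A computed value of 10 cannot be written as a single digit.
    if check == 10 or check != int(compact[8]):
        return None
    return f"{compact[:6]}-{compact[6:]}"


print(format_kennitala("1201743399"))
print(format_kennitala("320174-3399"))
```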
Clean and validate a DataFrame column containing Icelandic VSK numbers (VSKs).
dataprep.clean.clean_is_vsk.
clean_is_vsk
Clean Icelandic VSK numbers (VSKs) type data in a DataFrame column.
col – The name of the column containing data of VSK type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VSK, the compact format is the same as the standard one.
Clean a column of VSK data.
>>> df = pd.DataFrame({ "vsk": [ 'IS 00621', 'IS 0062199',] }) >>> clean_is_vsk(df, 'vsk') vsk vsk_clean 0 IS 00621 00621 1 IS 0062199 NaN
validate_is_vsk
Validate if a data cell is VSK in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Italian code for identification of drugs (AICs).
dataprep.clean.clean_it_aic.
clean_it_aic
Clean Italian code for identification of drugs (AICs) type data in a DataFrame column.
col – The name of the column containing data of AIC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘base10’, convert a BASE32 representation to a BASE10 one. If output_format = ‘base32’, convert a BASE10 representation to a BASE32 one. Note: in the case of AIC, the compact format is the same as the standard one, and ‘compact’ output may contain both BASE10 and BASE32 representations.
Clean a column of AIC data.
>>> df = pd.DataFrame({ "aic": [ '000307052', '999999',] }) >>> clean_it_aic(df, 'aic') aic aic_clean 0 000307052 000307052 1 999999 NaN
validate_it_aic
Validate if a data cell is AIC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Italian fiscal codes (Codice Fiscales).
dataprep.clean.clean_it_codicefiscale.
clean_it_codicefiscale
Clean Italian fiscal codes (Codice Fiscales) type data in a DataFrame column.
col – The name of the column containing data of Codice Fiscale type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of Codice Fiscale, the compact format is the same as the standard one.
Clean a column of Codice Fiscale data.
>>> df = pd.DataFrame({ "codicefiscale": [ 'RCCMNL83S18D969H', 'RCCMNL83S18D969'] }) >>> clean_it_codicefiscale(df, 'codicefiscale') codicefiscale codicefiscale_clean 0 RCCMNL83S18D969H RCCMNL83S18D969H 1 RCCMNL83S18D969 NaN
validate_it_codicefiscale
Validate if a data cell is Codice Fiscale in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Italian IVA numbers (IVAs).
dataprep.clean.clean_it_iva.
clean_it_iva
Clean Italian IVA numbers (IVAs) type data in a DataFrame column.
col – The name of the column containing data of IVA type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of IVA, the compact format is the same as the standard one.
Clean a column of IVA data.
>>> df = pd.DataFrame({ "iva": [ 'IT 00743110157', '00743110158',] }) >>> clean_it_iva(df, 'iva') iva iva_clean 0 IT 00743110157 00743110157 1 00743110158 NaN
validate_it_iva
Validate if a data cell is IVA in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Japanese Corporate Numbers (CNs).
dataprep.clean.clean_jp_cn.
clean_jp_cn
Clean Japanese Corporate Numbers (CNs) type data in a DataFrame column.
col – The name of the column containing data of CN type.
Clean a column of CN data.
>>> df = pd.DataFrame({ "cn": [ "5835678256246", "2-8356-7825-6246",] }) >>> clean_jp_cn(df, 'cn') cn cn_clean 0 5835678256246 5-8356-7825-6246 1 2-8356-7825-6246 NaN
validate_jp_cn
Validate if a data cell is CN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing South Korea Business Registration Numbers (BRNs).
dataprep.clean.clean_kr_brn.
clean_kr_brn
Clean South Korea Business Registration Numbers (BRNs) type data in a DataFrame column.
col – The name of the column containing data of BRN type.
Clean a column of BRN data.
>>> df = pd.DataFrame({ "brn": [ "1348672683", "123456789",] }) >>> clean_kr_brn(df, 'brn') brn brn_clean 0 1348672683 134-86-72683 1 123456789 NaN
validate_kr_brn
Validate if a data cell is BRN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing South Korean resident registration numbers (RRNs).
dataprep.clean.clean_kr_rrn.
clean_kr_rrn
Clean South Korean resident registration numbers (RRNs) type data in a DataFrame column.
col – The name of the column containing data of RRN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate.
Clean a column of RRN data.
>>> df = pd.DataFrame({ "rrn": [ "971013-9019902", "971013-9019903",] }) >>> clean_kr_rrn(df, 'rrn') rrn rrn_clean 0 971013-9019902 971013-9019902 1 971013-9019903 NaN
validate_kr_rrn
Validate if a data cell is RRN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Liechtenstein tax code for individuals and entities (PEIDs).
dataprep.clean.clean_li_peid.
clean_li_peid
Clean Liechtenstein tax code for individuals and entities data in a DataFrame column.
col – The name of the column containing data of PEID type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of PEID, the compact format is the same as the standard one.
Clean a column of PEID data.
>>> df = pd.DataFrame({ "peid": [ '00001234567', '00001234568913454545',] }) >>> clean_li_peid(df, 'peid') peid peid_clean 0 00001234567 1234567 1 00001234568913454545 NaN
validate_li_peid
Validate if a data cell is PEID in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Lithuanian personal numbers (Asmens kodas).
dataprep.clean.clean_lt_asmens.
clean_lt_asmens
Clean Lithuanian personal numbers (Asmens kodas) type data in a DataFrame column.
col – The name of the column containing data of Asmens kodas type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birthdate of the person. Note: in the case of Asmens kodas, the compact format is the same as the standard one.
Clean a column of Asmens kodas data.
>>> df = pd.DataFrame({ "asmens": [ '33309240064', '33309240164',] }) >>> clean_lt_asmens(df, 'asmens') asmens asmens_clean 0 33309240064 33309240064 1 33309240164 NaN
validate_lt_asmens
Validate if a data cell is Asmens kodas in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Lithuanian PVM numbers (PVMs).
dataprep.clean.clean_lt_pvm.
clean_lt_pvm
Clean Lithuanian PVM numbers (PVMs) type data in a DataFrame column.
col – The name of the column containing data of PVM type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of PVM, the compact format is the same as the standard one.
Clean a column of PVM data.
>>> df = pd.DataFrame({ "pvm": [ '119511515', '100001919018',] }) >>> clean_lt_pvm(df, 'pvm') pvm pvm_clean 0 119511515 119511515 1 100001919018 NaN
validate_lt_pvm
Validate if a data cell is PVM in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Luxembourgian TVA numbers (TVAs).
dataprep.clean.clean_lu_tva.
clean_lu_tva
Clean Luxembourgian TVA numbers (TVAs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "tva": [ 'LU 150 274 42', '150 274 43',] }) >>> clean_lu_tva(df, 'tva') tva tva_clean 0 LU 150 274 42 15027442 1 150 274 43 NaN
validate_lu_tva
Clean and validate a DataFrame column containing Latvian PVN (VAT) numbers (PVNs).
dataprep.clean.clean_lv_pvn.
clean_lv_pvn
Clean Latvian PVN (VAT) numbers (PVNs) type data in a DataFrame column.
col – The name of the column containing data of PVN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birthdate of the person; this format is only available when the PVN refers to a person rather than a legal entity. Note: in the case of PVN, the compact format is the same as the standard one.
Clean a column of PVN data.
>>> df = pd.DataFrame({ "pvn": [ '161175-19997', '40003521601',] }) >>> clean_lv_pvn(df, 'pvn') pvn pvn_clean 0 161175-19997 16117519997 1 40003521601 NaN
validate_lv_pvn
Validate if a data cell is PVN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Monacan TVA numbers (TVAs).
dataprep.clean.clean_mc_tva.
clean_mc_tva
Clean Monacan TVA numbers (TVAs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "tva": [ '53 0000 04605', 'FR 61 954 506 077',] }) >>> clean_mc_tva(df, 'tva') tva tva_clean 0 53 0000 04605 FR53000004605 1 FR 61 954 506 077 NaN
validate_mc_tva
Clean and validate a DataFrame column containing Moldavian company identification numbers (IDNOs).
dataprep.clean.clean_md_idno.
clean_md_idno
Clean Moldavian company identification numbers (IDNOs) type data in a DataFrame column.
col – The name of the column containing data of IDNO type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of IDNO, the compact format is the same as the standard one.
Clean a column of IDNO data.
>>> df = pd.DataFrame({ "idno": [ '1008600038413', '1008600038412',] }) >>> clean_md_idno(df, 'idno') idno idno_clean 0 1008600038413 1008600038413 1 1008600038412 NaN
validate_md_idno
Validate if a data cell is IDNO in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Montenegro IBANs (IBANs).
dataprep.clean.clean_me_iban.
clean_me_iban
Clean Montenegro IBANs (IBANs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "iban": [ "ME25510000000006234133", "ME52510000000006234132",] }) >>> clean_me_iban(df, 'iban') iban iban_clean 0 ME25510000000006234133 ME 2551 0000 0000 0623 4133 1 ME52510000000006234132 NaN
validate_me_iban
Clean and validate a DataFrame column containing Maltese VAT numbers (VATs).
dataprep.clean.clean_mt_vat.
clean_mt_vat
Clean Maltese VAT numbers (VATs) type data in a DataFrame column.
>>> df = pd.DataFrame({ "vat": [ 'MT 1167-9112', '1167-9113',] }) >>> clean_mt_vat(df, 'vat') vat vat_clean 0 MT 1167-9112 11679112 1 1167-9113 NaN
validate_mt_vat
Clean and validate a DataFrame column containing Mauritian national ID numbers (NIDs).
dataprep.clean.clean_mu_nid.
clean_mu_nid
Clean Mauritian national ID numbers (NIDs) type data in a DataFrame column.
col – The name of the column containing data of NID type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birthdate of the person. Note: in the case of NID, the compact format is the same as the standard one.
Clean a column of NID data.
>>> df = pd.DataFrame({ "nid": [ 'J2906201304089', 'J2906201304088',] }) >>> clean_mu_nid(df, 'nid') nid nid_clean 0 J2906201304089 J2906201304089 1 J2906201304088 NaN
validate_mu_nid
Validate if a data cell is NID in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Mexican personal identifiers (CURPs).
dataprep.clean.clean_mx_curp.
clean_mx_curp
Clean Mexican personal identifiers (CURPs) type data in a DataFrame column.
col – The name of the column containing data of CURP type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of CURP, the compact format is the same as the standard one.
Clean a column of CURP data.
>>> df = pd.DataFrame({ "curp": [ 'BOXW310820HNERXN09', 'BOXW310820HNERXN08'] }) >>> clean_mx_curp(df, 'curp') curp curp_clean 0 BOXW310820HNERXN09 BOXW310820HNERXN09 1 BOXW310820HNERXN08 NaN
validate_mx_curp
Validate if a data cell is CURP in a DataFrame column. For each cell, return True or False.
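The ‘birthdate’ and ‘gender’ formats rely on fields embedded in the CURP. The sketch below shows one way to decode them, assuming the usual CURP layout: characters 5 to 10 hold YYMMDD, character 11 is 'H' (male) or 'M' (female), and a letter in position 17 marks a birth year of 2000 or later. The function name is hypothetical, the layout is an assumption rather than DataPrep's code, and the check digit is not verified.

```python
from datetime import date
from typing import Tuple


def curp_birthdate_and_gender(curp: str) -> Tuple[date, str]:
    """Hypothetical helper: decode (birthdate, 'M' or 'F') from a CURP."""
    curp = curp.strip().upper()
    if len(curp) != 18:
        raise ValueError("a CURP has 18 characters")
    year, month, day = int(curp[4:6]), int(curp[6:8]), int(curp[8:10])
    # Assumed rule: a letter in position 17 means the person was born in 2000+.
    century = 2000 if curp[16].isalpha() else 1900
    # 'H' (hombre) maps to 'M', 'M' (mujer) maps to 'F'.
    gender = {"H": "M", "M": "F"}[curp[10]]
    return date(century + year, month, day), gender


print(curp_birthdate_and_gender("BOXW310820HNERXN09"))
```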
Clean and validate a DataFrame column containing Mexican tax numbers (RFCs).
dataprep.clean.clean_mx_rfc.
clean_mx_rfc
Clean Mexican tax numbers (RFCs) type data in a DataFrame column.
col – The name of the column containing data of RFC type.
Clean a column of RFC data.
>>> df = pd.DataFrame({ "rfc": [ "GODE561231GR8", "BUEI591231GH9",] }) >>> clean_mx_rfc(df, 'rfc') rfc rfc_clean 0 GODE561231GR8 GODE 561231 GR8 1 BUEI591231GH9 NaN
validate_mx_rfc
Validate if a data cell is RFC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Malaysian National Registration Identity Card Numbers (NRICs).
dataprep.clean.clean_my_nric.
clean_my_nric
Clean Malaysian National Registration Identity Card Numbers (NRICs) in a DataFrame column.
col – The name of the column containing data of NRIC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the registration date or the birth date. If output_format = ‘birthplace’, return a dict containing the birthplace of the person.
Clean a column of NRIC data.
>>> df = pd.DataFrame({ "nric": [ "770305021234", "771305-02-1234",] }) >>> clean_my_nric(df, 'nric') nric nric_clean 0 770305021234 770305-02-1234 1 771305-02-1234 NaN
validate_my_nric
Validate if a data cell is NRIC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Dutch Brin numbers (BRINs).
dataprep.clean.clean_nl_brin.
clean_nl_brin
Clean Dutch Brin numbers (BRINs) type data in a DataFrame column.
col – The name of the column containing data of BRIN type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of BRIN, the compact format is the same as the standard one.
Clean a column of BRIN data.
>>> df = pd.DataFrame({ "brin": [ '05 KO', '30AJ0A',] }) >>> clean_nl_brin(df, 'brin') brin brin_clean 0 05 KO 05KO 1 30AJ0A NaN
validate_nl_brin
Validate if a data cell is BRIN in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Dutch BTW numbers (BTWs).
dataprep.clean.clean_nl_btw.
clean_nl_btw
Clean Dutch BTW numbers (BTWs) type data in a DataFrame column.
col – The name of the column containing data of BTW type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of BTW, the compact format is the same as the standard one.
Clean a column of BTW data.
>>> df = pd.DataFrame({ "btw": [ '004495445B01', '123456789B90',] }) >>> clean_nl_btw(df, 'btw') btw btw_clean 0 004495445B01 004495445B01 1 123456789B90 NaN
validate_nl_btw
Validate if a data cell is BTW in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Norwegian IBANs (IBANs).
dataprep.clean.clean_no_iban.
clean_no_iban
Clean Norwegian IBANs (IBANs) type data in a DataFrame column.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘kontonr’, return the Norwegian bank account part of the number.
>>> df = pd.DataFrame({ "iban": [ 'NO9386011117947', 'NO92 8601 1117 947',] }) >>> clean_no_iban(df, 'iban') iban iban_clean 0 NO9386011117947 NO93 8601 1117 947 1 NO92 8601 1117 947 NaN
validate_no_iban
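The three output formats for Norwegian IBANs can be illustrated with the generic IBAN mod-97 check. This is a minimal sketch under stated assumptions: the function names are hypothetical, the ‘kontonr’ value is returned here as bare digits, and the embedded account number's own check digit is not verified, so it is not a substitute for clean_no_iban.

```python
from typing import Dict, Optional


def iban_mod97_ok(compact: str) -> bool:
    """Generic IBAN check: move the first 4 chars to the end, map letters
    to numbers (A=10 ... Z=35), and test that the result mod 97 equals 1."""
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def no_iban_formats(iban: str) -> Optional[Dict[str, str]]:
    """Hypothetical helper: return the three formats, or None if invalid."""
    compact = iban.replace(" ", "").upper()
    if len(compact) != 15 or not compact.startswith("NO"):
        return None
    if not (compact.isascii() and compact.isalnum()):
        return None
    if not iban_mod97_ok(compact):
        return None
    return {
        "compact": compact,
        "standard": " ".join(compact[i:i + 4] for i in range(0, 15, 4)),
        "kontonr": compact[4:],  # the 11-digit account number part
    }


print(no_iban_formats("NO9386011117947"))
print(no_iban_formats("NO92 8601 1117 947"))
```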
Clean and validate a DataFrame column containing Norwegian bank account numbers (kontonrs).
dataprep.clean.clean_no_kontonr.
clean_no_kontonr
Clean Norwegian bank account numbers (kontonrs) type data in a DataFrame column.
col – The name of the column containing data of kontonr type.
Clean a column of kontonr data.
>>> df = pd.DataFrame({ "kontonr": [ "8601 11 17947", "8601 11 17949",] }) >>> clean_no_kontonr(df, 'kontonr') kontonr kontonr_clean 0 8601 11 17947 8601.11.17947 1 8601 11 17949 NaN
validate_no_kontonr
Validate if a data cell is kontonr in a DataFrame column. For each cell, return True or False.
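The kontonr example can be reproduced with a weighted mod-11 check on the 11-digit form. The weights below are an assumption based on the commonly documented Norwegian account number check, format_kontonr is a hypothetical name, and shorter account number variants are not handled.

```python
from typing import Optional

# Assumed weights for the first ten digits; the eleventh is the check digit.
KONTONR_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)


def format_kontonr(number: str) -> Optional[str]:
    """Hypothetical helper: return 'dddd.dd.ddddd', or None if invalid."""
    compact = "".join(ch for ch in number if ch not in " .")
    if len(compact) != 11 or not (compact.isascii() and compact.isdigit()):
        return None
    total = sum(int(d) * w for d, w in zip(compact, KONTONR_WEIGHTS))
    check = (11 - total % 11) % 11
    # A computed value of 10 cannot be written as a single digit.
    if check == 10 or check != int(compact[-1]):
        return None
    return f"{compact[:4]}.{compact[4:6]}.{compact[6:]}"


print(format_kontonr("8601 11 17947"))
print(format_kontonr("8601 11 17949"))
```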
Clean and validate a DataFrame column containing Norwegian VAT numbers (MVAs).
dataprep.clean.clean_no_mva.
clean_no_mva
Clean Norwegian VAT numbers (MVAs) type data in a DataFrame column.
col – The name of the column containing data of MVA type.
Clean a column of MVA data.
>>> df = pd.DataFrame({ "mva": [ "995525828MVA", "NO 995 525 829 MVA",] }) >>> clean_no_mva(df, 'mva') mva mva_clean 0 995525828MVA NO 995 525 828 MVA 1 NO 995 525 829 MVA NaN
validate_no_mva
Validate if a data cell is MVA in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Norwegian organisation numbers (Orgnrs).
dataprep.clean.clean_no_orgnr.
clean_no_orgnr
Clean Norwegian organisation numbers (Orgnrs) type data in a DataFrame column.
col – The name of the column containing data of Orgnr type.
Clean a column of Orgnr data.
>>> df = pd.DataFrame({ "orgnr": [ "988077917", "988 077 918",] }) >>> clean_no_orgnr(df, 'orgnr') orgnr orgnr_clean 0 988077917 988 077 917 1 988 077 918 NaN
validate_no_orgnr
Validate if a data cell is Orgnr in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing New Zealand IRD numbers (IRDs).
dataprep.clean.clean_nz_ird.
clean_nz_ird
Clean New Zealand IRD numbers (IRDs) type data in a DataFrame column.
col – The name of the column containing data of IRD type.
Clean a column of IRD data.
>>> df = pd.DataFrame({ "ird": [ "49091850", "136410133",] }) >>> clean_nz_ird(df, 'ird') ird ird_clean 0 49091850 49-091-850 1 136410133 NaN
validate_nz_ird
Validate if a data cell is IRD in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Peruvian personal numbers (CUIs).
dataprep.clean.clean_pe_cui.
clean_pe_cui
Clean Peruvian personal numbers (CUIs) type data in a DataFrame column.
col – The name of the column containing data of CUI type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘ruc’, convert the number to a valid RUC. Note: in the case of CUI, the compact format is the same as the standard one.
Clean a column of CUI data.
>>> df = pd.DataFrame({ "cui": [ "10117410", "10117410-3",] }) >>> clean_pe_cui(df, 'cui') cui cui_clean 0 10117410 10117410 1 10117410-3 NaN
validate_pe_cui
Validate if a data cell is CUI in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Peruvian fiscal numbers (RUCs).
dataprep.clean.clean_pe_ruc.
clean_pe_ruc
Clean Peruvian fiscal numbers (RUCs) type data in a DataFrame column.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘dni’, return the DNI (CUI) part of the number for natural persons; if the RUC is not for a natural person, return NaN. Note: in the case of RUC, the compact format is the same as the standard one.
>>> df = pd.DataFrame({ "ruc": [ "20512333797", "20512333798",] }) >>> clean_pe_ruc(df, 'ruc') ruc ruc_clean 0 20512333797 20512333797 1 20512333798 NaN
validate_pe_ruc
Clean and validate a DataFrame column containing Polish VAT numbers (NIPs).
dataprep.clean.clean_pl_nip.
clean_pl_nip
Clean Polish VAT numbers (NIPs) type data in a DataFrame column.
col – The name of the column containing data of NIP type.
Clean a column of NIP data.
>>> df = pd.DataFrame({ "nip": [ "PL 8567346215", "PL 8567346216",] }) >>> clean_pl_nip(df, 'nip') nip nip_clean 0 PL 8567346215 856-734-62-15 1 PL 8567346216 NaN
validate_pl_nip
Validate if a data cell is NIP in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Polish national identification numbers (PESELs).
dataprep.clean.clean_pl_pesel.
clean_pl_pesel
Clean Polish national identification numbers (PESELs) type data in a DataFrame column.
col – The name of the column containing data of PESEL type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of PESEL, the compact format is the same as the standard one.
Clean a column of PESEL data.
>>> df = pd.DataFrame({
...     "pesel": ["44051401359", "44051401358"]
... })
>>> clean_pl_pesel(df, 'pesel')
         pesel  pesel_clean
0  44051401359  44051401359
1  44051401358          NaN
validate_pl_pesel
Validate if a data cell is PESEL in a DataFrame column. For each cell, return True or False.
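The 'birthdate' and 'gender' output formats rely on information encoded in the number itself. The sketch below shows one way to decode it, assuming the commonly documented PESEL layout (YYMMDD, a century offset added to the month, a gender digit in position 10, and a weighted check digit). The helper names are hypothetical and this is not DataPrep's implementation.

```python
import datetime

PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
# The month field carries the century: 01-12 -> 1900s, 21-32 -> 2000s, and so on.
CENTURY_BY_MONTH_OFFSET = {0: 1900, 20: 2000, 40: 2100, 60: 2200, 80: 1800}


def pesel_check_digit(first_ten: str) -> str:
    total = sum(w * int(d) for w, d in zip(PESEL_WEIGHTS, first_ten))
    return str((10 - total % 10) % 10)


def is_valid_pesel(number: str) -> bool:
    """Check length, digits and check digit (the date is not verified here)."""
    if len(number) != 11 or not number.isdecimal():
        return False
    return pesel_check_digit(number[:10]) == number[10]


def pesel_birthdate(number: str) -> datetime.date:
    year, month, day = int(number[0:2]), int(number[2:4]), int(number[4:6])
    offset = month // 20 * 20
    return datetime.date(CENTURY_BY_MONTH_OFFSET[offset] + year, month - offset, day)


def pesel_gender(number: str) -> str:
    """Even tenth digit -> 'F', odd -> 'M'."""
    return "F" if int(number[9]) % 2 == 0 else "M"


print(is_valid_pesel("44051401359"))   # True
print(is_valid_pesel("44051401358"))   # False
print(pesel_birthdate("44051401359"))  # 1944-05-14
print(pesel_gender("44051401359"))     # M
```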
Clean and validate a DataFrame column containing Polish register of economic units (REGONs).
dataprep.clean.clean_pl_regon.
clean_pl_regon
Clean Polish register of economic units (REGONs) type data in a DataFrame column.
col – The name of the column containing data of REGON type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of REGON, the compact format is the same as the standard one.
Clean a column of REGON data.
>>> df = pd.DataFrame({
...     "regon": ['192598184', '192598183']
... })
>>> clean_pl_regon(df, 'regon')
       regon  regon_clean
0  192598184    192598184
1  192598183          NaN
validate_pl_regon
Validate if a data cell is REGON in a DataFrame column. For each cell, return True or False.
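For the nine-digit form used in the example, validity comes down to a weighted check digit. The following is a rough sketch under that assumption; the names are hypothetical, the 14-digit REGON variant is deliberately not handled, and DataPrep's own logic may differ.

```python
REGON9_WEIGHTS = (8, 9, 2, 3, 4, 5, 6, 7)


def regon9_check_digit(first_eight: str) -> str:
    """Weighted sum modulo 11; a remainder of 10 maps to 0."""
    total = sum(w * int(d) for w, d in zip(REGON9_WEIGHTS, first_eight))
    return str(total % 11 % 10)


def is_valid_regon9(value: str) -> bool:
    number = value.strip()
    if len(number) != 9 or not number.isdecimal():
        return False
    return regon9_check_digit(number[:8]) == number[8]


print(is_valid_regon9("192598184"))  # True
print(is_valid_regon9("192598183"))  # False
```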
Clean and validate a DataFrame column containing Portuguese NIF numbers (NIFs).
dataprep.clean.clean_pt_nif.
clean_pt_nif
Clean Portuguese NIF numbers (NIFs) type data in a DataFrame column.
>>> df = pd.DataFrame({
...     "nif": ['PT 501 964 843', 'PT 501 964 842']
... })
>>> clean_pt_nif(df, 'nif')
              nif  nif_clean
0  PT 501 964 843  501964843
1  PT 501 964 842        NaN
validate_pt_nif
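A small sketch of the check-digit rule usually described for Portuguese NIFs helps explain why only the first example row is kept. The helper names are hypothetical, the rule (weights 9 down to 2, modulo 11) is an assumption based on the common description of the format, and none of this is taken from DataPrep's source.

```python
def compact_pt_nif(value: str) -> str:
    """Drop separators and an optional 'PT' prefix."""
    cleaned = value.upper().replace(" ", "").replace("-", "").replace(".", "")
    return cleaned[2:] if cleaned.startswith("PT") else cleaned


def pt_nif_check_digit(first_eight: str) -> str:
    """Weights 9, 8, ..., 2 over the first eight digits."""
    total = sum((9 - i) * int(d) for i, d in enumerate(first_eight))
    return str((11 - total) % 11 % 10)


def is_valid_pt_nif(value: str) -> bool:
    number = compact_pt_nif(value)
    if len(number) != 9 or not number.isdecimal() or number[0] == "0":
        return False
    return pt_nif_check_digit(number[:8]) == number[8]


print(is_valid_pt_nif("PT 501 964 843"))  # True
print(is_valid_pt_nif("PT 501 964 842"))  # False
```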
Clean and validate a DataFrame column containing Paraguay RUC numbers (RUCs).
dataprep.clean.clean_py_ruc.
clean_py_ruc
Clean Paraguay RUC numbers (RUCs) type data in a DataFrame column.
>>> df = pd.DataFrame({
...     "ruc": ["800000358", "80123456789"]
... })
>>> clean_py_ruc(df, 'ruc')
           ruc   ruc_clean
0    800000358  80000035-8
1  80123456789         NaN
validate_py_ruc
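The sketch below illustrates one plausible reading of the example: a Paraguayan RUC is at most nine digits long, ends in a modulo-11 check digit, and is displayed with a hyphen before that digit. The length limit and the weights (2, 3, 4, ... from the right) are assumptions, the function names are hypothetical, and DataPrep may apply further checks.

```python
def compact_py_ruc(value: str) -> str:
    return value.replace(" ", "").replace("-", "").strip()


def py_ruc_check_digit(body: str) -> str:
    """Weights 2, 3, 4, ... applied from the rightmost digit of the body."""
    total = sum((i + 2) * int(d) for i, d in enumerate(reversed(body)))
    return str(-total % 11 % 10)


def is_valid_py_ruc(value: str) -> bool:
    number = compact_py_ruc(value)
    # Assumed limit: no more than nine digits including the check digit.
    if not number.isdecimal() or not 2 <= len(number) <= 9:
        return False
    return py_ruc_check_digit(number[:-1]) == number[-1]


def format_py_ruc(value: str) -> str:
    number = compact_py_ruc(value)
    return number[:-1] + "-" + number[-1]


print(is_valid_py_ruc("800000358"))    # True
print(is_valid_py_ruc("80123456789"))  # False (too long)
print(format_py_ruc("800000358"))      # 80000035-8
```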
Clean and validate a DataFrame column containing Romanian CF (VAT) numbers (CFs).
dataprep.clean.clean_ro_cf.
clean_ro_cf
Clean Romanian CF (VAT) numbers (CFs) type data in a DataFrame column.
col – The name of the column containing data of CF type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CF, the compact format is the same as the standard one.
Clean a column of CF data.
>>> df = pd.DataFrame({
...     "cf": ["RO 185 472 90", "RO 185 472 903333"]
... })
>>> clean_ro_cf(df, 'cf')
                  cf    cf_clean
0      RO 185 472 90  RO18547290
1  RO 185 472 903333         NaN
validate_ro_cf
Validate if a data cell is CF in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Romanian Numerical Personal Codes (CNPs).
dataprep.clean.clean_ro_cnp.
clean_ro_cnp
Clean Romanian Numerical Personal Codes (CNPs) type data in a DataFrame column.
col – The name of the column containing data of CNP type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. Note: in the case of CNP, the compact format is the same as the standard one.
Clean a column of CNP data.
>>> df = pd.DataFrame({
...     "cnp": ["1630615123457", "0800101221142"]
... })
>>> clean_ro_cnp(df, 'cnp')
             cnp      cnp_clean
0  1630615123457  1630615123457
1  0800101221142            NaN
validate_ro_cnp
Validate if a data cell is CNP in a DataFrame column. For each cell, return True or False.
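To show where the 'birthdate' output comes from, here is a hedged sketch that decodes a CNP using its commonly documented layout: a leading digit for sex and century, YYMMDD, and a weighted check digit. The helper names are hypothetical, county codes are not checked, and this is not DataPrep's implementation.

```python
import datetime
from typing import Optional

CNP_WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)
# Leading digit -> century; 7-9 (residents) do not determine a century here.
CNP_CENTURY = {"1": 1900, "2": 1900, "3": 1800, "4": 1800, "5": 2000, "6": 2000}


def cnp_check_digit(first_twelve: str) -> str:
    check = sum(w * int(d) for w, d in zip(CNP_WEIGHTS, first_twelve)) % 11
    return "1" if check == 10 else str(check)


def is_valid_cnp(number: str) -> bool:
    if len(number) != 13 or not number.isdecimal():
        return False
    if number[0] not in "123456789":  # a leading 0 is not a valid first digit
        return False
    return cnp_check_digit(number[:12]) == number[12]


def cnp_birthdate(number: str) -> Optional[datetime.date]:
    century = CNP_CENTURY.get(number[0])
    if century is None:
        return None
    return datetime.date(century + int(number[1:3]), int(number[3:5]), int(number[5:7]))


print(is_valid_cnp("1630615123457"))   # True
print(is_valid_cnp("0800101221142"))   # False (invalid first digit)
print(cnp_birthdate("1630615123457"))  # 1963-06-15
```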
Clean and validate a DataFrame column containing Romanian company identifiers (CUIs).
dataprep.clean.clean_ro_cui.
clean_ro_cui
Clean Romanian company identifiers (CUIs) type data in a DataFrame column.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CUI, the compact format is the same as the standard one.
>>> df = pd.DataFrame({
...     "cui": ["185 472 90", "185 472 91"]
... })
>>> clean_ro_cui(df, 'cui')
          cui  cui_clean
0  185 472 90   18547290
1  185 472 91        NaN
validate_ro_cui
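The two example rows differ only in the last digit, which is a check digit. A minimal sketch of the rule as it is usually documented for Romanian CUIs follows; the function names are hypothetical, the weights and length bounds are assumptions, and DataPrep's internal checks may be stricter.

```python
CUI_WEIGHTS = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def compact_ro_cui(value: str) -> str:
    """Drop spaces, hyphens and an optional 'RO' prefix."""
    cleaned = value.upper().replace(" ", "").replace("-", "")
    return cleaned[2:] if cleaned.startswith("RO") else cleaned


def ro_cui_check_digit(body: str) -> str:
    """The body is left-padded with zeros to nine digits before weighting."""
    total = sum(w * int(d) for w, d in zip(CUI_WEIGHTS, body.zfill(9)))
    return str(10 * total % 11 % 10)


def is_valid_ro_cui(value: str) -> bool:
    number = compact_ro_cui(value)
    if not number.isdecimal() or number[0] == "0" or not 2 <= len(number) <= 10:
        return False
    return ro_cui_check_digit(number[:-1]) == number[-1]


print(is_valid_ro_cui("185 472 90"))  # True
print(is_valid_ro_cui("185 472 91"))  # False
```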
Clean and validate a DataFrame column containing Romanian Trade Register identifiers (ONRCs).
dataprep.clean.clean_ro_onrc.
clean_ro_onrc
Clean Romanian Trade Register identifiers (ONRCs) type data in a DataFrame column.
col – The name of the column containing data of ONRC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of ONRC, the compact format is the same as the standard one.
Clean a column of ONRC data.
>>> df = pd.DataFrame({
...     "onrc": ["J52/750/2012", "X52/750/2012"]
... })
>>> clean_ro_onrc(df, 'onrc')
           onrc    onrc_clean
0  J52/750/2012  J52/750/2012
1  X52/750/2012           NaN
validate_ro_onrc
Validate if a data cell is ONRC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing French company establishment identification numbers (SIRETs).
dataprep.clean.clean_fr_siret.
clean_fr_siret
Clean French Company Establishment Identification Numbers (SIRETs) in a DataFrame column.
col – The name of the column containing data of SIRET type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘siren’, convert the SIRET number to a SIREN number. If output_format = ‘tva’, convert the SIRET number to a TVA number.
Clean a column of SIRET data.
>>> df = pd.DataFrame({
...     "siret": ["73282932000074", "73282932000079"]
... })
>>> clean_fr_siret(df, 'siret')
            siret        siret_clean
0  73282932000074  732 829 320 00074
1  73282932000079                NaN
validate_fr_siret
Validate if a data cell is SIRET in a DataFrame column. For each cell, return True or False.
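The 'standard' and 'siren' output formats are simple rearrangements of the 14 digits. The sketch below shows them on an already valid number; the helper names are hypothetical, no checksum is verified, and the 'tva' conversion is left out because it needs an extra computation not shown here.

```python
def compact_siret(value: str) -> str:
    """Remove the separators commonly found in written SIRET numbers."""
    return value.replace(" ", "").replace(".", "").replace("-", "")


def format_siret(value: str) -> str:
    """Group as 3-3-3-5, matching the example output above."""
    n = compact_siret(value)
    return " ".join((n[:3], n[3:6], n[6:9], n[9:]))


def siret_to_siren(value: str) -> str:
    """The SIREN is the first nine digits of the SIRET."""
    return compact_siret(value)[:9]


print(format_siret("73282932000074"))       # 732 829 320 00074
print(siret_to_siren("732 829 320 00074"))  # 732829320
```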
Clean and validate a DataFrame column containing United Kingdom National Health Service patient identifiers (NHS numbers).
dataprep.clean.clean_gb_nhs.
clean_gb_nhs
Clean United Kingdom NHS numbers (NHSs) type data in a DataFrame column.
col – The name of the column containing data of NHS type.
Clean a column of NHS data.
>>> df = pd.DataFrame({
...     "nhs": ["9434765870", "9434765871"]
... })
>>> clean_gb_nhs(df, 'nhs')
          nhs     nhs_clean
0  9434765870  943 476 5870
1  9434765871           NaN
validate_gb_nhs
Validate if a data cell is NHS in a DataFrame column. For each cell, return True or False.
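NHS numbers carry a modulus-11 check digit, which is what separates the two example rows. The following sketch of that check uses hypothetical helper names and is not DataPrep's code.

```python
def compact_nhs(value: str) -> str:
    return value.replace(" ", "").replace("-", "")


def nhs_check_digit(first_nine: str) -> int:
    """Weights 10 down to 2; 11 maps to 0, and 10 means no valid number exists."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first_nine))
    check = 11 - total % 11
    return 0 if check == 11 else check


def is_valid_nhs(value: str) -> bool:
    number = compact_nhs(value)
    if len(number) != 10 or not number.isdecimal():
        return False
    return nhs_check_digit(number[:9]) == int(number[9])


def format_nhs(value: str) -> str:
    """Group as 3-3-4, matching the example output above."""
    n = compact_nhs(value)
    return " ".join((n[:3], n[3:6], n[6:]))


print(is_valid_nhs("9434765870"))  # True
print(is_valid_nhs("9434765871"))  # False
print(format_nhs("9434765870"))    # 943 476 5870
```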
Clean and validate a DataFrame column containing Burgerservicenummer, the Dutch citizen identification numbers (BSNs).
dataprep.clean.clean_nl_bsn.
clean_nl_bsn
Clean Burgerservicenummer (BSNs) type data in a DataFrame column.
col – The name of the column containing data of BSN type.
Clean a column of BSN data.
>>> df = pd.DataFrame({
...     "bsn": ["111222333", "1112223334"]
... })
>>> clean_nl_bsn(df, 'bsn')
          bsn    bsn_clean
0   111222333  1112.22.333
1  1112223334          NaN
validate_nl_bsn
Validate if a data cell is BSN in a DataFrame column. For each cell, return True or False.
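A BSN is commonly checked with the so-called eleven test. The sketch below applies it and reproduces the dotted grouping shown in the example output; the names are hypothetical, the zero-padding of short inputs is an assumption, and DataPrep's implementation may differ.

```python
def compact_bsn(value: str) -> str:
    """Strip separators and left-pad with zeros to nine digits (assumed)."""
    digits = value.replace(" ", "").replace(".", "").replace("-", "")
    return digits.zfill(9)


def is_valid_bsn(value: str) -> bool:
    number = compact_bsn(value)
    if len(number) != 9 or not number.isdecimal() or int(number) == 0:
        return False
    # Eleven test: weights 9..2 on the first eight digits, -1 on the last.
    total = sum((9 - i) * int(d) for i, d in enumerate(number[:8])) - int(number[8])
    return total % 11 == 0


def format_bsn(value: str) -> str:
    """Grouping taken from the example output above (4.2.3)."""
    n = compact_bsn(value)
    return n[:4] + "." + n[4:6] + "." + n[6:]


print(is_valid_bsn("111222333"))   # True
print(is_valid_bsn("1112223334"))  # False (ten digits)
print(format_bsn("111222333"))     # 1112.22.333
```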
Clean and validate a DataFrame column containing Onderwijsnummer, the Dutch student identification number.
dataprep.clean.clean_nl_onderwijsnummer.
clean_nl_onderwijsnummer
Clean Onderwijsnummer type data in a DataFrame column.
col – The name of the column containing data of onderwijsnummer type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of onderwijsnummer, the compact format is the same as the standard one.
Clean a column of onderwijsnummer data.
>>> df = pd.DataFrame({
...     "onderwijsnummer": ['1012.22.331', '2112.22.337']
... })
>>> clean_nl_onderwijsnummer(df, 'onderwijsnummer')
   onderwijsnummer  onderwijsnummer_clean
0      1012.22.331             0403019261
1      2112.22.337                    NaN
validate_nl_onderwijsnummer
Validate if a data cell is onderwijsnummer in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Bulgarian personal number of a foreigner.
dataprep.clean.clean_bg_pnf.
clean_bg_pnf
Clean Bulgarian personal number of a foreigner type data in a DataFrame column.
col – The name of the column containing data of PNF type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of PNF, the compact format is the same as the standard one.
Clean a column of PNF data.
>>> df = pd.DataFrame({
...     "pnf": [
...         '7111 042 925',
...         '7111042922',  # invalid check digit
...     ]
... })
>>> clean_bg_pnf(df, 'pnf')
            pnf   pnf_clean
0  7111 042 925  7111042925
1    7111042922         NaN
validate_bg_pnf
Validate if a data cell is PNF in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing CPF numbers, Brazilian national identifier.
dataprep.clean.clean_br_cpf.
clean_br_cpf
Clean Brazilian national identifier data in a DataFrame column.
>>> df = pd.DataFrame({
...     "cpf": [
...         '23100299900',
...         '231.002.999-00',  # InvalidChecksum
...         '390.533.447=0',   # invalid delimiter
...     ]
... })
>>> clean_br_cpf(df, 'cpf')
              cpf       cpf_clean
0     23100299900  231.002.999-00
1  231.002.999-00             NaN
2   390.533.447=0             NaN
validate_br_cpf
Clean and validate a DataFrame column containing Canadian Social Insurance Numbers (SINs).
dataprep.clean.clean_ca_sin.
clean_ca_sin
Clean Canadian Social Insurance Numbers (SINs) type data in a DataFrame column.
col – The name of the column containing data of SIN type.
Clean a column of SIN data.
>>> df = pd.DataFrame({
...     "sin": ['123456782', '12345678Z']
... })
>>> clean_ca_sin(df, 'sin')
         sin    sin_clean
0  123456782  123-456-782
1  12345678Z          NaN
validate_ca_sin
Validate if a data cell is SIN in a DataFrame column. For each cell, return True or False.
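Canadian SINs are protected by the Luhn checksum. As a rough illustration of what a True result implies, the sketch below checks the format and the Luhn sum and produces the hyphenated form from the example. The helper names are hypothetical, and any further rules DataPrep may apply (for instance on the leading digit) are not modelled.

```python
def luhn_is_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def compact_sin(value: str) -> str:
    return value.replace(" ", "").replace("-", "")


def is_valid_sin(value: str) -> bool:
    number = compact_sin(value)
    if len(number) != 9 or not number.isdecimal():
        return False
    return luhn_is_valid(number)


def format_sin(value: str) -> str:
    n = compact_sin(value)
    return "-".join((n[:3], n[3:6], n[6:]))


print(is_valid_sin("123456782"))  # True
print(is_valid_sin("12345678Z"))  # False
print(format_sin("123456782"))    # 123-456-782
```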
Clean and validate a DataFrame column containing Chinese Unified Social Credit Code (China tax number) (USCCs).
dataprep.clean.clean_cn_uscc.
clean_cn_uscc
Clean Chinese Unified Social Credit Code (USCCs) type data in a DataFrame column.
col – The name of the column containing data of USCC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note that in the case of USCC, the compact format is the same as the standard one.
Clean a column of USCC data.
>>> df = pd.DataFrame({
...     "uscc": ["9 1 110000 600037341L", "A1110000600037341L"]
... })
>>> clean_cn_uscc(df, 'uscc')
                    uscc          uscc_clean
0  9 1 110000 600037341L  91110000600037341L
1     A1110000600037341L                 NaN
validate_cn_uscc
Validate if a data cell is USCC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Estonian organisation registration codes.
dataprep.clean.clean_ee_registrikood.
clean_ee_registrikood
Clean Estonian organisation registration codes (Registrikoods) type data in a DataFrame column.
col – The name of the column containing data of Registrikood type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of Registrikood, the compact format is the same as the standard one.
Clean a column of Registrikood data.
>>> df = pd.DataFrame({
...     "registrikood": ['12345678', '12345679']
... })
>>> clean_ee_registrikood(df, 'registrikood')
   registrikood  registrikood_clean
0      12345678            12345678
1      12345679                 NaN
validate_ee_registrikood
Validate if a data cell is Registrikood in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Spanish real estate property ids.
dataprep.clean.clean_es_referenciacatastral.
clean_es_referenciacatastral
Clean Spanish real estate property ids (Referencias Catastrales) type data in a DataFrame column.
col – The name of the column containing data of Referencia Catastral type.
Clean a column of Referencia Catastral data.
>>> df = pd.DataFrame({
...     "referenciacatastral": ["4A08169P03PRAT0001LR", "7837301/VG8173B 0001 TT"]
... })
>>> clean_es_referenciacatastral(df, 'referenciacatastral')
       referenciacatastral  referenciacatastral_clean
0     4A08169P03PRAT0001LR    4A08169 P03PRAT 0001 LR
1  7837301/VG8173B 0001 TT                        NaN
validate_es_referenciacatastral
Validate if a data cell is Referencia Catastral in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Euro banknote serial numbers.
dataprep.clean.clean_eu_banknote.
clean_eu_banknote
Clean Euro banknote serial numbers type data in a DataFrame column.
col – The name of the column containing data of Euro banknote type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of Euro banknote, the compact format is the same as the standard one.
Clean a column of Euro banknote data.
>>> df = pd.DataFrame({
...     "banknote": ['P36007033744', 'P36007033743']
... })
>>> clean_eu_banknote(df, 'banknote')
       banknote  banknote_clean
0  P36007033744    P36007033744
1  P36007033743             NaN
validate_eu_banknote
Validate if a data cell is Euro banknote in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing European Energy Identification Codes.
dataprep.clean.clean_eu_eic.
clean_eu_eic
Clean European Energy Identification Codes (EICs) type data in a DataFrame column.
col – The name of the column containing data of EIC type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of EIC, the compact format is the same as the standard one.
Clean a column of EIC data.
>>> df = pd.DataFrame({
...     "eic": ['22XWATTPLUS----G', '22XWATTPLUS----X']
... })
>>> clean_eu_eic(df, 'eic')
                eic         eic_clean
0  22XWATTPLUS----G  22XWATTPLUS----G
1  22XWATTPLUS----X               NaN
validate_eu_eic
Validate if a data cell is EIC in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Finnish association registry ids.
dataprep.clean.clean_fi_associationid.
clean_fi_associationid
Clean Finnish association registry ids type data in a DataFrame column.
col – The name of the column containing data of Finnish association registry id.
Clean a column of Finnish association registry id data.
>>> df = pd.DataFrame({
...     "associationid": ["1234", "12df"]
... })
>>> clean_fi_associationid(df, 'associationid')
   associationid  associationid_clean
0           1234                1.234
1           12df                  NaN
validate_fi_associationid
Validate if a data cell is Finnish association registry id in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Finnish individual tax numbers.
dataprep.clean.clean_fi_veronumero.
clean_fi_veronumero
Clean Finnish individual tax numbers type data in a DataFrame column.
col – The name of the column containing data of Veronumero type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of Veronumero, the compact format is the same as the standard one.
Clean a column of Veronumero data.
>>> df = pd.DataFrame({
...     "veronumero": ['123456789123', '12345678912A']
... })
>>> clean_fi_veronumero(df, 'veronumero')
     veronumero  veronumero_clean
0  123456789123      123456789123
1  12345678912A               NaN
validate_fi_veronumero
Validate if a data cell is Veronumero in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Dutch postal codes.
dataprep.clean.clean_nl_postcode.
clean_nl_postcode
Clean Dutch postal codes type data in a DataFrame column.
col – The name of the column containing data of postcode type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of postcode, the compact format is the same as the standard one.
Clean a column of postcode data.
>>> df = pd.DataFrame({
...     "postcode": ['NL-2611ET', '26112 ET']
... })
>>> clean_nl_postcode(df, 'postcode')
    postcode  postcode_clean
0  NL-2611ET          2611ET
1   26112 ET             NaN
validate_nl_postcode
Validate if a data cell is postcode in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing Norwegian birth numbers.
dataprep.clean.clean_no_fodselsnummer.
clean_no_fodselsnummer
Clean Norwegian birth number data in a DataFrame column.
col – The name of the column containing data of fodselsnummer type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’).
Clean a column of fodselsnummer data.
>>> df = pd.DataFrame({
...     "fodselsnummer": ['15108695088', '15108695077']
... })
>>> clean_no_fodselsnummer(df, 'fodselsnummer')
   fodselsnummer  fodselsnummer_clean
0    15108695088         151086 95088
1    15108695077                  NaN
validate_no_fodselsnummer
Validate if a data cell is fodselsnummer in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing New Zealand bank account numbers.
dataprep.clean.clean_nz_bankaccount.
clean_nz_bankaccount
Clean New Zealand bank account numbers type data in a DataFrame column.
col – The name of the column containing data of bankaccount type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘info’, return a dictionary of data about the supplied number.
This typically returns the name of the bank and branch and a BIC if it is valid.
Clean a column of bankaccount data.
>>> df = pd.DataFrame({
...     "bankaccount": ["0102420100194000", "01-0242-0100195-00"]
... })
>>> clean_nz_bankaccount(df, 'bankaccount')
          bankaccount    bankaccount_clean
0    0102420100194000  01-0242-0100194-000
1  01-0242-0100195-00                  NaN
validate_nz_bankaccount
Validate if a data cell is bankaccount in a DataFrame column. For each cell, return True or False.
Clean and validate a DataFrame column containing AT-02 (SEPA Creditor identifier).
dataprep.clean.clean_eu_at_02.
clean_eu_at_02
Clean AT-02 (SEPA Creditor identifier) type data in a DataFrame column.
col – The name of the column containing data of AT-02 type.
The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of AT-02, the compact format is the same as the standard one.
Clean a column of AT-02 data.
>>> df = pd.DataFrame({
...     "at_02": ['ES++()+23ZZZ4//7690558N', 'ES2900047690558N']
... })
>>> clean_eu_at_02(df, 'at_02')
                     at_02       at_02_clean
0  ES++()+23ZZZ4//7690558N  ES23ZZZ47690558N
1         ES2900047690558N               NaN
validate_eu_at_02
Validate if a data cell is AT-02 in a DataFrame column. For each cell, return True or False.