dataprep.clean

API reference for the DataPrep.Clean subpackage.

Column Headers

Clean and standardize column headers for a DataFrame.

dataprep.clean.clean_headers.clean_headers(df, case='snake', replace=None, remove_accents=True, report=True)[source]

Function to clean column headers (column names).

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – Dataframe from which column names are to be cleaned.

  • case (str) –

    The desired case style of the column name.
    • ’snake’: ‘column_name’

    • ’kebab’: ‘column-name’

    • ’camel’: ‘columnName’

    • ’pascal’: ‘ColumnName’

    • ’const’: ‘COLUMN_NAME’

    • ’sentence’: ‘Column name’

    • ’title’: ‘Column Name’

    • ’lower’: ‘column name’

    • ’upper’: ‘COLUMN NAME’

    (default: ‘snake’)

  • replace (Optional[Dict[str, str]]) –

    Values to replace in the column names.
    • {‘old_value’: ‘new_value’}

    (default: None)

  • remove_accents (bool) –

    If True, strip accents from the column names.

    (default: True)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

Examples

Clean column names by converting the names to camel case style, removing accents, and correcting a misspelling.

>>> df = pd.DataFrame({'FirstNom': ['Philip', 'Turanga'], 'lastName': ['Fry', 'Leela'], 'Téléphone': ['555-234-5678', '(604) 111-2335']})
>>> clean_headers(df, case='camel', replace={'Nom': 'Name'})
Column Headers Cleaning Report:
    2 values cleaned (66.67%)
  firstName lastName       telephone
0    Philip      Fry    555-234-5678
1   Turanga    Leela  (604) 111-2335
Return type

DataFrame
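
The case styles listed above can be illustrated with a short standard-library sketch. This is not DataPrep's implementation: the helper names strip_accents, split_words, and convert_case are hypothetical, and only five of the nine styles are covered.

```python
import re
import unicodedata
from typing import List


def strip_accents(name: str) -> str:
    """Drop combining marks after NFKD decomposition ('é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_words(name: str) -> List[str]:
    """Split a header into lowercase words on separators and case boundaries."""
    # Insert a space at lower-to-upper boundaries: 'lastName' -> 'last Name'.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def convert_case(name: str, case: str = "snake") -> str:
    """Convert one header to the requested case style (hypothetical helper)."""
    words = split_words(strip_accents(name))
    if not words:
        return ""
    if case == "snake":
        return "_".join(words)
    if case == "kebab":
        return "-".join(words)
    if case == "const":
        return "_".join(words).upper()
    if case == "camel":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if case == "pascal":
        return "".join(w.capitalize() for w in words)
    raise ValueError(f"unsupported case style: {case}")


# Apply the replacement first, then convert, as in the example above.
headers = ["FirstNom", "lastName", "Téléphone"]
cleaned = [convert_case(h.replace("Nom", "Name"), "camel") for h in headers]
```

With the headers from the example above, cleaned holds 'firstName', 'lastName', and 'telephone'.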

Country Names

Clean and validate a DataFrame column containing country names.

dataprep.clean.clean_country.clean_country(df, column, input_format='auto', output_format='name', fuzzy_dist=0, strict=False, inplace=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize country names.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing country names.

  • input_format (Union[str, Tuple[str, …]]) –

    The ISO 3166 input format of the country.
    • ’auto’: infer the input format

    • ’name’: country name (‘United States’)

    • ’official’: official state name (‘United States of America’)

    • ’alpha-2’: alpha-2 code (‘US’)

    • ’alpha-3’: alpha-3 code (‘USA’)

    • ’numeric’: numeric code (840)

    Can also be a tuple containing any combination of input formats. For example, to clean a column containing alpha-2 and numeric codes, set input_format to (‘alpha-2’, ‘numeric’).

    (default: ‘auto’)

  • output_format (str) –

    The desired ISO 3166 format of the country:
    • ’name’: country name (‘United States’)

    • ’official’: official state name (‘United States of America’)

    • ’alpha-2’: alpha-2 code (‘US’)

    • ’alpha-3’: alpha-3 code (‘USA’)

    • ’numeric’: numeric code (840)

    (default: ‘name’)

  • fuzzy_dist (int) –

    The maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) between a country value and input that will count as a match. Only applies to ‘auto’, ‘name’ and ‘official’ input formats.

    (default: 0)

  • strict (bool) –

    If True, matching for input formats ‘name’ and ‘official’ is done by looking for a direct match. If False, matching is done by searching the input for a regex match.

    (default: False)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

>>> df = pd.DataFrame({'country': [' Canada ', 'US']})
>>> clean_country(df, 'country')
Country Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
    country  country_clean
0   Canada          Canada
1        US  United States
Return type

DataFrame
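
The edit distance that fuzzy_dist bounds can be sketched as follows. This is an illustrative Levenshtein implementation, not DataPrep's matching code; fuzzy_match and its candidate list are hypothetical.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(value: str, candidates: Iterable[str], fuzzy_dist: int = 0) -> Optional[str]:
    """Return the first candidate within fuzzy_dist edits of value, else None."""
    cleaned = value.strip().lower()
    for name in candidates:
        if edit_distance(cleaned, name.lower()) <= fuzzy_dist:
            return name
    return None


countries = ["Canada", "United States"]  # hypothetical reference list
```

Under these assumptions, fuzzy_match(' Canda ', countries, fuzzy_dist=1) returns 'Canada', while the default fuzzy_dist=0 returns None for the same input.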

dataprep.clean.clean_country.validate_country(x, input_format='auto', strict=True)[source]

Validate country names.

Read more in the User Guide.

Parameters
  • x (Union[str, int, Series]) – pandas Series of countries or str/int country value.

  • input_format (Union[str, Tuple[str, …]]) –

    The ISO 3166 input format of the country.
    • ’auto’: infer the input format

    • ’name’: country name (‘United States’)

    • ’official’: official state name (‘United States of America’)

    • ’alpha-2’: alpha-2 code (‘US’)

    • ’alpha-3’: alpha-3 code (‘USA’)

    • ’numeric’: numeric code (840)

    Can also be a tuple containing any combination of input formats. For example, to validate a column containing alpha-2 and numeric codes, set input_format to (‘alpha-2’, ‘numeric’).

    (default: ‘auto’)

  • strict (bool) –

    If True, matching for input formats ‘name’ and ‘official’ is done by looking for a direct match. If False, matching is done by searching the input for a regex match.

    (default: True)

Examples

>>> validate_country('United States')
True
>>> df = pd.DataFrame({'country': ['Canada', 'NaN']})
>>> validate_country(df['country'])
0     True
1    False
Name: country, dtype: bool
Return type

Union[bool, Series]

Dates and Times

Clean and validate a DataFrame column containing dates and times.

dataprep.clean.clean_date.clean_date(df, column, output_format='YYYY-MM-DD hh:mm:ss', input_timezone='UTC', output_timezone='', fix_missing='minimum', infer_day_first=True, inplace=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize dates and times.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing dates.

  • output_format (str) –

    The desired format of the date.

    (default: ‘YYYY-MM-DD hh:mm:ss’)

  • input_timezone (str) –

    Time zone of the input date.

    (default: ‘UTC’)

  • output_timezone (str) –

    The desired time zone of the date.

    (default: ‘’)

  • fix_missing (str) –

    Specify how to fill missing components of a date value.
    • ’minimum’: fill hours, minutes, seconds with zeros, and month, day, year with January 1st, 2000.

    • ’current’: fill with the current date and time.

    • ’empty’: don’t fill missing components.

    (default: ‘minimum’)

  • infer_day_first (bool) – If True, infer whether the day comes first in ambiguous dates from the other values in the column. For example, given ‘09-10-03’ and ‘25-09-03’, the value ‘25-09-03’ can only be day-first, so both are parsed day-first, giving ‘2003-10-09’ and ‘2003-09-25’. If False, no inference is performed, giving ‘2003-09-10’ and ‘2003-09-25’. (default: True)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

>>> df = pd.DataFrame({'date': ['Thu Sep 25 2003', 'Thu 10:36:28', '2003 09 25']})
>>> clean_date(df, 'date')
Dates Cleaning Report:
    3 values cleaned (100.0%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
            date           date_clean
0  Thu Sep 25 2003  2003-09-25 00:00:00
1     Thu 10:36:28  2000-01-01 10:36:28
2       2003 09 25  2003-09-25 00:00:00
Return type

DataFrame
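
The day-first inference described for infer_day_first can be sketched with the standard library. This is an illustration under simplifying assumptions (every value has the form 'AA-BB-YY' and years fall in the 2000s), not DataPrep's parser; parse_two_digit_dates is a hypothetical helper.

```python
from datetime import date
from typing import List


def parse_two_digit_dates(values: List[str], infer_day_first: bool = True) -> List[date]:
    """Parse 'AA-BB-YY' strings, optionally inferring day-first order."""
    parts = [tuple(int(p) for p in v.split("-")) for v in values]
    # If any first component exceeds 12 it cannot be a month, so the
    # whole column is assumed to be written day-first.
    column_day_first = infer_day_first and any(first > 12 for first, _, _ in parts)
    parsed = []
    for first, second, year in parts:
        # A value whose first field cannot be a month is always day-first.
        day_first = column_day_first or first > 12
        day, month = (first, second) if day_first else (second, first)
        parsed.append(date(2000 + year, month, day))
    return parsed


values = ["09-10-03", "25-09-03"]
inferred = parse_two_digit_dates(values, infer_day_first=True)
not_inferred = parse_two_digit_dates(values, infer_day_first=False)
```

Here inferred corresponds to 2003-10-09 and 2003-09-25, and not_inferred to 2003-09-10 and 2003-09-25, matching the parameter description.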

dataprep.clean.clean_date.validate_date(date)[source]

Validate dates and times.

Parameters

date (Union[str, Series]) – pandas Series of dates or a date string

Examples

>>> validate_date('3rd of May 2001')
True
>>> df = pd.DataFrame({'date': ['2003/09/25', 'This is Sep.']})
>>> validate_date(df['date'])
0     True
1    False
Name: date, dtype: bool
Return type

Union[bool, Series]

Duplicate Values

Clean a DataFrame column containing duplicate values.

class dataprep.clean.clean_duplication.UserInterface(df, col_name, df_name, page_size)[source]

Bases: object

A user interface used by the clean_duplication function.

display()[source]

Display the UI.

Return type

Box

dataprep.clean.clean_duplication.clean_duplication(df, column, df_var_name='default', page_size=5)[source]

Cleans and standardizes duplicate values in a DataFrame.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing duplicate values.

  • df_var_name (str) –

    Optional parameter containing the variable name of the DataFrame being cleaned. This is only needed for legacy compatibility with the original version of this function, which needed it to produce correct exported code.

    (default: ‘default’)

  • page_size (int) –

    The number of clusters to display on each page.

    (default: 5)

Examples

After running clean_duplication(df, ‘city’) below in a notebook, a GUI will appear. Select the merge checkbox, press merge and re-cluster, then press finish.

>>> df = pd.DataFrame({'city': ['New York', 'new york']})
>>> clean_duplication(df, 'city')

       city
0  New York
1  New York

Return type

Box

Email Addresses

Clean and validate a DataFrame column containing email addresses.

dataprep.clean.clean_email.clean_email(df, column, remove_whitespace=False, fix_domain=False, split=False, inplace=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize email addresses.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing email addresses.

  • remove_whitespace (bool) –

    If True, remove all whitespace from the input value before verifying and cleaning it.

    (default: False)

  • fix_domain (bool) –

    If True, for invalid email domains, try to fix them using four strategies:
    • Swap neighboring characters.

    • Add a single character.

    • Remove a single character.

    • Swap each character with its nearby keys on the qwerty keyboard.

    The first valid domain found will be returned.

    (default: False)

  • split (bool) –

    If True, split a column containing email addresses into one column for the usernames and another column for the domains.

    (default: False)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

>>> df = pd.DataFrame({'email': ['Abc.example.com', 'Abc@example.com', 'H ELLO@hotmal.COM']})
>>> clean_email(df, 'email')
Email Cleaning Report:
    2 values with bad format (66.67%)
Result contains 1 (33.33%) values in the correct format and 2 null values (66.67%)
            email      email_clean
0    Abc.example.com              NaN
1    Abc@example.com  abc@example.com
2  H ELLO@hotmal.COM              NaN
Return type

DataFrame
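
The fix_domain strategies can be sketched as a candidate generator checked against a list of valid domains. This is a minimal illustration, not DataPrep's implementation: KNOWN_DOMAINS is a hypothetical stand-in for the real domain list, and the keyboard-neighbor strategy is omitted for brevity.

```python
import string
from typing import Iterator, Optional, Set

KNOWN_DOMAINS = {"gmail.com", "hotmail.com", "example.com"}  # hypothetical list


def candidate_domains(domain: str) -> Iterator[str]:
    """Yield domains that are one small edit away from the input."""
    # Strategy 1: swap each pair of neighboring characters.
    for i in range(len(domain) - 1):
        yield domain[:i] + domain[i + 1] + domain[i] + domain[i + 2:]
    # Strategy 2: add a single character at each position.
    for i in range(len(domain) + 1):
        for ch in string.ascii_lowercase:
            yield domain[:i] + ch + domain[i:]
    # Strategy 3: remove a single character.
    for i in range(len(domain)):
        yield domain[:i] + domain[i + 1:]


def fix_domain(domain: str, known: Set[str] = KNOWN_DOMAINS) -> Optional[str]:
    """Return the first valid domain found, or None if no candidate is valid."""
    domain = domain.lower()
    if domain in known:
        return domain
    for candidate in candidate_domains(domain):
        if candidate in known:
            return candidate
    return None
```

With this hypothetical domain list, fix_domain('hotmal.COM') returns 'hotmail.com' by adding a single character.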

dataprep.clean.clean_email.validate_email(x)[source]

Validate email addresses.

Read more in the User Guide.

Parameters

x (Union[str, Series]) – pandas Series of emails or a string containing an email.

Examples

>>> validate_email('Abc.example@com')
False
>>> df = pd.DataFrame({'email': ['abc.example.com', 'HELLO@HOTMAIL.COM']})
>>> validate_email(df['email'])
0    False
1     True
Name: email, dtype: bool
Return type

Union[bool, Series]

Geographic Coordinates

Clean and validate a DataFrame column containing geographic coordinates.

dataprep.clean.clean_lat_long.clean_lat_long(df, lat_long=None, *, lat_col=None, long_col=None, output_format='dd', split=False, inplace=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize latitude and longitude coordinates.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • lat_long (Optional[str]) – The name of the column containing latitude and longitude coordinates.

  • lat_col (Optional[str]) –

    The name of the column containing latitude coordinates.

    If specified, the parameter lat_long must be None.

  • long_col (Optional[str]) –

    The name of the column containing longitude coordinates.

    If specified, the parameter lat_long must be None.

  • output_format (str) –

    The desired format of the coordinates.
    • ’dd’: decimal degrees (51.4934, 0.0098)

    • ’ddh’: decimal degrees with hemisphere (‘51.4934° N, 0.0098° E’)

    • ’dm’: degrees minutes (‘51° 29.604′ N, 0° 0.588′ E’)

    • ’dms’: degrees minutes seconds (‘51° 29′ 36.24″ N, 0° 0′ 35.28″ E’)

    (default: ‘dd’)

  • split (bool) –

    If True, split the latitude and longitude coordinates into one column for latitude and a separate column for longitude. Otherwise, merge the latitude and longitude coordinates into one column.

    (default: False)

  • inplace (bool) –

    If True, delete the column(s) containing the data that was cleaned. Otherwise, keep the original column(s).

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Split a column containing latitude and longitude strings into separate columns in decimal degrees format.

>>> df = pd.DataFrame({'coord': ['51° 29′ 36.24″ N, 0° 0′ 35.28″ E', '51.4934° N, 0.0098° E']})
>>> clean_lat_long(df, 'coord', split=True)
Latitude and Longitude Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
                        coord  latitude  longitude
0  51° 29′ 36.24″ N, 0° 0′ 35.28″ E   51.4934     0.0098
1             51.4934° N, 0.0098° E   51.4934     0.0098
Return type

DataFrame
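
The conversion from degrees, minutes, and seconds to decimal degrees (the ‘dd’ output format) follows degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. The sketch below assumes coordinates written with the °, ′ and ″ symbols and a trailing hemisphere letter; to_decimal_degrees and parse_lat_long are hypothetical helpers, not DataPrep functions.

```python
import re
from typing import Tuple

DMS_PATTERN = re.compile(
    r"(?P<deg>\d+(?:\.\d+)?)°\s*"
    r"(?:(?P<min>\d+(?:\.\d+)?)′\s*)?"
    r"(?:(?P<sec>\d+(?:\.\d+)?)″\s*)?"
    r"(?P<hem>[NSEW])"
)


def to_decimal_degrees(text: str) -> float:
    """Convert one coordinate such as '51° 29′ 36.24″ N' to decimal degrees."""
    match = DMS_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees = float(match["deg"])
    minutes = float(match["min"] or 0)
    seconds = float(match["sec"] or 0)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal degrees.
    return -value if match["hem"] in "SW" else value


def parse_lat_long(text: str) -> Tuple[float, float]:
    """Split 'lat, long' and convert both parts, rounded to 4 decimals."""
    lat_text, lon_text = text.split(",")
    return (round(to_decimal_degrees(lat_text), 4),
            round(to_decimal_degrees(lon_text), 4))


dms = parse_lat_long("51° 29′ 36.24″ N, 0° 0′ 35.28″ E")
ddh = parse_lat_long("51.4934° N, 0.0098° E")
```

Both inputs from the example above yield approximately (51.4934, 0.0098).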

dataprep.clean.clean_lat_long.validate_lat_long(x, *, lat_long=True, lat=False, lon=False)[source]

Validate latitude and longitude coordinates.

Read more in the User Guide.

Parameters
  • x (Union[Series, str, float, Tuple[float, float]]) – A pandas Series, string, float, or tuple of floats, containing the latitude and/or longitude coordinates to be validated.

  • lat_long (bool) –

    If True, valid values contain latitude and longitude coordinates. Parameters lat and lon must be False if lat_long is True.

    (default: True)

  • lat (bool) –

    If True, valid values contain only latitude coordinates. Parameters lat_long and lon must be False if lat is True.

    (default: False)

  • lon (bool) –

    If True, valid values contain only longitude coordinates. Parameters lat_long and lat must be False if lon is True.

    (default: False)

Examples

Validate a coordinate string or series of coordinates.

>>> validate_lat_long('51° 29′ 36.24″ N, 0° 0′ 35.28″ E')
True
>>> df = pd.DataFrame({'coordinates': ['51° 29′ 36.24″ N, 0° 0′ 35.28″ E', 'NaN']})
>>> validate_lat_long(df['coordinates'])
0     True
1    False
Name: coordinates, dtype: bool
Return type

Union[bool, Series]

IP Addresses

Clean and validate a DataFrame column containing IP addresses.

dataprep.clean.clean_ip.clean_ip(df, column, input_format='auto', output_format='compressed', inplace=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize IP addresses.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing IP addresses.

  • input_format (str) –

    The input format of the IP addresses.
    • ’auto’: parse both ipv4 and ipv6 addresses.

    • ’ipv4’: only parse ipv4 addresses.

    • ’ipv6’: only parse ipv6 addresses.

    (default: ‘auto’)

  • output_format (str) –

    The desired output format of the IP addresses.
    • ’compressed’: compressed representation (‘12.3.4.5’)

    • ’full’: full representation (‘0012.0003.0004.0005’)

    • ’binary’: binary representation (‘00001100000000110000010000000101’)

    • ’hexa’: hexadecimal representation (‘0xc030405’)

    • ’integer’: integer representation (201524229)

    • ’packed’: packed binary representation (big-endian, a bytes object)

    (default: ‘compressed’)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

>>> df = pd.DataFrame({'ip': ['2001:0db8:85a3:0000:0000:8a2e:0370:7334', '233.5.6.000']})
>>> clean_ip(df, 'ip')
IP Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
                                        ip                      ip_clean
0  2001:0db8:85a3:0000:0000:8a2e:0370:7334  2001:db8:85a3::8a2e:370:7334
1                              233.5.6.000                     233.5.6.0
Return type

Union[DataFrame, DataFrame]
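
Most of the output formats listed above correspond to properties of Python's standard ipaddress module. The sketch below shows that correspondence; ip_formats is a hypothetical helper, and the ‘full’ entry uses the module's exploded form, which pads IPv6 groups but leaves IPv4 addresses unpadded, unlike the IPv4 example given for ‘full’ above.

```python
import ipaddress
from typing import Any, Dict


def ip_formats(value: str) -> Dict[str, Any]:
    """Return several representations of one IPv4 or IPv6 address."""
    addr = ipaddress.ip_address(value)
    bits = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
    return {
        "compressed": addr.compressed,
        "full": addr.exploded,                      # stdlib exploded form
        "binary": format(int(addr), f"0{bits}b"),   # zero-padded bit string
        "hexa": hex(int(addr)),
        "integer": int(addr),
        "packed": addr.packed,                      # big-endian bytes
    }


v4 = ip_formats("12.3.4.5")
v6 = ip_formats("2001:0db8:85a3:0000:0000:8a2e:0370:7334")
```

For '12.3.4.5' this gives the integer 201524229, the hexadecimal string '0xc030405', and the 32-bit binary string shown in the parameter list.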

dataprep.clean.clean_ip.validate_ip(x, input_format='auto')[source]

Validate IP addresses.

Read more in the User Guide.

Parameters
  • x (Union[str, Series]) – pandas Series of IP addresses or a str ip address value

  • input_format (str) –

    The IP address format to validate.
    • ’auto’: validate both ipv4 and ipv6 addresses.

    • ’ipv4’: only validate ipv4 addresses.

    • ’ipv6’: only validate ipv6 addresses.

    (default: ‘auto’)

Examples

>>> validate_ip('fdf8:f53b:82e4::53')
True
>>> df = pd.DataFrame({'ip': ['fdf8:f53b:82e4::53', None]})
>>> validate_ip(df['ip'])
0     True
1    False
Name: ip, dtype: bool
Return type

Union[bool, Series]

Phone Numbers

Clean and validate a DataFrame column containing phone numbers.

dataprep.clean.clean_phone.clean_phone(df, column, output_format='nanp', fix_missing='empty', split=False, inplace=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize phone numbers.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing phone numbers.

  • output_format (str) –

    The desired format of the phone numbers.
    • ’nanp’: ‘NPA-NXX-XXXX’

    • ’e164’: ‘+1NPANXXXXXX’

    • ’national’: ‘(NPA) NXX-XXXX’

    (default: ‘nanp’)

  • fix_missing (str) –

    Fix the missing country code of a parsed phone number.
    • ’empty’: leave the missing component as is.

    • ’auto’: set the country code to a default value (1).

    (default: ‘empty’)

  • split (bool) –

    If True, split a column containing a phone number into different columns containing individual components.

    (default: False)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

>>> df = pd.DataFrame({'phone': ['555-234-5678', '(555) 234-5678', '555.234.5678']})
>>> clean_phone(df, 'phone')
Phone Number Cleaning Report:
    2 values cleaned (66.67%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
            phone   phone_clean
0    555-234-5678  555-234-5678
1  (555) 234-5678  555-234-5678
2    555.234.5678  555-234-5678
Return type

DataFrame
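
The three output formats can be illustrated by extracting the digits and rearranging them. This sketch assumes ten-digit NANP numbers with an optional leading country code of 1, which is narrower than what DataPrep accepts; format_nanp is a hypothetical helper.

```python
import re
from typing import Optional


def format_nanp(raw: str, output_format: str = "nanp") -> Optional[str]:
    """Format a ten-digit North American number, or return None if it does not fit."""
    digits = re.sub(r"\D", "", raw)
    # Drop a leading country code of 1 from an 11-digit number.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    npa, nxx, line = digits[:3], digits[3:6], digits[6:]
    if output_format == "nanp":
        return f"{npa}-{nxx}-{line}"
    if output_format == "e164":
        return f"+1{npa}{nxx}{line}"
    if output_format == "national":
        return f"({npa}) {nxx}-{line}"
    raise ValueError(f"unsupported output format: {output_format}")


phones = ["555-234-5678", "(555) 234-5678", "555.234.5678"]
cleaned = [format_nanp(p) for p in phones]
```

All three inputs from the example above map to '555-234-5678' in the default ‘nanp’ format.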

dataprep.clean.clean_phone.validate_phone(x)[source]

Validate phone numbers.

Read more in the User Guide.

Parameters

x (Union[str, Series]) – pandas Series of phone numbers or a string/int containing a phone number.

Examples

>>> validate_phone('1 800 234 6789')
True
>>> df = pd.DataFrame({'phone': [1234567, '1234']})
>>> validate_phone(df['phone'])
0     True
1    False
Name: phone, dtype: bool
Return type

Union[bool, Series]

Text

Clean a DataFrame column containing text data.

dataprep.clean.clean_text.clean_text(df, column, pipeline=None, stopwords=None)[source]

Clean text data in a DataFrame column.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing text data.

  • pipeline (Optional[List[Dict[str, Any]]]) –

    A list of cleaning functions to be applied to the column. If None, use the default pipeline. See the User Guide for more information on customizing the pipeline.

    (default: None)

  • stopwords (Optional[Set[str]]) –

    A set of words to be removed from the column. If None, use NLTK’s stopwords.

    (default: None)

Examples

Clean a column of text data using the default pipeline.

>>> df = pd.DataFrame({"text": ["This show was an amazing, fresh & innovative idea in the 70's when it first aired."]})
>>> clean_text(df, 'text')
                                             text
0  show amazing fresh innovative idea first aired
Return type

DataFrame

dataprep.clean.clean_text.default_text_pipeline()[source]

Return a list of dictionaries representing the functions in the default pipeline. Use as a template for creating a custom pipeline.

Read more in the User Guide.

Examples

>>> default_text_pipeline()
[{'operator': 'fillna'}, {'operator': 'lowercase'}, {'operator': 'remove_digits'},
{'operator': 'remove_html'}, {'operator': 'remove_urls'}, {'operator': 'remove_punctuation'},
{'operator': 'remove_accents'}, {'operator': 'remove_stopwords', 'parameters':
{'stopwords': None}}, {'operator': 'remove_whitespace'}]
Return type

List[Dict[str, Any]]
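
A pipeline in this list-of-dictionaries form can be applied by looking up each operator and calling it in order. The sketch below is an assumption about how such a structure might be executed, not DataPrep's code: OPERATORS covers only a subset of the default operators, and DEFAULT_STOPWORDS is a tiny hypothetical stand-in for NLTK's stopword list.

```python
import re
import string
from typing import Any, Dict, List, Optional, Set

# Hypothetical stand-in for NLTK's stopword list, kept tiny for illustration.
DEFAULT_STOPWORDS = {"this", "was", "an", "in", "the", "s", "when", "it"}


def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(text: str, stopwords: Optional[Set[str]] = None) -> str:
    words = DEFAULT_STOPWORDS if stopwords is None else stopwords
    return " ".join(w for w in text.split() if w not in words)


# Map operator names to functions (a subset of the default pipeline).
OPERATORS = {
    "lowercase": str.lower,
    "remove_digits": lambda text: re.sub(r"\d+", "", text),
    "remove_punctuation": remove_punctuation,
    "remove_stopwords": remove_stopwords,
    "remove_whitespace": lambda text: " ".join(text.split()),
}


def run_pipeline(text: str, pipeline: List[Dict[str, Any]]) -> str:
    """Apply each operator in order, passing any 'parameters' as keyword arguments."""
    for step in pipeline:
        func = OPERATORS[step["operator"]]
        text = func(text, **step.get("parameters", {}))
    return text


PIPELINE = [
    {"operator": "lowercase"},
    {"operator": "remove_digits"},
    {"operator": "remove_punctuation"},
    {"operator": "remove_stopwords", "parameters": {"stopwords": None}},
    {"operator": "remove_whitespace"},
]

result = run_pipeline(
    "This show was an amazing, fresh & innovative idea in the 70's when it first aired.",
    PIPELINE,
)
```

With the stand-in stopword list, result is 'show amazing fresh innovative idea first aired'.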

URLs

Clean and validate a DataFrame column containing URLs.

dataprep.clean.clean_url.clean_url(df, column, remove_auth=False, inplace=False, split=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize URLs.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing URL addresses.

  • remove_auth (Union[bool, List[str]]) –

    Can be a boolean value or list of strings representing the names of Auth queries to be removed. If True, remove default Auth values. If False, do not remove Auth values.

    (default: False)

  • split (bool) –

    If True, split the URL into the scheme, hostname, queries, cleaned_url columns. If False, return a column of dictionaries with the relevant information (e.g., scheme, hostname, etc.) as key-value pairs.

    (default: False)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Parse URLs into dictionaries of their components.

>>> df = pd.DataFrame({'url': ['https://github.com/sfu-db/dataprep','https://www.google.com/']})
>>> clean_url(df, 'url')
URL Cleaning Report:
    2 values parsed (100.0%)
Result contains 2 (100.0%) parsed key-value pairs and 0 null values (0.0%)
                                url                                        url_details
0  https://github.com/sfu-db/dataprep  {'scheme': 'https', 'host': 'github.com', 'url...
1             https://www.google.com/  {'scheme': 'https', 'host': 'www.google.com', ...
Return type

Union[DataFrame, DataFrame]
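
Splitting a URL into components and dropping Auth queries can be sketched with urllib.parse. The dictionary keys, the AUTH_KEYS defaults, and the url_details helper are all hypothetical; DataPrep's own key names and default Auth values may differ.

```python
from typing import Any, Dict, Iterable, Union
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

AUTH_KEYS = {"access_token", "api_key", "auth", "password"}  # hypothetical defaults


def url_details(url: str, remove_auth: Union[bool, Iterable[str]] = False) -> Dict[str, Any]:
    """Return the scheme, host, queries and a cleaned URL for one input URL."""
    parts = urlsplit(url)
    queries = dict(parse_qsl(parts.query))
    if remove_auth:
        # True uses the default key names; otherwise use the names provided.
        keys = AUTH_KEYS if remove_auth is True else set(remove_auth)
        queries = {k: v for k, v in queries.items() if k not in keys}
    cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(queries), ""))
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "cleaned_url": cleaned,
        "queries": queries,
    }


plain = url_details("https://github.com/sfu-db/dataprep")
stripped = url_details("https://example.com/search?q=data&api_key=secret", remove_auth=True)
```

In this sketch, stripped keeps only the q query, and its cleaned_url is 'https://example.com/search?q=data'.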

dataprep.clean.clean_url.validate_url(x)[source]

Validate URLs.

Read more in the User Guide.

Parameters

x (Union[str, Series]) – pandas Series of URLs or string URL.

Examples

>>> validate_url('https://github.com/sfu-db/dataprep')
True
>>> df = pd.DataFrame({'url': ['https://www.google.com/', 'NaN']})
>>> validate_url(df['url'])
0     True
1    False
Name: url, dtype: bool
Return type

Union[bool, Series]

US Street Addresses

Clean and validate a DataFrame column containing US street addresses.

dataprep.clean.clean_address.clean_address(df, column, output_format='(building) house_number street_prefix_abbr street_name street_suffix_abbr, apartment, city, state_abbr zipcode', must_contain=('house_number', 'street_name'), split=False, inplace=False, errors='coerce', report=True, progress=True)[source]

Clean and standardize US street addresses.

Read more in the User Guide.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing addresses.

  • output_format (str) –

    The output format can be specified using the following keywords.
    • ’house_number’: ‘1234’

    • ’street_prefix_abbr’: ‘N’, ‘S’, ‘E’, or ‘W’

    • ’street_prefix_full’: ‘North’, ‘South’, ‘East’, or ‘West’

    • ’street_name’: ‘Main’

    • ’street_suffix_abbr’: ‘St’, ‘Ave’

    • ’street_suffix_full’: ‘Street’, ‘Avenue’

    • ’apartment’: ‘Apt 1’

    • ’building’: ‘Staples Center’

    • ’city’: ‘Los Angeles’

    • ’state_abbr’: ‘CA’

    • ’state_full’: ‘California’

    • ’zipcode’: ‘57903’

    The output_format can contain ‘\t’ characters to specify how to split the output into columns.

    (default: ‘(building) house_number street_prefix_abbr street_name street_suffix_abbr, apartment, city, state_abbr zipcode’)

  • must_contain (Tuple[str, …]) –

    A tuple containing parts of the address that must be included for the address to be successfully cleaned.

    • ’house_number’: ‘1234’

    • ’street_prefix’: ‘N’, ‘North’

    • ’street_name’: ‘Main’

    • ’street_suffix’: ‘St’, ‘Avenue’

    • ’apartment’: ‘Apt 1’

    • ’building’: ‘Staples Center’

    • ’city’: ‘Los Angeles’

    • ’state’: ‘CA’, ‘California’

    • ’zipcode’: ‘57903’

    (default: (‘house_number’, ‘street_name’))

  • split (bool) –

    If True, each component of the address specified by the output_format parameter will be put into its own column.

    For example if output_format = “house_number street_name” and split = True, then there will be one column for house_number and another for street_name.

    (default: False)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • report (bool) –

    If True, output the summary report. Otherwise, no report is outputted.

    (default: True)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean addresses and add the house number and street name to separate columns.

>>> df = pd.DataFrame({'address': ['123 pine avenue', '1234 w main st 57033']})
>>> clean_address(df, 'address', output_format='house_number \t street_name')
Address Cleaning Report:
    2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
                address house_number street_name
0       123 pine avenue          123        Pine
1  1234 w main st 57033         1234        Main
Return type

DataFrame

dataprep.clean.clean_address.validate_address(x, must_contain=('house_number', 'street_name'))[source]

Validate US street addresses.

Read more in the User Guide.

Parameters
  • x (Union[str, Series]) – pandas Series of addresses or a string containing an address.

  • must_contain (Tuple[str, …]) –

    A tuple containing parts of the address that must be included for the address to be considered valid.

    • ’house_number’: ‘1234’

    • ’street_prefix’: ‘N’, ‘North’

    • ’street_name’: ‘Main’

    • ’street_suffix’: ‘St’, ‘Avenue’

    • ’apartment’: ‘Apt 1’

    • ’building’: ‘Staples Center’

    • ’city’: ‘Los Angeles’

    • ’state’: ‘CA’, ‘California’

    • ’zipcode’: ‘57903’

    (default: (‘house_number’, ‘street_name’))

Examples

>>> df = pd.DataFrame({'address': ['123 pine avenue', 'NULL']})
>>> validate_address(df['address'])
0     True
1    False
Name: address, dtype: bool
Return type

Union[bool, Series]

ISBN Numbers

Clean and validate a DataFrame column containing ISBN numbers.

dataprep.clean.clean_isbn.clean_isbn(df, column, output_format='standard', split=False, inplace=False, errors='coerce', progress=True)[source]

Clean ISBN type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of ISBN type.

  • output_format (str) –

    The output format of the standardized number string.
    • ‘compact’: return the string without any separators.

    • ‘standard’: return the string with proper separators.

    • ‘isbn13’: return the ISBN string with 13 digits.

    • ‘isbn10’: return the ISBN string with 10 digits.

    (default: “standard”)

  • split (bool) –

    If True, each component derived from the number string will be put into its own column.

    (default: False)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of ISBN data.

>>> df = pd.DataFrame({
        "isbn": [
        "978-9024538270",
        "978-9024538271"]
        })
>>> clean_isbn(df, 'isbn', inplace=True)
          isbn_clean
0  978-90-245-3827-0
1                NaN
Return type

DataFrame

dataprep.clean.clean_isbn.validate_isbn(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid ISBN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
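
The example above rejects 978-9024538271 because its check digit does not match. As a rough illustration of that rule only, and not DataPrep's own implementation, the ISBN-13 check can be sketched in plain Python. The helper name is_valid_isbn13 is hypothetical, and ISBN-10 values are deliberately not handled.

```python
def is_valid_isbn13(value: str) -> bool:
    """Return True if value is a 13-digit ISBN with a correct check digit."""
    # Separators are optional, so drop hyphens and spaces first.
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be a multiple of 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


print(is_valid_isbn13("978-9024538270"))  # True
print(is_valid_isbn13("978-9024538271"))  # False
```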

For Downstream ML Tasks

Prepare training and test DataFrames for downstream ML tasks with the clean_ml function.

dataprep.clean.clean_ml.clean_ml(training_df, test_df, target='target', cat_imputation='constant', cat_null_value=None, fill_val='missing_value', num_imputation='mean', num_null_value=None, cat_encoding='one_hot', variance_threshold=False, variance=0.0, num_scaling='standardize', include_operators=None, exclude_operators=None, customized_cat_pipeline=None, customized_num_pipeline=None)[source]

This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application.

Parameters
  • training_df (Union[DataFrame, DataFrame]) – Training dataframe. Pandas or Dask DataFrame.

  • test_df (Union[DataFrame, DataFrame]) – Test dataframe. Pandas or Dask DataFrame.

  • target (str) – Name of target column. String.

  • cat_imputation (str) –

    The mode of imputation for categorical columns.

    • “constant”: all missing values are filled with fill_val.

    • “most_frequent”: all missing values are filled with the most frequent value.

    • “drop”: all categorical columns with missing values are dropped.

  • cat_null_value (Optional[List[Any]]) – Specified categorical null values which should be recognized.

  • fill_val (str) –

    When cat_imputation = “constant”, all missing values are filled with fill_val.

  • num_imputation (str) –

    The mode of imputation for numerical columns.

    • “mean”: all missing values are filled with the mean value.

    • “median”: all missing values are filled with the median value.

    • “most_frequent”: all missing values are filled with the most frequent value.

    • “drop”: all numerical columns with missing values are dropped.

  • num_null_value (Optional[List[Any]]) – Specified numerical null values which should be recognized.

  • cat_encoding (str) – The mode of encoding for categorical columns. If “one_hot”, apply one-hot encoding. If “no_encoding”, leave the columns unencoded.

  • variance_threshold (bool) –

    If True, drop numerical columns whose variance is less than variance.

  • variance (float) – The variance threshold used when variance_threshold = True.

  • num_scaling (str) – The mode of scaling for numerical columns. If “standardize”, standardize all numerical columns. If “minmax”, apply min-max scaling to all numerical columns. If “maxabs”, apply max-abs scaling to all numerical columns. If “no_scaling”, leave the columns unscaled.

  • include_operators (Optional[List[str]]) – Components included for clean_ml, like “one_hot”, “standardize”, etc.

  • exclude_operators (Optional[List[str]]) – Components excluded for clean_ml, like “one_hot”, “standardize”, etc.

Return type

Tuple[DataFrame, DataFrame]
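
To make the default numerical settings (num_imputation='mean', num_scaling='standardize') more concrete, here is a minimal plain-Python sketch of what mean imputation followed by standardization means for one column. It is an illustration of the concepts, not DataPrep's code: the helper names are hypothetical, and it assumes that statistics are learned from the training column, reused for the test column, and that the population standard deviation is used.

```python
from statistics import mean, pstdev


def fit_numeric(train):
    """Learn the fill value and scaling statistics from the training column."""
    observed = [x for x in train if x is not None]
    fill = mean(observed)
    filled = [fill if x is None else x for x in train]
    # Assumption: scale with the population standard deviation of the filled column.
    return fill, mean(filled), pstdev(filled)


def transform_numeric(column, fill, mu, sigma):
    """Fill missing values, then rescale using the training statistics."""
    filled = [fill if x is None else x for x in column]
    return [(x - mu) / sigma for x in filled]


train = [1.0, None, 3.0]
test = [2.0, None]
params = fit_numeric(train)
print(transform_numeric(train, *params))
print(transform_numeric(test, *params))  # [0.0, 0.0]
```

A constant training column would give a standard deviation of zero and a division error in this sketch; dropping such columns is what a variance threshold is for.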

dataprep.clean.clean_ml.format_data_with_customized_cat(training_row, test_row, num_imputation='mean', num_null_value=None, variance_threshold=False, variance=0.0, num_scaling='standardize', include_operators=None, exclude_operators=None, customized_cat_pipeline=None)[source]

This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. A customized categorical pipeline and the related parameters should be provided by the user.

Parameters
  • training_row (Series) – One column of training dataset. Dask Series.

  • test_row (Series) – One column of test dataset. Dask Series.

  • num_imputation (str) –

    The mode of imputation for numerical columns.

    • “mean”: all missing values are filled with the mean value.

    • “median”: all missing values are filled with the median value.

    • “most_frequent”: all missing values are filled with the most frequent value.

    • “drop”: all numerical columns with missing values are dropped.

  • num_null_value (Optional[List[Any]]) – Specified numerical null values which should be recognized.

  • variance_threshold (bool) – If True, drop numerical columns whose variance is less than variance.

  • variance (float) – The variance threshold used when variance_threshold = True.

  • num_scaling (str) – The mode of scaling for numerical columns. If “standardize”, standardize all numerical columns. If “minmax”, apply min-max scaling to all numerical columns. If “maxabs”, apply max-abs scaling to all numerical columns. If “no_scaling”, leave the columns unscaled.

  • include_operators (Optional[List[str]]) – Components included for clean_ml, like “one_hot”, “standardize”, etc.

  • exclude_operators (Optional[List[str]]) – Components excluded for clean_ml, like “one_hot”, “standardize”, etc.

  • customized_cat_pipeline (Optional[List[Dict[str, Any]]]) – User-specified pipeline managing categorical columns.

Return type

Tuple[Series, Series]

dataprep.clean.clean_ml.format_data_with_customized_cat_and_num(training_row, test_row, include_operators=None, exclude_operators=None, customized_cat_pipeline=None, customized_num_pipeline=None)[source]

This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. Both customized pipeline managing categorical columns and numerical columns should be provided.

Parameters
  • training_row (Series) – One column of training dataset. Dask Series.

  • test_row (Series) – One column of test dataset. Dask Series.

  • include_operators (Optional[List[str]]) – Components included for clean_ml, like “one_hot”, “standardize”, etc.

  • exclude_operators (Optional[List[str]]) – Components excluded for clean_ml, like “one_hot”, “standardize”, etc.

  • customized_cat_pipeline (Optional[List[Dict[str, Any]]]) – User-specified pipeline managing categorical columns.

  • customized_num_pipeline (Optional[List[Dict[str, Any]]]) – User-specified pipeline managing numerical columns.

Return type

Tuple[Series, Series]

dataprep.clean.clean_ml.format_data_with_customized_num(training_row, test_row, cat_imputation='constant', cat_null_value=None, fill_val='missing_value', cat_encoding='one_hot', include_operators=None, exclude_operators=None, customized_num_pipeline=None)[source]

This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. A customized numerical pipeline and the related parameters should be provided by the user.

Parameters
  • training_row (Series) – One column of training dataset. Dask Series.

  • test_row (Series) – One column of test dataset. Dask Series.

  • cat_imputation (str) –

    The mode of imputation for categorical columns.

    • “constant”: all missing values are filled with fill_val.

    • “most_frequent”: all missing values are filled with the most frequent value.

    • “drop”: all categorical columns with missing values are dropped.

  • cat_null_value (Optional[List[Any]]) – Specified categorical null values which should be recognized.

  • fill_val (str) – When cat_imputation = “constant”, all missing values are filled with fill_val.

  • cat_encoding (str) – The mode of encoding for categorical columns. If “one_hot”, apply one-hot encoding. If “no_encoding”, leave the columns unencoded.

  • include_operators (Optional[List[str]]) – Components included for clean_ml, like “one_hot”, “standardize”, etc.

  • exclude_operators (Optional[List[str]]) – Components excluded for clean_ml, like “one_hot”, “standardize”, etc.

  • customized_num_pipeline (Optional[List[Dict[str, Any]]]) – User-specified pipeline managing numerical columns.

Return type

Tuple[Series, Series]

dataprep.clean.clean_ml.format_data_with_default(training_row, test_row, cat_imputation='constant', cat_null_value=None, fill_val='missing_value', num_imputation='mean', num_null_value=None, cat_encoding='one_hot', variance_threshold=True, variance=0.0, num_scaling='standardize', include_operators=None, exclude_operators=None)[source]

This function transforms an arbitrary tabular dataset into a format that’s suitable for a typical ML application. No customized pipeline should be provided. Use default pipeline.

Parameters
  • training_row (Series) – One column of training dataset. Dask Series.

  • test_row (Series) – One column of test dataset. Dask Series.

  • cat_imputation (str) –

    The mode of imputation for categorical columns.

    • “constant”: all missing values are filled with fill_val.

    • “most_frequent”: all missing values are filled with the most frequent value.

    • “drop”: all categorical columns with missing values are dropped.

  • cat_null_value (Optional[List[Any]]) – Specified categorical null values which should be recognized.

  • fill_val (str) – When cat_imputation = “constant”, all missing values are filled with fill_val.

  • num_imputation (str) –

    The mode of imputation for numerical columns.

    • “mean”: all missing values are filled with the mean value.

    • “median”: all missing values are filled with the median value.

    • “most_frequent”: all missing values are filled with the most frequent value.

    • “drop”: all numerical columns with missing values are dropped.

  • num_null_value (Optional[List[Any]]) – Specified numerical null values which should be recognized.

  • cat_encoding (str) – The mode of encoding for categorical columns. If “one_hot”, apply one-hot encoding. If “no_encoding”, leave the columns unencoded.

  • variance_threshold (bool) – If True, drop numerical columns whose variance is less than variance.

  • variance (float) – The variance threshold used when variance_threshold = True.

  • num_scaling (str) – The mode of scaling for numerical columns. If “standardize”, standardize all numerical columns. If “minmax”, apply min-max scaling to all numerical columns. If “maxabs”, apply max-abs scaling to all numerical columns. If “no_scaling”, leave the columns unscaled.

  • include_operators (Optional[List[str]]) – Components included for clean_ml, like “one_hot”, “standardize”, etc.

  • exclude_operators (Optional[List[str]]) – Components excluded for clean_ml, like “one_hot”, “standardize”, etc.

Return type

Tuple[Series, Series]

Australian Business Numbers

Clean and validate a DataFrame column containing Australian Business Numbers (ABNs).

dataprep.clean.clean_au_abn.clean_au_abn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Australian Business Numbers (ABNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of ABN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of ABN data.

>>> df = pd.DataFrame({
        "abn": [
        "51824753556",
        "99999999999",]
        })
>>> clean_au_abn(df, 'abn')
        abn                 abn_clean
0       51824753556         51 824 753 556
1       99999999999         NaN
Return type

DataFrame

dataprep.clean.clean_au_abn.validate_au_abn(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid ABN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
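
For readers wondering why 99999999999 becomes NaN in the example above, the published ABN rule is a weighted checksum modulo 89. The following plain-Python sketch shows that rule and the spaced ‘standard’ layout seen in the example output; the helper names are hypothetical and this is not DataPrep's implementation.

```python
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(value: str) -> bool:
    """Check the ABN checksum: weighted digit sum must be divisible by 89."""
    digits = value.replace(" ", "")
    if len(digits) != 11 or not (digits.isascii() and digits.isdigit()):
        return False
    numbers = [int(d) for d in digits]
    numbers[0] -= 1  # the rule subtracts 1 from the leading digit first
    return sum(w * n for w, n in zip(ABN_WEIGHTS, numbers)) % 89 == 0


def format_abn(value: str) -> str:
    """Group the 11 digits as 2-3-3-3, as in the example output."""
    d = value.replace(" ", "")
    return f"{d[:2]} {d[2:5]} {d[5:8]} {d[8:]}"


print(is_valid_abn("51824753556"))  # True
print(is_valid_abn("99999999999"))  # False
print(format_abn("51824753556"))    # 51 824 753 556
```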

Australian Company Numbers

Clean and validate a DataFrame column containing Australian Company Numbers (ACNs).

dataprep.clean.clean_au_acn.clean_au_acn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Australian Company Numbers (ACNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of ACN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘abn’, convert the number to an Australian Business Number (ABN).

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of ACN data.

>>> df = pd.DataFrame({
        "acn": [
        "004085616",
        "999 999 999"]
        })
>>> clean_au_acn(df, 'acn')
        acn             acn_clean
0       004085616       004 085 616
1       999 999 999     NaN
Return type

DataFrame

dataprep.clean.clean_au_acn.validate_au_acn(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid ACN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
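
The ACN check behind the example above can be sketched as follows. This is a plain-Python illustration of the published check-digit rule (weights 8 down to 1 over the first eight digits), with a hypothetical helper name; it is not taken from DataPrep's source.

```python
def is_valid_acn(value: str) -> bool:
    """Check the ninth digit of an ACN against the first eight."""
    digits = value.replace(" ", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights 8, 7, ..., 1 are applied to the first eight digits.
    total = sum(int(d) * w for d, w in zip(digits[:8], range(8, 0, -1)))
    # The check digit is the complement of the sum modulo 10.
    return (10 - total % 10) % 10 == int(digits[8])


print(is_valid_acn("004085616"))    # True
print(is_valid_acn("999 999 999"))  # False
```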

Australian Tax File Numbers

Clean and validate a DataFrame column containing Australian Tax File Numbers (TFNs).

dataprep.clean.clean_au_tfn.clean_au_tfn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Australian Tax File Numbers (TFNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of TFN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of TFN data.

>>> df = pd.DataFrame({
        "tfn": [
        "123456782",
        "999 999 999"]
        })
>>> clean_au_tfn(df, 'tfn')
        tfn             tfn_clean
0       123456782       123 456 782
1       999 999 999     NaN
Return type

DataFrame

dataprep.clean.clean_au_tfn.validate_au_tfn(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid TFN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
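
As a hedged illustration of what makes 123456782 valid and 999 999 999 invalid in the example above, the sketch below applies the commonly documented TFN weights to a nine-digit number. The helper name is hypothetical, eight-digit TFNs are not handled, and DataPrep's own logic may differ in detail.

```python
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def is_valid_tfn9(value: str) -> bool:
    """Check a nine-digit TFN: weighted digit sum must be divisible by 11."""
    digits = value.replace(" ", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        return False
    return sum(w * int(d) for w, d in zip(TFN_WEIGHTS, digits)) % 11 == 0


print(is_valid_tfn9("123456782"))    # True
print(is_valid_tfn9("999 999 999"))  # False
```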

Belgian IBAN Numbers

Clean and validate a DataFrame column containing Belgian IBANs.

dataprep.clean.clean_be_iban.clean_be_iban(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Belgian IBAN (International Bank Account Number) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of Belgian IBAN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘bic’, return the BIC for the bank that this number refers to.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of Belgian IBANs data.

>>> df = pd.DataFrame({
        "be_iban": [
        "BE32 123-4567890-02",
        "BE41091811735141"]
        })
>>> clean_be_iban(df, 'be_iban')
        be_iban                 be_iban_clean
0       BE32 123-4567890-02     BE32123456789002
1       BE41091811735141        NaN
Return type

DataFrame

dataprep.clean.clean_be_iban.validate_be_iban(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid Belgian IBAN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
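
The sketch below illustrates two checks that are commonly applied to a Belgian IBAN: the generic IBAN mod-97 test, and the Belgian account rule that the last two account digits equal the first ten modulo 97. Whether DataPrep applies exactly these two checks is an assumption here, and the helper name is hypothetical.

```python
def is_valid_be_iban(value: str) -> bool:
    """Check a Belgian IBAN with the generic and the national checksum."""
    iban = "".join(c for c in value.upper() if c not in " -")
    body = iban[2:]
    if len(iban) != 16 or not iban.startswith("BE"):
        return False
    if not (body.isascii() and body.isdigit()):
        return False
    # Generic IBAN rule: move the first four characters to the end, map
    # letters to numbers (A=10 ... Z=35), and require a remainder of 1.
    rearranged = iban[4:] + iban[:4]
    if int("".join(str(int(c, 36)) for c in rearranged)) % 97 != 1:
        return False
    # Belgian account rule (assumed): last two digits are the first ten
    # digits modulo 97, with a remainder of 0 written as 97.
    account = iban[4:]
    return int(account[10:]) == (int(account[:10]) % 97 or 97)


print(is_valid_be_iban("BE32 123-4567890-02"))  # True
print(is_valid_be_iban("BE41091811735141"))     # False
```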

Belgian VAT Numbers

Clean and validate a DataFrame column containing Belgian VAT numbers (VATs).

dataprep.clean.clean_be_vat.clean_be_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Belgian VAT numbers (VATs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of VAT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
        "vat": [
        'BE403019261',
        'BE431150351',]
        })
>>> clean_be_vat(df, 'vat')
        vat             vat_clean
0       BE403019261     0403019261
1       BE431150351     NaN
Return type

DataFrame

dataprep.clean.clean_be_vat.validate_be_vat(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid Belgian VAT number. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
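
In the example above, BE403019261 is cleaned to the ten-digit 0403019261 while BE431150351 is rejected. A minimal sketch of the usual Belgian VAT rule (the last two digits equal 97 minus the first eight digits modulo 97) reproduces that outcome. The helper name is hypothetical, and padding nine-digit inputs with a leading zero is inferred from the example output rather than from DataPrep's source.

```python
def is_valid_be_vat(value: str) -> bool:
    """Check a Belgian VAT number using the mod-97 rule."""
    number = "".join(c for c in value.upper() if c not in " .-")
    if number.startswith("BE"):
        number = number[2:]
    if len(number) == 9:
        number = "0" + number  # old nine-digit numbers gain a leading zero
    if len(number) != 10 or not (number.isascii() and number.isdigit()):
        return False
    return 97 - int(number[:8]) % 97 == int(number[8:])


print(is_valid_be_vat("BE403019261"))  # True
print(is_valid_be_vat("BE431150351"))  # False
```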

Bulgarian National Identification Numbers

Clean and validate a DataFrame column containing Bulgarian national identification numbers (EGNs).

dataprep.clean.clean_bg_egn.clean_bg_egn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Bulgarian national identification numbers (EGNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of EGN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birth date attained from the number. Note: in the case of EGN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of EGN data.

>>> df = pd.DataFrame({
        "egn": [
        '752316 926 3',
        '7552A10004']
        })
>>> clean_bg_egn(df, 'egn')
        egn             egn_clean
0       752316 926 3    7523169263
1       7552A10004      NaN
Return type

DataFrame

dataprep.clean.clean_bg_egn.validate_bg_egn(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid EGN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
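
An EGN carries a check digit in its tenth position. The sketch below shows only that checksum step, using the commonly published weights; it does not verify the encoded birth date, which a full validator would presumably also check. The helper name is hypothetical and the code is not DataPrep's own.

```python
EGN_WEIGHTS = (2, 4, 8, 5, 10, 9, 7, 3, 6)


def has_valid_egn_checksum(value: str) -> bool:
    """Check the tenth digit of an EGN against the first nine."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(w * int(d) for w, d in zip(EGN_WEIGHTS, digits))
    # A remainder of 10 is written as check digit 0.
    return total % 11 % 10 == int(digits[9])


print(has_valid_egn_checksum("752316 926 3"))  # True
print(has_valid_egn_checksum("7552A10004"))    # False
```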

Bulgarian VAT Numbers

Clean and validate a DataFrame column containing Bulgarian VAT numbers (VATs).

dataprep.clean.clean_bg_vat.clean_bg_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Bulgarian VAT numbers (VATs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of VAT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
        "vat": [
        'BG 175 074 752',
        '175074752',
        '175074751']
        })
>>> clean_bg_vat(df, 'vat')
        vat             vat_clean
0       BG 175 074 752  175074752
1       175074752       175074752
2       175074751       NaN
Return type

DataFrame

dataprep.clean.clean_bg_vat.validate_bg_vat(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid Bulgarian VAT number. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
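
The three inputs in the example above differ only in prefix, spacing, and the final digit. A rough sketch of the check-digit rule for nine-digit Bulgarian VAT numbers (the form used by legal entities) is shown below. Ten-digit numbers follow other rules and are not covered; the helper name is hypothetical and the code is an illustration rather than DataPrep's implementation.

```python
def is_valid_bg_vat9(value: str) -> bool:
    """Check a nine-digit Bulgarian VAT number."""
    number = value.upper().replace(" ", "")
    if number.startswith("BG"):
        number = number[2:]
    if len(number) != 9 or not (number.isascii() and number.isdigit()):
        return False
    # First pass: weights 1..8 over the first eight digits, modulo 11.
    check = sum(int(d) * w for d, w in zip(number[:8], range(1, 9))) % 11
    if check == 10:
        # Second pass with weights 3..10; a remainder of 10 becomes 0.
        check = sum(int(d) * w for d, w in zip(number[:8], range(3, 11))) % 11 % 10
    return check == int(number[8])


print(is_valid_bg_vat9("BG 175 074 752"))  # True
print(is_valid_bg_vat9("175074751"))       # False
```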

Belarusian UNP Numbers

Clean and validate a DataFrame column containing Belarusian UNP numbers (UNPs).

dataprep.clean.clean_by_unp.clean_by_unp(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Belarusian UNP numbers (UNPs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of UNP type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of UNP, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of UNP data.

>>> df = pd.DataFrame({
        "unp": [
        '200988541',
        'УНП MA1953684',
        '200988542']
        })
>>> clean_by_unp(df, 'unp')
        unp             unp_clean
0       200988541       200988541
1       УНП MA1953684   MA1953684
2       200988542       NaN
Return type

DataFrame

dataprep.clean.clean_by_unp.validate_by_unp(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid UNP. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Canadian Business Numbers

Clean and validate a DataFrame column containing Canadian Business Numbers (BNs).

dataprep.clean.clean_ca_bn.clean_ca_bn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Canadian Business Numbers (BNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of BN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of BN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.

    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of BN data.

>>> df = pd.DataFrame({
        "bn": [
        '12302 6635',
        '12302 6635 RC 0001',
        '12345678Z']
        })
>>> clean_ca_bn(df, 'bn')
        bn                      bn_clean
0       12302 6635              123026635
1       12302 6635 RC 0001      123026635RC0001
2       12345678Z               NaN
Return type

DataFrame

dataprep.clean.clean_ca_bn.validate_ca_bn(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid BN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
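
The nine-digit core of a Canadian BN is protected by a Luhn check digit, and the example above also shows an optional program account suffix (RC 0001). The sketch below illustrates both under stated assumptions: the helper names are hypothetical, and accepting any two letters followed by four digits as the suffix is a simplification, since the real set of program identifiers is restricted.

```python
import re


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        n = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0


def is_valid_ca_bn(value: str) -> bool:
    """Check a BN: nine Luhn-valid digits plus an optional account suffix."""
    number = value.upper().replace(" ", "").replace("-", "")
    match = re.fullmatch(r"([0-9]{9})([A-Z]{2}[0-9]{4})?", number)
    return bool(match) and luhn_ok(match.group(1))


print(is_valid_ca_bn("12302 6635"))          # True
print(is_valid_ca_bn("12302 6635 RC 0001"))  # True
print(is_valid_ca_bn("12345678Z"))           # False
```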

Swiss Einzahlungsschein mit Referenznummer (ESR) Numbers

Clean and validate a DataFrame column containing Swiss Einzahlungsschein mit Referenznummer (ESR) numbers, the reference numbers printed on Swiss payment slips.

dataprep.clean.clean_ch_esr.clean_ch_esr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Swiss Einzahlungsschein mit Referenznummer (ESR) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of ESR type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of ESR data.

>>> df = pd.DataFrame({
...     "esr": [
...         "18 78583",
...         "210000000003139471430009016"]
... })
>>> clean_ch_esr(df, 'esr')
        esr                             esr_clean
0       18 78583                        00 00000 00000 00000 00018 78583
1       210000000003139471430009016     NaN
Return type

DataFrame

dataprep.clean.clean_ch_esr.validate_ch_esr(df, column='')[source]

Validate if a data cell is ESR in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
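
How an ESR is checked is not stated here. The following standalone sketch assumes the recursive modulo-10 check digit commonly used for Swiss payment references; the carry table and the helper names esr_check_digit and esr_looks_valid are assumptions for illustration and may differ from what dataprep does.

```python
# Carry table of the recursive modulo-10 scheme (assumed to apply to ESR).
CARRY = (0, 9, 4, 6, 8, 2, 7, 1, 3, 5)


def esr_check_digit(payload: str) -> str:
    """Compute the check digit for a string of digits without its check digit."""
    carry = 0
    for char in payload:
        carry = CARRY[(int(char) + carry) % 10]
    return str((10 - carry) % 10)


def esr_looks_valid(value: str) -> bool:
    """Return True if the last digit matches the computed check digit."""
    digits = value.replace(" ", "")
    if not digits.isdigit() or len(digits) > 27:
        return False
    return esr_check_digit(digits[:-1]) == digits[-1]


print(esr_looks_valid("18 78583"))                     # True
print(esr_looks_valid("210000000003139471430009016"))  # False
```

Leading zeros do not change the carry, which is consistent with the first example row being padded to 27 digits in the cleaned output.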

Swiss Social Security Numbers

Clean and validate a DataFrame column containing Swiss social security numbers (SSNs).

dataprep.clean.clean_ch_ssn.clean_ch_ssn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Swiss social security numbers (SSNs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of SSN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of SSN data.

>>> df = pd.DataFrame({
...     "ssn": [
...         '7569217076985',
...         '756.9217.0769.84']
... })
>>> clean_ch_ssn(df, 'ssn')
        ssn                      ssn_clean
0       7569217076985            756.9217.0769.85
1       756.9217.0769.84         NaN
Return type

DataFrame

dataprep.clean.clean_ch_ssn.validate_ch_ssn(df, column='')[source]

Validate if a data cell is SSN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
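
As a rough illustration of what such a check can involve, the sketch below assumes a Swiss SSN is a 13-digit number that starts with 756 and ends with an EAN-13 check digit. Both the rule and the helper name ch_ssn_looks_valid are assumptions made for this example, not taken from dataprep.

```python
def ch_ssn_looks_valid(value: str) -> bool:
    """Check prefix, length and EAN-13 check digit of a Swiss SSN (assumed rule)."""
    digits = value.replace(".", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    if not digits.startswith("756"):
        return False
    # EAN-13: weights 1, 3, 1, 3, ... over the first twelve digits.
    weighted = sum(
        int(char) * (3 if index % 2 else 1)
        for index, char in enumerate(digits[:12])
    )
    return (10 - weighted % 10) % 10 == int(digits[12])


print(ch_ssn_looks_valid("7569217076985"))     # True
print(ch_ssn_looks_valid("756.9217.0769.84"))  # False
```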

Swiss Business Identifiers

Clean and validate a DataFrame column containing Swiss business identifiers (UIDs).

dataprep.clean.clean_ch_uid.clean_ch_uid(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Swiss business identifiers (UIDs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of UID type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of UID data.

>>> df = pd.DataFrame({
...     "uid": [
...         'CHE100155212',
...         'CHE-100.155.213']
... })
>>> clean_ch_uid(df, 'uid')
        uid                      uid_clean
0       CHE100155212             CHE-100.155.212
1       CHE-100.155.213          NaN
Return type

DataFrame

dataprep.clean.clean_ch_uid.validate_ch_uid(df, column='')[source]

Validate if a data cell is UID in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
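
The sketch below shows one way a UID could be checked. It assumes the number is the prefix CHE followed by nine digits, with the last digit a weighted modulo-11 check digit; the weights and the helper name ch_uid_looks_valid are assumptions for illustration only.

```python
# Assumed weights for the first eight digits of a UID.
UID_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4)


def ch_uid_looks_valid(value: str) -> bool:
    """Check the CHE prefix and the modulo-11 check digit (assumed rule)."""
    compact = value.upper()
    for separator in ("-", ".", " "):
        compact = compact.replace(separator, "")
    if len(compact) != 12 or not compact.startswith("CHE"):
        return False
    digits = compact[3:]
    if not digits.isdigit():
        return False
    total = sum(weight * int(char) for weight, char in zip(UID_WEIGHTS, digits))
    # A computed value of 10 never matches a single digit, so it is rejected.
    return str((11 - total % 11) % 11) == digits[8]


print(ch_uid_looks_valid("CHE100155212"))     # True
print(ch_uid_looks_valid("CHE-100.155.213"))  # False
```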

Swiss VAT Numbers

Clean and validate a DataFrame column containing Swiss VAT numbers (VATs).

dataprep.clean.clean_ch_vat.clean_ch_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Swiss VAT numbers (VATs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of VAT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
...     "vat": [
...         'CHE107787577IVA',
...         'CHE-107.787.578 IVA']
... })
>>> clean_ch_vat(df, 'vat')
        vat                      vat_clean
0       CHE107787577IVA          CHE-107.787.577 IVA
1       CHE-107.787.578 IVA      NaN
Return type

DataFrame

dataprep.clean.clean_ch_vat.validate_ch_vat(df, column='')[source]

Validate if a data cell is VAT in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Chile RUT/RUN Numbers

Clean and validate a DataFrame column containing Chile RUT/RUN numbers (RUTs).

dataprep.clean.clean_cl_rut.clean_cl_rut(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Chile RUT/RUN numbers (RUTs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of RUT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RUT data.

>>> df = pd.DataFrame({
...     "rut": [
...         "125319092",
...         "76086A28-5"]
... })
>>> clean_cl_rut(df, 'rut')
        rut               rut_clean
0       125319092         12.531.909-2
1       76086A28-5        NaN
Return type

DataFrame

dataprep.clean.clean_cl_rut.validate_cl_rut(df, column='')[source]

Validate if a data cell is RUT in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
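
For readers unfamiliar with the RUT check character, here is a minimal sketch. It assumes the usual modulo-11 scheme with weights 2 to 7 cycling from the rightmost digit and K standing for 10; the helper name cl_rut_looks_valid is hypothetical and the code is not taken from dataprep.

```python
def cl_rut_looks_valid(value: str) -> bool:
    """Check the modulo-11 check character of a Chilean RUT (assumed rule)."""
    compact = value.upper()
    for separator in (".", "-", " "):
        compact = compact.replace(separator, "")
    body, check = compact[:-1], compact[-1:]
    if not body.isdigit():
        return False
    # Weights 2, 3, 4, 5, 6, 7 repeat from the rightmost body digit.
    total = sum(
        int(char) * (2 + index % 6)
        for index, char in enumerate(reversed(body))
    )
    expected = "0123456789K"[(11 - total % 11) % 11]
    return check == expected


print(cl_rut_looks_valid("125319092"))     # True
print(cl_rut_looks_valid("12.531.909-2"))  # True
print(cl_rut_looks_valid("76086A28-5"))    # False
```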

Chinese Resident Identity Card Numbers

Clean and validate a DataFrame column containing Chinese Resident Identity Card numbers (RICs).

dataprep.clean.clean_cn_ric.clean_cn_ric(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Chinese Resident Identity Card numbers (RICs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of RIC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birth date of the person. If output_format = ‘birthplace’, return the place of birth of the person. Note: in the case of RIC, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RIC data.

>>> df = pd.DataFrame({
...     "ric": [
...         "360426199101010071",
...         "99999999999"]
... })
>>> clean_cn_ric(df, 'ric')
        ric                     ric_clean
0       360426199101010071      360426199101010071
1       99999999999             NaN
Return type

DataFrame

dataprep.clean.clean_cn_ric.validate_cn_ric(df, column='')[source]

Validate if a data cell is RIC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
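
The ‘birthdate’ output format and the validity check can be pictured with the sketch below. It assumes an 18-character RIC whose characters 7 to 14 hold the birth date as YYYYMMDD and whose last character is an ISO 7064 MOD 11-2 check character; the helper names are hypothetical and the code is an illustration, not dataprep's implementation.

```python
import datetime

# Assumed ISO 7064 MOD 11-2 weights for the first 17 digits.
RIC_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
RIC_CHECK_CHARS = "10X98765432"


def cn_ric_looks_valid(value: str) -> bool:
    """Check length, digits and check character of an RIC (assumed rule)."""
    compact = value.strip().upper()
    if len(compact) != 18 or not compact[:17].isdigit():
        return False
    total = sum(weight * int(char) for weight, char in zip(RIC_WEIGHTS, compact))
    return RIC_CHECK_CHARS[total % 11] == compact[17]


def cn_ric_birthdate(value: str) -> datetime.date:
    """Read the birth date from characters 7 to 14 (assumed layout)."""
    compact = value.strip()
    return datetime.date(
        int(compact[6:10]), int(compact[10:12]), int(compact[12:14])
    )


print(cn_ric_looks_valid("360426199101010071"))  # True
print(cn_ric_birthdate("360426199101010071"))    # 1991-01-01
print(cn_ric_looks_valid("99999999999"))         # False
```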

Colombian Identity Codes

Clean and validate a DataFrame column containing Colombian identity codes (NITs).

dataprep.clean.clean_co_nit.clean_co_nit(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Colombian identity codes (NITs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of NIT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIT data.

>>> df = pd.DataFrame({
...     "nit": [
...         "2131234321",
...         "2131234325"]
... })
>>> clean_co_nit(df, 'nit')
        nit                nit_clean
0       2131234321         213.123.432-1
1       2131234325         NaN
Return type

DataFrame

dataprep.clean.clean_co_nit.validate_co_nit(df, column='')[source]

Validate if a data cell is NIT in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
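
A possible reading of the NIT check is sketched below, assuming a modulo-11 check digit computed with a fixed list of prime weights applied from the rightmost body digit. The weights, the lookup string and the helper name co_nit_looks_valid are assumptions for this example only.

```python
# Assumed weights, applied from the rightmost digit of the body.
NIT_WEIGHTS = (3, 7, 13, 17, 19, 23, 29, 37, 41, 43, 47, 53, 59, 67, 71)


def co_nit_looks_valid(value: str) -> bool:
    """Check the trailing check digit of a Colombian NIT (assumed rule)."""
    compact = value
    for separator in (".", "-", " "):
        compact = compact.replace(separator, "")
    if not compact.isdigit() or not 8 <= len(compact) <= 16:
        return False
    body, check = compact[:-1], compact[-1]
    total = sum(
        weight * int(char) for weight, char in zip(NIT_WEIGHTS, reversed(body))
    )
    return "01987654321"[total % 11] == check


print(co_nit_looks_valid("2131234321"))     # True
print(co_nit_looks_valid("213.123.432-1"))  # True
print(co_nit_looks_valid("2131234325"))     # False
```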

Costa Rica Physical Person ID Numbers

Clean and validate a DataFrame column containing Costa Rica physical person ID numbers (CPFs).

dataprep.clean.clean_cr_cpf.clean_cr_cpf(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Costa Rica physical person ID numbers (CPFs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of CPF type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CPF data.

>>> df = pd.DataFrame({
...     "cpf": [
...         "1-613-584",
...         "30-1234-1234"]
... })
>>> clean_cr_cpf(df, 'cpf')
        cpf              cpf_clean
0       1-613-584        01-0613-0584
1       30-1234-1234     NaN
Return type

DataFrame

dataprep.clean.clean_cr_cpf.validate_cr_cpf(df, column='')[source]

Validate if a data cell is CPF in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
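
The first row of the CPF example above suggests that the ‘standard’ format zero-pads the three hyphen-separated groups to 2, 4 and 4 digits. The sketch below reproduces only that padding step; the helper name cr_cpf_pad_groups is hypothetical, and the sketch does not attempt the validity rules that reject the second row.

```python
def cr_cpf_pad_groups(value: str) -> str:
    """Zero-pad a hyphen-separated CPF to the 2-4-4 layout seen in the example."""
    parts = value.strip().split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected three digit groups, got {value!r}")
    widths = (2, 4, 4)
    return "-".join(part.zfill(width) for part, width in zip(parts, widths))


print(cr_cpf_pad_groups("1-613-584"))  # 01-0613-0584
```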

Costa Rica Tax Numbers

Clean and validate a DataFrame column containing Costa Rica tax numbers (CPJs).

dataprep.clean.clean_cr_cpj.clean_cr_cpj(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Costa Rica tax numbers (CPJs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of CPJ type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CPJ data.

>>> df = pd.DataFrame({
...     "cpj": [
...         "4 000 042138",
...         "3-534-123559"]
... })
>>> clean_cr_cpj(df, 'cpj')
        cpj                 cpj_clean
0       4 000 042138        4-000-042138
1       3-534-123559        NaN
Return type

DataFrame

dataprep.clean.clean_cr_cpj.validate_cr_cpj(df, column='')[source]

Validate if a data cell is CPJ in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Costa Rica Foreigners ID Numbers

Clean and validate a DataFrame column containing Costa Rica foreigners ID numbers (CRs).

dataprep.clean.clean_cr_cr.clean_cr_cr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Costa Rica foreigners ID numbers (CRs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of CR type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CR, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CR data.

>>> df = pd.DataFrame({
...     "cr": [
...         '122200569906',
...         '12345678']
... })
>>> clean_cr_cr(df, 'cr')
        cr                      cr_clean
0       122200569906            122200569906
1       12345678                NaN
Return type

DataFrame

dataprep.clean.clean_cr_cr.validate_cr_cr(df, column='')[source]

Validate if a data cell is CR in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Cuban Identity Card Numbers

Clean and validate a DataFrame column containing Cuban identity card numbers (NIs).

dataprep.clean.clean_cu_ni.clean_cu_ni(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Cuban identity card numbers (NIs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of NI type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the date of birth. If output_format = ‘gender’, return the gender (M/F) from the person’s NI. Note: in the case of NI, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NI data.

>>> df = pd.DataFrame({
...     "ni": [
...         '91021027775',
...         '9102102777A']
... })
>>> clean_cu_ni(df, 'ni')
        ni                      ni_clean
0       91021027775             91021027775
1       9102102777A             NaN
Return type

DataFrame

dataprep.clean.clean_cu_ni.validate_cu_ni(df, column='')[source]

Validate if a data cell is NI in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
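
The ‘birthdate’ and ‘gender’ output formats of clean_cu_ni imply that both values are encoded in the number. The sketch below shows one plausible decoding, assuming the first six digits are YYMMDD, the seventh digit selects the century, and the tenth digit is odd for female and even for male. All of these layout rules and both helper names are assumptions for illustration; dataprep's actual behavior may differ.

```python
import datetime


def cu_ni_birthdate(number: str) -> datetime.date:
    """Decode the birth date from an 11-digit NI (assumed layout)."""
    year = int(number[0:2])
    century_digit = number[6]
    # Assumed mapping: 9 -> 1800s, 0-5 -> 1900s, 6-8 -> 2000s.
    if century_digit == "9":
        year += 1800
    elif "0" <= century_digit <= "5":
        year += 1900
    else:
        year += 2000
    return datetime.date(year, int(number[2:4]), int(number[4:6]))


def cu_ni_gender(number: str) -> str:
    """Return 'F' when the tenth digit is odd, else 'M' (assumed rule)."""
    return "F" if int(number[9]) % 2 else "M"


print(cu_ni_birthdate("91021027775"))  # 1991-02-10
print(cu_ni_gender("91021027775"))     # F
```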

Cypriot VAT Numbers

Clean and validate a DataFrame column containing Cypriot VAT numbers (VATs).

dataprep.clean.clean_cy_vat.clean_cy_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Cypriot VAT numbers (VATs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of VAT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
...     "vat": [
...         'CY-10259033P',
...         'CY-10259033Z']
... })
>>> clean_cy_vat(df, 'vat')
        vat                      vat_clean
0       CY-10259033P             10259033P
1       CY-10259033Z             NaN
Return type

DataFrame

dataprep.clean.clean_cy_vat.validate_cy_vat(df, column='')[source]

Validate if a data cell is VAT in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
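
To make the check letter in the example concrete, here is a sketch that assumes a Cypriot VAT number is eight digits followed by a letter derived from a translated sum modulo 26. The translation table and the helper names are assumptions for this illustration and are not taken from dataprep.

```python
import string

# Assumed translation for digits in the 1st, 3rd, 5th and 7th positions.
ODD_POSITION_VALUES = {
    "0": 1, "1": 0, "2": 5, "3": 7, "4": 9,
    "5": 13, "6": 15, "7": 17, "8": 19, "9": 21,
}


def cy_vat_check_letter(digits: str) -> str:
    """Compute the check letter for the eight leading digits (assumed rule)."""
    total = sum(ODD_POSITION_VALUES[char] for char in digits[::2])
    total += sum(int(char) for char in digits[1::2])
    return string.ascii_uppercase[total % 26]


def cy_vat_looks_valid(value: str) -> bool:
    """Strip an optional CY prefix and compare the check letter."""
    compact = value.upper().replace(" ", "").replace("-", "")
    if compact.startswith("CY"):
        compact = compact[2:]
    if len(compact) != 9 or not compact[:8].isdigit():
        return False
    return cy_vat_check_letter(compact[:8]) == compact[8]


print(cy_vat_looks_valid("CY-10259033P"))  # True
print(cy_vat_looks_valid("CY-10259033Z"))  # False
```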

Czech VAT Numbers

Clean and validate a DataFrame column containing Czech VAT numbers (DICs).

dataprep.clean.clean_cz_dic.clean_cz_dic(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Czech VAT numbers (DICs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of DIC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of DIC, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of DIC data.

>>> df = pd.DataFrame({
...     "dic": [
...         'CZ 25123891',
...         '25123890']
... })
>>> clean_cz_dic(df, 'dic')
        dic                  dic_clean
0       CZ 25123891          25123891
1       25123890             NaN
Return type

DataFrame

dataprep.clean.clean_cz_dic.validate_cz_dic(df, column='')[source]

Validate if a data cell is DIC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
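
The sketch below covers only the eight-digit form of a DIC, as used in the example, and assumes a modulo-11 check digit with weights 8 down to 2. Longer DIC forms are not handled. The rule and the helper name cz_dic_8_looks_valid are assumptions for illustration.

```python
def cz_dic_8_looks_valid(value: str) -> bool:
    """Check an eight-digit DIC with an optional CZ prefix (assumed rule)."""
    compact = value.upper().replace(" ", "")
    if compact.startswith("CZ"):
        compact = compact[2:]
    if len(compact) != 8 or not compact.isdigit():
        return False
    # Weights 8, 7, ..., 2 over the first seven digits.
    total = sum((8 - index) * int(char) for index, char in enumerate(compact[:7]))
    return (11 - total % 11) % 10 == int(compact[7])


print(cz_dic_8_looks_valid("CZ 25123891"))  # True
print(cz_dic_8_looks_valid("25123890"))     # False
```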

Czech Birth Numbers

Clean and validate a DataFrame column containing Czech birth numbers (RCs).

dataprep.clean.clean_cz_rc.clean_cz_rc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Czech birth numbers (RCs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of RC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RC data.

>>> df = pd.DataFrame({
...     "rc": [
...         "7103192745",
...         "7103192746"]
... })
>>> clean_cz_rc(df, 'rc')
        rc                 rc_clean
0       7103192745         710319/2745
1       7103192746         NaN
Return type

DataFrame

dataprep.clean.clean_cz_rc.validate_cz_rc(df, column='')[source]

Validate if a data cell is RC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
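
As a simplified picture of the check, the sketch below handles ten-digit birth numbers only and assumes the last digit equals the first nine digits taken modulo 11, reduced modulo 10. Date checks and the older nine-digit form are left out. This is an assumed rule with a hypothetical helper name, not dataprep's code.

```python
def cz_rc_10_looks_valid(value: str) -> bool:
    """Check the final digit of a ten-digit RC (simplified, assumed rule)."""
    compact = value.replace("/", "").replace(" ", "")
    if len(compact) != 10 or not compact.isdigit():
        return False
    return int(compact[:9]) % 11 % 10 == int(compact[9])


print(cz_rc_10_looks_valid("7103192745"))   # True
print(cz_rc_10_looks_valid("710319/2745"))  # True
print(cz_rc_10_looks_valid("7103192746"))   # False
```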

German Company Registry IDs

Clean and validate a DataFrame column containing German company registry IDs (Handelsregisternummern).

dataprep.clean.clean_de_handelsregisternummer.clean_de_handelsregisternummer(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean German company registry IDs (Handelsregisternummern) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of handelsregisternummer type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of handelsregisternummer, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of handelsregisternummer data.

>>> df = pd.DataFrame({
...     "handelsregisternummer": [
...         'Aachen HRA 11223',
...         'Aachen HRC 44123']
... })
>>> clean_de_handelsregisternummer(df, 'handelsregisternummer')
        handelsregisternummer   handelsregisternummer_clean
0       Aachen HRA 11223        Aachen HRA 11223
1       Aachen HRC 44123        NaN
Return type

DataFrame

dataprep.clean.clean_de_handelsregisternummer.validate_de_handelsregisternummer(df, column='')[source]

Validate if a data cell is handelsregisternummer in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

German Personal Tax Numbers

Clean and validate a DataFrame column containing German personal tax numbers (IDNRs).

dataprep.clean.clean_de_idnr.clean_de_idnr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean German personal tax numbers (IDNRs) type data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of IDNR type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IDNR data.

>>> df = pd.DataFrame({
...     "idnr": [
...         "36574261809",
...         "36554266806"]
... })
>>> clean_de_idnr(df, 'idnr')
        idnr                idnr_clean
0       36574261809         36 574 261 809
1       36554266806         NaN
Return type

DataFrame

dataprep.clean.clean_de_idnr.validate_de_idnr(df, column='')[source]

Validate if a data cell is IDNR in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
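
The two rows of the IDNR example differ in more than a check digit, which a sketch can make visible. The code below assumes two rules: among the first ten digits exactly one digit occurs two or three times and every other digit at most once, and the full eleven digits satisfy an ISO 7064 MOD 11,10 checksum. Both rules and the helper names are assumptions for illustration rather than dataprep's implementation.

```python
from collections import Counter


def mod_11_10_checksum(number: str) -> int:
    """ISO 7064 MOD 11,10 checksum; a valid number yields 1."""
    check = 5
    for char in number:
        check = (((check or 10) * 2) % 11 + int(char)) % 10
    return check


def de_idnr_looks_valid(value: str) -> bool:
    """Apply the assumed repetition rule and checksum to an IDNR."""
    compact = value
    for separator in (" ", "-", ".", "/"):
        compact = compact.replace(separator, "")
    if len(compact) != 11 or not compact.isdigit() or compact[0] == "0":
        return False
    # Exactly one digit may repeat (twice or three times) in the first ten.
    counts = sorted(Counter(compact[:10]).values())
    if counts not in ([1] * 8 + [2], [1] * 7 + [3]):
        return False
    return mod_11_10_checksum(compact) == 1


print(de_idnr_looks_valid("36574261809"))  # True
print(de_idnr_looks_valid("36554266806"))  # False
```

Under these assumptions the second number passes the checksum but fails the repetition rule, since both 5 and 6 repeat.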

German Tax Numbers

Clean and validate a DataFrame column containing German tax numbers (STNRs).

dataprep.clean.clean_de_stnr.clean_de_stnr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean German tax numbers (STNRs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of STNR type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of STNR data.

>>> df = pd.DataFrame({
        "stnr": [
        "181/815/0815 5",
        "136695978"]
        })
>>> clean_de_stnr(df, 'stnr')
        stnr                 stnr_clean
0       181/815/0815 5       181/815/08155
1       136695978            NaN
Return type

DataFrame

dataprep.clean.clean_de_stnr.validate_de_stnr(df, column='', region=None)[source]

Validate if a data cell is STNR in a DataFrame column. For each cell, return True or False. The region can be supplied to verify that the number is assigned in that region.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

  • region (Optional[str]) –

    Specify the region that the number belongs to.

    (default: None)

Return type

Union[bool, Series, DataFrame]

German VAT Numbers

Clean and validate a DataFrame column containing German VAT numbers (VATs).

dataprep.clean.clean_de_vat.clean_de_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean German VAT numbers (VATs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of VAT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
        "vat": [
        'DE 136,695 976',
        '136695978']
        })
>>> clean_de_vat(df, 'vat')
        vat                 vat_clean
0       DE 136,695 976      136695976
1       136695978           NaN
Return type

DataFrame

dataprep.clean.clean_de_vat.validate_de_vat(df, column='')[source]

Validate if a data cell is VAT in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
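
German VAT numbers are commonly described as nine digits whose last digit is an ISO 7064 Mod 11, 10 check digit. Under that assumption, the sketch below shows why the first example value passes and the second fails. The helper names mod_11_10_checksum and looks_like_de_vat are hypothetical, and DataPrep's own validator may apply further rules.

```python
def mod_11_10_checksum(digits):
    """ISO 7064 Mod 11, 10 checksum; a number with a correct check digit yields 1."""
    check = 5
    for ch in digits:
        check = (((check or 10) * 2) % 11 + int(ch)) % 10
    return check


def looks_like_de_vat(value):
    # Drop common separators and an optional 'DE' country prefix.
    compact = "".join(c for c in value if c not in " -./,").upper()
    if compact.startswith("DE"):
        compact = compact[2:]
    if len(compact) != 9 or not compact.isdigit():
        return False
    return mod_11_10_checksum(compact) == 1


print(looks_like_de_vat("DE 136,695 976"))  # True
print(looks_like_de_vat("136695978"))       # False
```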

German Securities Identification Codes

Clean and validate a DataFrame column containing German Securities Identification Codes (WKNs).

dataprep.clean.clean_de_wkn.clean_de_wkn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Wertpapierkennnummer (WKNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of WKN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘isin’, convert the number to an ISIN. Note: in the case of WKN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of WKN data.

>>> df = pd.DataFrame({
        "wkn": [
        'A0MNRK',
        'AOMNRK']
        })
>>> clean_de_wkn(df, 'wkn')
        wkn             wkn_clean
0       A0MNRK          A0MNRK
1       AOMNRK          NaN
Return type

DataFrame

dataprep.clean.clean_de_wkn.validate_de_wkn(df, column='')[source]

Validate if a data cell is WKN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
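
The ‘isin’ output format converts a WKN to an ISIN. A German ISIN is conventionally built from the prefix DE000, the six-character WKN, and a Luhn check digit computed after expanding letters to numbers (A=10 ... Z=35). The following is a minimal sketch under that assumption; wkn_to_isin and isin_check_digit are hypothetical names, not DataPrep functions, and no WKN validation is performed.

```python
def isin_check_digit(body):
    # Expand letters to numbers (A=10 ... Z=35), then apply the Luhn algorithm.
    digits = "".join(str(int(ch, 36)) for ch in body)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:  # double every second digit, starting from the rightmost
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def wkn_to_isin(wkn):
    body = "DE000" + wkn.strip().upper()
    return body + isin_check_digit(body)


print(wkn_to_isin("SKWM02"))  # DE000SKWM021
```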

Danish Citizen Numbers

Clean and validate a DataFrame column containing Danish citizen numbers (CPRs).

dataprep.clean.clean_dk_cpr.clean_dk_cpr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Danish citizen numbers (CPRs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CPR type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, split the number and return the birth date.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CPR data.

>>> df = pd.DataFrame({
        "cpr": [
        "2110625629",
        "511062-5629"]
        })
>>> clean_dk_cpr(df, 'cpr')
        cpr                 cpr_clean
0       2110625629          211062-5629
1       511062-5629         NaN
Return type

DataFrame

dataprep.clean.clean_dk_cpr.validate_dk_cpr(df, column='')[source]

Validate if a data cell is CPR in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Danish CVR Numbers

Clean and validate a DataFrame column containing Danish CVR numbers (CVRs).

dataprep.clean.clean_dk_cvr.clean_dk_cvr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Danish CVR numbers (CVRs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CVR type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CVR, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CVR data.

>>> df = pd.DataFrame({
        "cvr": [
        'DK 13585628',
        'DK 13585627']
        })
>>> clean_dk_cvr(df, 'cvr')
        cvr             cvr_clean
0       DK 13585628     13585628
1       DK 13585627     NaN
Return type

DataFrame

dataprep.clean.clean_dk_cvr.validate_dk_cvr(df, column='')[source]

Validate if a data cell is CVR in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
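
Danish CVR numbers are commonly validated with a weighted modulo-11 checksum over their eight digits, using the weights 2, 7, 6, 5, 4, 3, 2, 1. Assuming that rule, the sketch below reproduces the outcome of the example above. The name looks_like_dk_cvr is hypothetical, and the real validator may check more than this.

```python
CVR_WEIGHTS = (2, 7, 6, 5, 4, 3, 2, 1)


def looks_like_dk_cvr(value):
    # Drop spaces, hyphens and an optional 'DK' country prefix.
    compact = value.replace(" ", "").replace("-", "").upper()
    if compact.startswith("DK"):
        compact = compact[2:]
    if len(compact) != 8 or not compact.isdigit():
        return False
    # The weighted sum of all eight digits must be divisible by 11.
    return sum(w * int(d) for w, d in zip(CVR_WEIGHTS, compact)) % 11 == 0


print(looks_like_dk_cvr("DK 13585628"))  # True
print(looks_like_dk_cvr("DK 13585627"))  # False
```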

Dominican Republic National Identifiers

Clean and validate a DataFrame column containing Dominican Republic national identifiers (Cedulas).

dataprep.clean.clean_do_cedula.clean_do_cedula(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Dominican Republic national identifiers (Cedulas) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of Cedula type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of Cedula data.

>>> df = pd.DataFrame({
        "cedula": [
        "22400022111",
        "0011391820A"]
        })
>>> clean_do_cedula(df, 'cedula')
        cedula              cedula_clean
0       22400022111         224-0002211-1
1       0011391820A         NaN
Return type

DataFrame

dataprep.clean.clean_do_cedula.validate_do_cedula(df, column='')[source]

Validate if a data cell is Cedula in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
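
A Cedula is commonly described as eleven digits whose last digit is a Luhn check digit, written in the standard format as groups of 3, 7 and 1 digits. The sketch below assumes exactly that and nothing more; looks_like_do_cedula and format_do_cedula are hypothetical names, and the library may treat some numbers differently.

```python
def luhn_is_valid(digits):
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_do_cedula(value):
    compact = value.replace(" ", "").replace("-", "")
    return len(compact) == 11 and compact.isdigit() and luhn_is_valid(compact)


def format_do_cedula(value):
    compact = value.replace(" ", "").replace("-", "")
    return "-".join([compact[:3], compact[3:10], compact[10:]])


print(looks_like_do_cedula("22400022111"), format_do_cedula("22400022111"))
print(looks_like_do_cedula("0011391820A"))
```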

Dominican Republic Invoice Numbers

Clean and validate a DataFrame column containing Dominican Republic invoice numbers (NCFs).

dataprep.clean.clean_do_ncf.clean_do_ncf(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Dominican Republic invoice numbers (NCFs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of NCF type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of NCF, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NCF data.

>>> df = pd.DataFrame({
        "ncf": [
        'E310000000005',
        'Z0100000005']
        })
>>> clean_do_ncf(df, 'ncf')
        ncf                 ncf_clean
0       E310000000005       E310000000005
1       Z0100000005         NaN
Return type

DataFrame

dataprep.clean.clean_do_ncf.validate_do_ncf(df, column='')[source]

Validate if a data cell is NCF in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Dominican Republic Tax Registrations

Clean and validate a DataFrame column containing Dominican Republic tax registrations (RNCs).

dataprep.clean.clean_do_rnc.clean_do_rnc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Dominican Republic tax registrations (RNCs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of RNC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RNC data.

>>> df = pd.DataFrame({
        "rnc": [
        "131246796",
        "1018A0043"]
        })
>>> clean_do_rnc(df, 'rnc')
        rnc               rnc_clean
0       131246796         1-31-24679-6
1       1018A0043         NaN
Return type

DataFrame

dataprep.clean.clean_do_rnc.validate_do_rnc(df, column='')[source]

Validate if a data cell is RNC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Ecuadorian Personal Identity Codes

Clean and validate a DataFrame column containing Ecuadorian personal identity codes (CIs).

dataprep.clean.clean_ec_ci.clean_ec_ci(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Ecuadorian personal identity codes (CIs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CI type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CI, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CI data.

>>> df = pd.DataFrame({
        "ci": [
        '171430710-3',
        'BE431150351']
        })
>>> clean_ec_ci(df, 'ci')
        ci             ci_clean
0       171430710-3    1714307103
1       BE431150351    NaN
Return type

DataFrame

dataprep.clean.clean_ec_ci.validate_ec_ci(df, column='')[source]

Validate if a data cell is CI in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Ecuadorian Company Tax Numbers

Clean and validate a DataFrame column containing Ecuadorian company tax numbers (RUCs).

dataprep.clean.clean_ec_ruc.clean_ec_ruc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Ecuadorian company tax numbers (RUCs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of RUC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of RUC, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RUC data.

>>> df = pd.DataFrame({
        "ruc": [
        '1792060346-001',
        '1763154690001']
        })
>>> clean_ec_ruc(df, 'ruc')
        ruc                 ruc_clean
0       1792060346-001      1792060346001
1       1763154690001       NaN
Return type

DataFrame

dataprep.clean.clean_ec_ruc.validate_ec_ruc(df, column='')[source]

Validate if a data cell is RUC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Estonian Personal ID Numbers

Clean and validate a DataFrame column containing Estonian Personal ID numbers (IKs).

dataprep.clean.clean_ee_ik.clean_ee_ik(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Estonian Personal ID numbers (IKs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of IK type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of IK, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IK data.

>>> df = pd.DataFrame({
        "ik": [
        '36805280109',
        '36805280108']
        })
>>> clean_ee_ik(df, 'ik')
        ik              ik_clean
0       36805280109     36805280109
1       36805280108     NaN
Return type

DataFrame

dataprep.clean.clean_ee_ik.validate_ee_ik(df, column='')[source]

Validate if a data cell is IK in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
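
The ‘birthdate’ and ‘gender’ output formats rely on the structure of the Estonian personal code: the first digit encodes century and gender (odd for male, even for female), digits 2 to 7 hold the birth date as YYMMDD, and the last digit is a weighted modulo-11 check digit. The sketch below assumes that commonly documented layout; parse_ik and ik_check_digit are hypothetical helpers, not part of DataPrep.

```python
from datetime import date


def ik_check_digit(first_ten):
    """Weighted modulo-11 check digit, with a second pass when the remainder is 10."""
    for weights in ((1, 2, 3, 4, 5, 6, 7, 8, 9, 1), (3, 4, 5, 6, 7, 8, 9, 1, 2, 3)):
        remainder = sum(w * int(d) for w, d in zip(weights, first_ten)) % 11
        if remainder < 10:
            return remainder
    return 0


def parse_ik(ik):
    """Return (gender, birthdate) for a plausible IK, or None if a check fails."""
    if len(ik) != 11 or not ik.isdigit() or ik[0] not in "12345678":
        return None
    if ik_check_digit(ik[:10]) != int(ik[10]):
        return None
    first = int(ik[0])
    gender = "M" if first % 2 else "F"
    century = 1800 + 100 * ((first - 1) // 2)  # 1,2 -> 1800s; 3,4 -> 1900s; ...
    try:
        birthdate = date(century + int(ik[1:3]), int(ik[3:5]), int(ik[5:7]))
    except ValueError:  # impossible calendar date
        return None
    return gender, birthdate


print(parse_ik("36805280109"))
print(parse_ik("36805280108"))
```

The first call yields a male holder born on 28 May 1968; the second returns None because the final digit does not match the computed check digit.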

Estonian KMKR Numbers

Clean and validate a DataFrame column containing Estonian KMKR numbers (KMKRs).

dataprep.clean.clean_ee_kmkr.clean_ee_kmkr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Estonian KMKR numbers (KMKRs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of KMKR type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of KMKR, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of KMKR data.

>>> df = pd.DataFrame({
        "kmkr": [
        'EE 100 931 558',
        '100594103']
        })
>>> clean_ee_kmkr(df, 'kmkr')
        kmkr                kmkr_clean
0       EE 100 931 558      100931558
1       100594103           NaN
Return type

DataFrame

dataprep.clean.clean_ee_kmkr.validate_ee_kmkr(df, column='')[source]

Validate if a data cell is KMKR in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Spanish Bank Account Codes

Clean and validate a DataFrame column containing Spanish Bank Account Codes (CCCs).

dataprep.clean.clean_es_ccc.clean_es_ccc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Spanish Bank Account Codes (CCCs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CCC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘iban’, convert the number to an IBAN.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CCC data.

>>> df = pd.DataFrame({
        "ccc": [
        "12341234161234567890",
        "134-1234-16 1234567890"]
        })
>>> clean_es_ccc(df, 'ccc')
        ccc                         ccc_clean
0       12341234161234567890        1234 1234 16 12345 67890
1       134-1234-16 1234567890      NaN
Return type

DataFrame

dataprep.clean.clean_es_ccc.validate_es_ccc(df, column='')[source]

Validate if a data cell is CCC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
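
The ‘iban’ output format converts a CCC to an IBAN. By the general IBAN rule, the two check digits are 98 minus the remainder modulo 97 of the account number followed by the country code written as numbers (E=14, S=28) and "00". A minimal sketch of that arithmetic follows; ccc_to_iban is a hypothetical name, and the sketch does not verify the CCC's own internal check digits.

```python
def ccc_to_iban(ccc):
    compact = ccc.replace(" ", "").replace("-", "")
    if len(compact) != 20 or not compact.isdigit():
        raise ValueError(f"not a 20-digit CCC: {ccc!r}")
    # 'ES00' moved to the end and spelled as digits: E=14, S=28, then 00.
    remainder = int(compact + "142800") % 97
    check = 98 - remainder
    return f"ES{check:02d}{compact}"


print(ccc_to_iban("12341234161234567890"))  # ES7712341234161234567890
```

The result agrees with the IBAN ES77 1234 1234 1612 3456 7890 used in the Spanish IBAN example in this reference.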

Spanish Fiscal Numbers

Clean and validate a DataFrame column containing Spanish fiscal numbers (CIFs).

dataprep.clean.clean_es_cif.clean_es_cif(df, column, output_format='standard', split=False, inplace=False, errors='coerce', progress=True)[source]

Clean Spanish fiscal numbers (CIFs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CIF type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CIF, the compact format is the same as the standard one.

    (default: “standard”)

  • split (bool) –

    If True, each component derived from the number string will be put into its own column.

    (default: False)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CIF data.

>>> df = pd.DataFrame({
        "cif": [
        'A13 585 625',
        'M-1234567-L']
        })
>>> clean_es_cif(df, 'cif')
        cif             cif_clean
0       A13 585 625     A13585625
1       M-1234567-L     NaN
Return type

DataFrame

dataprep.clean.clean_es_cif.validate_es_cif(df, column='')[source]

Validate if a data cell is CIF in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Spanish Meter Point Numbers

Clean and validate a DataFrame column containing Spanish meter point numbers (CUPSs).

dataprep.clean.clean_es_cups.clean_es_cups(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Spanish meter point numbers (CUPSs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CUPS type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CUPS data.

>>> df = pd.DataFrame({
        "cups": [
        "ES1234123456789012JY1F",
        "ES 1234-123456789012-XY 1F"]
        })
>>> clean_es_cups(df, 'cups')
        cups                            cups_clean
0       ES1234123456789012JY1F          ES 1234 1234 5678 9012 JY 1F
1       ES 1234-123456789012-XY 1F      NaN
Return type

DataFrame

dataprep.clean.clean_es_cups.validate_es_cups(df, column='')[source]

Validate if a data cell is CUPS in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Spanish Personal Identity Codes

Clean and validate a DataFrame column containing Spanish personal identity codes (DNIs).

dataprep.clean.clean_es_dni.clean_es_dni(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Spanish personal identity codes (DNIs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of DNI type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of DNI, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of DNI data.

>>> df = pd.DataFrame({
        "dni": [
        '54362315-K',
        '54362315']
        })
>>> clean_es_dni(df, 'dni')
        dni             dni_clean
0       54362315-K      54362315K
1       54362315        NaN
Return type

DataFrame

dataprep.clean.clean_es_dni.validate_es_dni(df, column='')[source]

Validate if a data cell is DNI in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
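The DNI check letter is commonly derived from the eight-digit number taken modulo 23. The pure-Python sketch below illustrates that rule on the sample values from the example; `DNI_LETTERS`, `compact_dni` and `is_valid_dni` are hypothetical names chosen for illustration and are not part of DataPrep, whose implementation may differ.

```python
import re

# Assumption: the check letter is looked up by the 8-digit number modulo 23.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def compact_dni(value: str) -> str:
    """Drop whitespace and hyphens and upper-case the value."""
    return re.sub(r"[\s-]", "", value).upper()


def is_valid_dni(value: str) -> bool:
    """Return True if value is 8 digits followed by the matching check letter."""
    number = compact_dni(value)
    if not re.fullmatch(r"\d{8}[A-Z]", number):
        return False
    return DNI_LETTERS[int(number[:8]) % 23] == number[8]


print(is_valid_dni("54362315-K"))  # True
print(is_valid_dni("54362315"))    # False (check letter missing)
```

Under these assumptions, 54362315 modulo 23 is 21, which selects the letter K.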

Spanish IBANs

Clean and validate a DataFrame column containing Spanish IBANs.

dataprep.clean.clean_es_iban.clean_es_iban(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Spanish IBAN data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing IBAN data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘ccc’, return the CCC (Código Cuenta Corriente) part of the number.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IBAN data.

>>> import pandas as pd
>>> from dataprep.clean.clean_es_iban import clean_es_iban
>>> df = pd.DataFrame({"iban": ["ES771234-1234-16 1234567890", "R1601101050000010547023795"]})
>>> clean_es_iban(df, 'iban')
        iban                            iban_clean
0       ES771234-1234-16 1234567890     ES77 1234 1234 1612 3456 7890
1       R1601101050000010547023795      NaN
Return type

DataFrame

dataprep.clean.clean_es_iban.validate_es_iban(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid Spanish IBAN. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
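As a rough illustration of the three output formats, the sketch below applies the generic IBAN mod-97 check and regroups the sample value by hand. The helper names are hypothetical and not part of DataPrep, and the assumption that the CCC is the 20 digits following the four-character IBAN prefix should be checked against the library before relying on it.

```python
import re


def compact_iban(value: str) -> str:
    """Drop whitespace and hyphens and upper-case the value."""
    return re.sub(r"[\s-]", "", value).upper()


def is_valid_es_iban(value: str) -> bool:
    """Check the 'ES' prefix, the length and the generic IBAN mod-97 rule."""
    number = compact_iban(value)
    if not re.fullmatch(r"ES\d{22}", number):
        return False
    # Move the first four characters to the end, map letters to 10..35.
    rearranged = number[4:] + number[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def format_iban(value: str) -> str:
    """'standard' style: groups of four characters separated by spaces."""
    number = compact_iban(value)
    return " ".join(number[i:i + 4] for i in range(0, len(number), 4))


def to_ccc(value: str) -> str:
    """Assumption: the CCC is everything after the 4-character IBAN prefix."""
    return compact_iban(value)[4:]


print(format_iban("ES771234-1234-16 1234567890"))  # ES77 1234 1234 1612 3456 7890
print(to_ccc("ES771234-1234-16 1234567890"))       # 12341234161234567890
```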

Spanish Foreigner Identity Codes

Clean and validate a DataFrame column containing Spanish foreigner identity codes (NIEs).

dataprep.clean.clean_es_nie.clean_es_nie(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Spanish foreigner identity code (NIE) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing NIE data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of NIE, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIE data.

>>> import pandas as pd
>>> from dataprep.clean.clean_es_nie import clean_es_nie
>>> df = pd.DataFrame({"nie": ["x-2482300w", "x-2482300a"]})
>>> clean_es_nie(df, 'nie')
        nie            nie_clean
0       x-2482300w     X2482300W
1       x-2482300a     NaN
Return type

DataFrame

dataprep.clean.clean_es_nie.validate_es_nie(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid NIE. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
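An NIE is usually checked like a DNI after its leading letter X, Y or Z is replaced by the digit 0, 1 or 2. The following sketch applies that rule to the sample values; `is_valid_nie` and `DNI_LETTERS` are hypothetical names for illustration only, not DataPrep functions.

```python
import re

# Assumption: same modulo-23 letter table as used for the DNI.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def is_valid_nie(value: str) -> bool:
    """Return True if the NIE has the expected shape and check letter."""
    number = re.sub(r"[\s-]", "", value).upper()
    if not re.fullmatch(r"[XYZ]\d{7}[A-Z]", number):
        return False
    # X, Y and Z stand for the leading digits 0, 1 and 2.
    digits = str("XYZ".index(number[0])) + number[1:8]
    return DNI_LETTERS[int(digits) % 23] == number[8]


print(is_valid_nie("x-2482300w"))  # True
print(is_valid_nie("x-2482300a"))  # False
```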

Spanish NIF Numbers

Clean and validate a DataFrame column containing Spanish NIF numbers (NIFs).

dataprep.clean.clean_es_nif.clean_es_nif(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Spanish NIF number data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing NIF data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of NIF, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIF data.

>>> import pandas as pd
>>> from dataprep.clean.clean_es_nif import clean_es_nif
>>> df = pd.DataFrame({"nif": ["ES B-58378431", "B64717839"]})
>>> clean_es_nif(df, 'nif')
        nif                 nif_clean
0       ES B-58378431       B58378431
1       B64717839           NaN
Return type

DataFrame

dataprep.clean.clean_es_nif.validate_es_nif(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid Spanish NIF. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Classification For Businesses In The European Union

Clean and validate a DataFrame column containing European Union business classification codes (NACE).

dataprep.clean.clean_eu_nace.clean_eu_nace(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean European Union business classification (NACE) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing NACE data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘label’, return the category label for the number.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NACE data.

>>> import pandas as pd
>>> from dataprep.clean.clean_eu_nace import clean_eu_nace
>>> df = pd.DataFrame({"nace": ["6201", "99999999999"]})
>>> clean_eu_nace(df, 'nace')
        nace            nace_clean
0       6201            62.01
1       99999999999     NaN
Return type

DataFrame

dataprep.clean.clean_eu_nace.validate_eu_nace(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid NACE code. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

European VAT Numbers

Clean and validate a DataFrame column containing European VAT numbers (VATs).

dataprep.clean.clean_eu_vat.clean_eu_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean European VAT number data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing VAT data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘country’, guess the country code based on the number and return the list of valid, lower-case country codes. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> import pandas as pd
>>> from dataprep.clean.clean_eu_vat import clean_eu_vat
>>> df = pd.DataFrame({"vat": ["ATU 57194903", "FR 61 954 506 077"]})
>>> clean_eu_vat(df, 'vat')
        vat                 vat_clean
0       ATU 57194903        ATU57194903
1       FR 61 954 506 077   FR61954506077
Return type

DataFrame

dataprep.clean.clean_eu_vat.validate_eu_vat(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid European VAT number. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Finnish ALV Numbers

Clean and validate a DataFrame column containing Finnish ALV numbers (ALVs).

dataprep.clean.clean_fi_alv.clean_fi_alv(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Finnish ALV number data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing ALV data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of ALV, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of ALV data.

>>> import pandas as pd
>>> from dataprep.clean.clean_fi_alv import clean_fi_alv
>>> df = pd.DataFrame({"alv": ["FI 20774740", "FI 20774741"]})
>>> clean_fi_alv(df, 'alv')
        alv             alv_clean
0       FI 20774740     20774740
1       FI 20774741     NaN
Return type

DataFrame

dataprep.clean.clean_fi_alv.validate_fi_alv(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid ALV number. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Finnish Personal Identity Codes

Clean and validate a DataFrame column containing Finnish personal identity codes (HETUs).

dataprep.clean.clean_fi_hetu.clean_fi_hetu(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Finnish personal identity code (HETU) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing HETU data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of HETU, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of HETU data.

>>> import pandas as pd
>>> from dataprep.clean.clean_fi_hetu import clean_fi_hetu
>>> df = pd.DataFrame({"hetu": ["131052a308t", "131052-308U"]})
>>> clean_fi_hetu(df, 'hetu')
        hetu             hetu_clean
0       131052a308t      131052A308T
1       131052-308U      NaN
Return type

DataFrame

dataprep.clean.clean_fi_hetu.validate_fi_hetu(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid HETU. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
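The HETU check character is commonly computed from the six birth-date digits and the three-digit individual number, read as one nine-digit integer modulo 31. The sketch below covers only that step and only the century signs `+`, `-` and `A`; it does not check that the date exists. `HETU_CHECK_CHARS` and `is_valid_hetu` are hypothetical names and not part of DataPrep.

```python
import re

# Assumption: 31-character check alphabet (G, I, O, Q and Z are not used).
HETU_CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"


def is_valid_hetu(value: str) -> bool:
    """Check the shape DDMMYY, century sign, NNN, check character."""
    number = value.strip().upper()
    match = re.fullmatch(r"(\d{6})([-+A])(\d{3})([0-9A-Y])", number)
    if match is None:
        return False
    birth, _century, individual, check = match.groups()
    return HETU_CHECK_CHARS[int(birth + individual) % 31] == check


print(is_valid_hetu("131052a308t"))  # True
print(is_valid_hetu("131052-308U"))  # False
```

For the sample value, 131052308 modulo 31 is 25, which selects T, so the second row with check character U is rejected.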

Finnish Business Identifiers

Clean and validate a DataFrame column containing Finnish business identifiers (y-tunnus).

dataprep.clean.clean_fi_ytunnus.clean_fi_ytunnus(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Finnish business identifier (y-tunnus) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing y-tunnus data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of y-tunnus data.

>>> import pandas as pd
>>> from dataprep.clean.clean_fi_ytunnus import clean_fi_ytunnus
>>> df = pd.DataFrame({"ytunnus": ["20774740", "2077474-1"]})
>>> clean_fi_ytunnus(df, 'ytunnus')
        ytunnus          ytunnus_clean
0       20774740         2077474-0
1       2077474-1        NaN
Return type

DataFrame

dataprep.clean.clean_fi_ytunnus.validate_fi_ytunnus(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid y-tunnus. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
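A minimal sketch of the y-tunnus rules, assuming the usual weighted checksum (weights 7, 9, 10, 5, 8, 4, 2 on the first seven digits and 1 on the check digit, with the total divisible by 11) and the standard layout of seven digits, a hyphen and the check digit. The function names are hypothetical and not part of DataPrep.

```python
import re

# Assumption: weights for the seven base digits plus 1 for the check digit.
YTUNNUS_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1)


def compact_ytunnus(value: str) -> str:
    """Drop whitespace and hyphens."""
    return re.sub(r"[\s-]", "", value)


def is_valid_ytunnus(value: str) -> bool:
    """Return True if the weighted digit sum is divisible by 11."""
    number = compact_ytunnus(value)
    if not re.fullmatch(r"\d{8}", number):
        return False
    total = sum(w * int(d) for w, d in zip(YTUNNUS_WEIGHTS, number))
    return total % 11 == 0


def format_ytunnus(value: str) -> str:
    """'standard' style: seven digits, a hyphen, then the check digit."""
    number = compact_ytunnus(value)
    return number[:7] + "-" + number[7:]


print(is_valid_ytunnus("20774740"))   # True
print(is_valid_ytunnus("2077474-1"))  # False
print(format_ytunnus("20774740"))     # 2077474-0
```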

French Tax Identification Numbers

Clean and validate a DataFrame column containing French tax identification numbers (NIFs).

dataprep.clean.clean_fr_nif.clean_fr_nif(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean French tax identification number (NIF) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing NIF data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIF data.

>>> import pandas as pd
>>> from dataprep.clean.clean_fr_nif import clean_fr_nif
>>> df = pd.DataFrame({"nif": ["0701987765432", "070198776543"]})
>>> clean_fr_nif(df, 'nif')
        nif                 nif_clean
0       0701987765432       07 01 987 765 432
1       070198776543        NaN
Return type

DataFrame

dataprep.clean.clean_fr_nif.validate_fr_nif(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid French NIF. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

French Personal Identification Numbers

Clean and validate a DataFrame column containing French personal identification numbers (NIRs).

dataprep.clean.clean_fr_nir.clean_fr_nir(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean French personal identification number (NIR) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing NIR data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIR data.

>>> import pandas as pd
>>> from dataprep.clean.clean_fr_nir import clean_fr_nir
>>> df = pd.DataFrame({"nir": ["295109912611193", "253072C07300443"]})
>>> clean_fr_nir(df, 'nir')
        nir                     nir_clean
0       295109912611193         2 95 10 99 126 111 93
1       253072C07300443         NaN
Return type

DataFrame

dataprep.clean.clean_fr_nir.validate_fr_nir(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid NIR. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
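The NIR key is commonly defined as 97 minus the first 13 digits modulo 97. The sketch below applies that rule and reproduces the spaced grouping shown in the example output. It assumes the Corsican department codes 2A and 2B are replaced by 19 and 18 before the calculation; all names are hypothetical and not part of DataPrep.

```python
import re


def compact_nir(value: str) -> str:
    """Drop whitespace and dots and upper-case the value."""
    return re.sub(r"[\s.]", "", value).upper()


def is_valid_nir(value: str) -> bool:
    """Return True if the two-digit key matches the first 13 characters."""
    number = compact_nir(value)
    # sex + year + month (5 digits), department, commune + order + key (8 digits)
    if not re.fullmatch(r"\d{5}(\d{2}|2A|2B)\d{8}", number):
        return False
    # Assumption: Corsican codes 2A / 2B count as 19 / 18 in the checksum.
    body = number[:13].replace("2A", "19").replace("2B", "18")
    return 97 - int(body) % 97 == int(number[13:])


def format_nir(value: str) -> str:
    """'standard' style grouping: 1-2-2-2-3-3-2 characters."""
    n = compact_nir(value)
    return " ".join((n[0], n[1:3], n[3:5], n[5:7], n[7:10], n[10:13], n[13:]))


print(is_valid_nir("295109912611193"))  # True
print(is_valid_nir("253072C07300443"))  # False ("2C" is not a department code)
print(format_nir("295109912611193"))    # 2 95 10 99 126 111 93
```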

French Company Identification Numbers

Clean and validate a DataFrame column containing French company identification numbers (SIRENs).

dataprep.clean.clean_fr_siren.clean_fr_siren(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean French company identification number (SIREN) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing SIREN data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘tva’, return the TVA number, which prepends two check digits to the SIREN. Note: in the case of SIREN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of SIREN data.

>>> import pandas as pd
>>> from dataprep.clean.clean_fr_siren import clean_fr_siren
>>> df = pd.DataFrame({"siren": ["552 008 443", "404833047"]})
>>> clean_fr_siren(df, 'siren')
        siren           siren_clean
0       552 008 443     552008443
1       404833047       NaN
Return type

DataFrame

dataprep.clean.clean_fr_siren.validate_fr_siren(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid SIREN. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
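A SIREN is nine digits protected by the Luhn checksum, and the ‘tva’ output prepends two check digits. The sketch below illustrates both steps under the assumption that the TVA key equals the SIREN followed by "12", taken modulo 97. `luhn_valid`, `is_valid_siren` and `siren_to_tva` are hypothetical helpers and not part of DataPrep.

```python
import re


def luhn_valid(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def compact_siren(value: str) -> str:
    """Drop whitespace, dots and hyphens."""
    return re.sub(r"[\s.-]", "", value)


def is_valid_siren(value: str) -> bool:
    number = compact_siren(value)
    return re.fullmatch(r"\d{9}", number) is not None and luhn_valid(number)


def siren_to_tva(value: str) -> str:
    """Assumption: TVA key = int(SIREN + '12') % 97, written with two digits."""
    number = compact_siren(value)
    return "%02d%s" % (int(number + "12") % 97, number)


print(is_valid_siren("552 008 443"))  # True
print(is_valid_siren("404833047"))    # False
print(siren_to_tva("552 008 443"))    # 19552008443
```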

French TVA Numbers

Clean and validate a DataFrame column containing French TVA numbers (TVAs).

dataprep.clean.clean_fr_tva.clean_fr_tva(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean French TVA number data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing TVA data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of TVA, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of TVA data.

>>> import pandas as pd
>>> from dataprep.clean.clean_fr_tva import clean_fr_tva
>>> df = pd.DataFrame({"tva": ["Fr 40 303 265 045", "84 323 140 391"]})
>>> clean_fr_tva(df, 'tva')
        tva                     tva_clean
0       Fr 40 303 265 045       40303265045
1       84 323 140 391          NaN
Return type

DataFrame

dataprep.clean.clean_fr_tva.validate_fr_tva(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid TVA number. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
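For the all-numeric form of a TVA number, the two leading digits are commonly checked against the nine-digit SIREN that follows them. The sketch below covers only that case, strips an optional FR prefix, and does not validate the embedded SIREN itself. `compact_tva` and `is_valid_tva` are hypothetical names and not part of DataPrep.

```python
import re


def compact_tva(value: str) -> str:
    """Drop separators, upper-case, and remove an optional 'FR' prefix."""
    number = re.sub(r"[\s.-]", "", value).upper()
    if number.startswith("FR"):
        number = number[2:]
    return number


def is_valid_tva(value: str) -> bool:
    """Assumption: key == int(SIREN + '12') % 97 for all-numeric numbers."""
    number = compact_tva(value)
    if not re.fullmatch(r"\d{11}", number):
        return False  # forms containing letters are outside this sketch
    return int(number[2:] + "12") % 97 == int(number[:2])


print(compact_tva("Fr 40 303 265 045"))   # 40303265045
print(is_valid_tva("Fr 40 303 265 045"))  # True
print(is_valid_tva("84 323 140 391"))     # False
```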

Stock Exchange Daily Official List Numbers

Clean and validate a DataFrame column containing Stock Exchange Daily Official List numbers (SEDOLs).

dataprep.clean.clean_gb_sedol.clean_gb_sedol(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Stock Exchange Daily Official List number (SEDOL) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing SEDOL data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘isin’, convert the number to an ISIN. Note: in the case of SEDOL, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of SEDOL data.

>>> import pandas as pd
>>> from dataprep.clean.clean_gb_sedol import clean_gb_sedol
>>> df = pd.DataFrame({"sedol": ["B15KXQ8", "B15KXQ7"]})
>>> clean_gb_sedol(df, 'sedol')
        sedol           sedol_clean
0       B15KXQ8         B15KXQ8
1       B15KXQ7         NaN
Return type

DataFrame

dataprep.clean.clean_gb_sedol.validate_gb_sedol(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid SEDOL. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
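The seventh character of a SEDOL is a check digit over the first six, commonly computed with the weights 1, 3, 1, 7, 3, 9 and letters valued A=10 through Z=35. The sketch below illustrates that calculation on the sample values; the names are hypothetical and not part of DataPrep.

```python
import re

# Assumption: standard SEDOL weights for the first six characters.
SEDOL_WEIGHTS = (1, 3, 1, 7, 3, 9)


def sedol_check_digit(first_six: str) -> str:
    """Weighted sum of the characters (base-36 values), complemented mod 10."""
    total = sum(w * int(ch, 36) for w, ch in zip(SEDOL_WEIGHTS, first_six))
    return str((10 - total % 10) % 10)


def is_valid_sedol(value: str) -> bool:
    """Six digits or consonants followed by the matching check digit."""
    number = value.strip().upper()
    if not re.fullmatch(r"[0-9BCDFGHJKLMNPQRSTVWXYZ]{6}\d", number):
        return False
    return sedol_check_digit(number[:6]) == number[6]


print(sedol_check_digit("B15KXQ"))  # 8
print(is_valid_sedol("B15KXQ8"))    # True
print(is_valid_sedol("B15KXQ7"))    # False
```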

English Unique Pupil Numbers

Clean and validate a DataFrame column containing English Unique Pupil Numbers (UPNs).

dataprep.clean.clean_gb_upn.clean_gb_upn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean English Unique Pupil Number (UPN) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing UPN data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of UPN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of UPN data.

>>> import pandas as pd
>>> from dataprep.clean.clean_gb_upn import clean_gb_upn
>>> df = pd.DataFrame({"upn": ["B801200005001", "A801200005001"]})
>>> clean_gb_upn(df, 'upn')
        upn                 upn_clean
0       B801200005001       B801200005001
1       A801200005001       NaN
Return type

DataFrame

dataprep.clean.clean_gb_upn.validate_gb_upn(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid UPN. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
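The leading letter of a UPN acts as a check character over the twelve characters that follow. The sketch below assumes the weights 2 to 13, a modulo-23 lookup, and an all-digit body; temporary UPNs ending in a letter and the local-authority code check are outside its scope. The names are hypothetical and not part of DataPrep.

```python
import re

# Assumption: 23 check letters (I, O and S are not used).
UPN_CHECK_LETTERS = "ABCDEFGHJKLMNPQRTUVWXYZ"


def upn_check_letter(last_twelve: str) -> str:
    """Weight the twelve digits with 2..13 and look up the sum modulo 23."""
    total = sum(w * int(d) for w, d in enumerate(last_twelve, start=2))
    return UPN_CHECK_LETTERS[total % 23]


def is_valid_upn(value: str) -> bool:
    """Return True if the first letter matches the computed check letter."""
    number = value.strip().upper()
    if not re.fullmatch(r"[A-Z]\d{12}", number):
        return False
    return upn_check_letter(number[1:]) == number[0]


print(upn_check_letter("801200005001"))  # B
print(is_valid_upn("B801200005001"))     # True
print(is_valid_upn("A801200005001"))     # False
```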

United Kingdom Unique Taxpayer References

Clean and validate a DataFrame column containing United Kingdom Unique Taxpayer References (UTRs).

dataprep.clean.clean_gb_utr.clean_gb_utr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean United Kingdom Unique Taxpayer Reference (UTR) data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing UTR data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of UTR, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of UTR data.

>>> import pandas as pd
>>> from dataprep.clean.clean_gb_utr import clean_gb_utr
>>> df = pd.DataFrame({"utr": ["1955839661", "2955839661"]})
>>> clean_gb_utr(df, 'utr')
        utr             utr_clean
0       1955839661      1955839661
1       2955839661      NaN
Return type

DataFrame

dataprep.clean.clean_gb_utr.validate_gb_utr(df, column='')[source]

Validate whether each cell in a DataFrame column is a valid UTR. For each cell, return True or False.

Parameters
  • df (Union[str, pandas.Series, dask.dataframe.Series, pandas.DataFrame, dask.dataframe.DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
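In a ten-digit UTR the first digit serves as the check digit for the remaining nine. The sketch below assumes the weights 6, 7, 8, 9, 10, 5, 4, 3, 2 and a lookup of the weighted sum modulo 11 in the string "21987654321"; this is an illustration with hypothetical names, not DataPrep's code.

```python
import re

# Assumptions: weights for digits 2..10 and the modulo-11 lookup table.
UTR_WEIGHTS = (6, 7, 8, 9, 10, 5, 4, 3, 2)
UTR_CHECK_TABLE = "21987654321"


def utr_check_digit(last_nine: str) -> str:
    """Weighted sum of the last nine digits, mapped through the lookup table."""
    total = sum(w * int(d) for w, d in zip(UTR_WEIGHTS, last_nine))
    return UTR_CHECK_TABLE[total % 11]


def is_valid_utr(value: str) -> bool:
    """Return True if the first digit matches the computed check digit."""
    number = value.strip()
    if not re.fullmatch(r"\d{10}", number):
        return False
    return utr_check_digit(number[1:]) == number[0]


print(utr_check_digit("955839661"))  # 1
print(is_valid_utr("1955839661"))    # True
print(is_valid_utr("2955839661"))    # False
```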

United Kingdom VAT Numbers

Clean and validate a DataFrame column containing United Kingdom VAT numbers (VATs).

dataprep.clean.clean_gb_vat.clean_gb_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean United Kingdom VAT number data in a DataFrame column.

Parameters
  • df (Union[pandas.DataFrame, dask.dataframe.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing VAT data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> import pandas as pd
>>> from dataprep.clean.clean_gb_vat import clean_gb_vat
>>> df = pd.DataFrame({"vat": ["980780684", "802311781"]})
>>> clean_gb_vat(df, 'vat')
        vat                 vat_clean
0       980780684           980 7806 84
1       802311781           NaN
Return type

DataFrame

dataprep.clean.clean_gb_vat.validate_gb_vat(df, column='')[source]

Check whether each data cell is a valid VAT number. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
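
The two rows in the cleaning example above differ only in whether they pass a checksum. The following is a minimal sketch of that idea, assuming the commonly documented modulus-97 scheme for standard 9-digit UK VAT numbers. The helper name gb_vat_checksum_ok is hypothetical, this is not DataPrep's own code, and other accepted forms (newer number ranges, branch traders, government and health authority codes) are deliberately left out.

```python
def gb_vat_checksum_ok(value: str) -> bool:
    """Sketch: weighted-sum check for a standard 9-digit UK VAT number."""
    # Keep only ASCII digits, which also drops a "GB" prefix and spaces.
    digits = [int(ch) for ch in value if ch in "0123456789"]
    if len(digits) != 9:
        return False
    # Assumed weights for the classic scheme; the last two digits act as
    # the check value, so the full weighted sum must be divisible by 97.
    weights = (8, 7, 6, 5, 4, 3, 2, 10, 1)
    total = sum(w * d for w, d in zip(weights, digits))
    return total % 97 == 0


print(gb_vat_checksum_ok("980780684"))
print(gb_vat_checksum_ok("802311781"))
```

With errors='coerce', a value that fails a check of this kind is the sort of row that ends up as NaN in the cleaned column.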

Greek Social Security Numbers

Clean and validate a DataFrame column containing Greek social security numbers (AMKAs).

dataprep.clean.clean_gr_amka.clean_gr_amka(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Greek social security numbers (AMKAs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing AMKA data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of AMKA, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of AMKA data.

>>> df = pd.DataFrame({
...     "amka": [
...         '01013099997',
...         '01013099999']
... })
>>> clean_gr_amka(df, 'amka')
        amka            amka_clean
0       01013099997     01013099997
1       01013099999     NaN
Return type

DataFrame

dataprep.clean.clean_gr_amka.validate_gr_amka(df, column='')[source]

Check whether each data cell is a valid AMKA. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
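
To make the valid and invalid rows of the AMKA example concrete, here is a rough sketch of a Luhn check over the 11 digits. It assumes the AMKA check digit follows the Luhn algorithm and skips the validation of the leading birth date; luhn_ok and amka_checksum_ok are hypothetical helper names, not part of DataPrep.

```python
def luhn_ok(number: str) -> bool:
    """Return True if a string of decimal digits passes the Luhn check."""
    if not number.isdecimal():
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def amka_checksum_ok(number: str) -> bool:
    # Assumption: 11 digits, Luhn over all of them; the date part is not checked here.
    return len(number) == 11 and luhn_ok(number)


print(amka_checksum_ok("01013099997"))
print(amka_checksum_ok("01013099999"))
```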

Greek VAT Numbers

Clean and validate a DataFrame column containing Greek VAT numbers (VATs).

dataprep.clean.clean_gr_vat.clean_gr_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Greek VAT numbers (VATs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing VAT data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
...     "vat": [
...         'EL 094259216',
...         'EL 123456781']
... })
>>> clean_gr_vat(df, 'vat')
        vat             vat_clean
0       EL 094259216    094259216
1       EL 123456781    NaN
Return type

DataFrame

dataprep.clean.clean_gr_vat.validate_gr_vat(df, column='')[source]

Check whether each data cell is a valid VAT number. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
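
As an illustration of why the second row of the Greek VAT example is rejected, the sketch below recomputes the ninth digit from the first eight. It assumes the usual doubling scheme (modulo 11, then modulo 10) and handles only 9-digit numbers with an optional EL or GR prefix; the function names are hypothetical and not taken from DataPrep.

```python
def gr_vat_check_digit(first8: str) -> str:
    """Sketch: derive the check digit from the first eight digits."""
    checksum = 0
    for ch in first8:
        checksum = checksum * 2 + int(ch)
    return str(checksum * 2 % 11 % 10)


def gr_vat_ok(value: str) -> bool:
    number = value.upper().replace(" ", "")
    if number.startswith(("EL", "GR")):
        number = number[2:]  # drop the country prefix
    if len(number) != 9 or not number.isdecimal():
        return False
    return gr_vat_check_digit(number[:8]) == number[8]


print(gr_vat_ok("EL 094259216"))
print(gr_vat_ok("EL 123456781"))
```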

Guatemala Tax Numbers

Clean and validate a DataFrame column containing Guatemala tax numbers (NITs).

dataprep.clean.clean_gt_nit.clean_gt_nit(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Guatemala tax numbers (NITs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing NIT data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIT data.

>>> df = pd.DataFrame({
...     "nit": [
...         "39525503",
...         "8977112-0"]
... })
>>> clean_gt_nit(df, 'nit')
        nit                 nit_clean
0       39525503            3952550-3
1       8977112-0           NaN
Return type

DataFrame

dataprep.clean.clean_gt_nit.validate_gt_nit(df, column='')[source]

Check whether each data cell is a valid NIT. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
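
The NIT example combines a check digit with a hyphenated standard format. A small sketch of both steps follows, assuming the widely described modulo-11 scheme in which a check value of 10 is written as K. The names gt_nit_check_digit and gt_nit_standard are hypothetical, and returning None for invalid input only mimics the effect of errors='coerce'.

```python
from typing import Optional


def gt_nit_check_digit(body: str) -> str:
    """Sketch: weights 2, 3, 4, ... are applied from the rightmost digit."""
    total = sum(i * int(d) for i, d in enumerate(reversed(body), 2))
    c = -total % 11
    return "K" if c == 10 else str(c)


def gt_nit_standard(value: str) -> Optional[str]:
    number = value.replace("-", "").replace(" ", "").upper().lstrip("0")
    body, check = number[:-1], number[-1:]
    if not body.isdecimal() or gt_nit_check_digit(body) != check:
        return None  # invalid: comparable to NaN under errors='coerce'
    return f"{body}-{check}"


print(gt_nit_standard("39525503"))
print(gt_nit_standard("8977112-0"))
```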

Croatian Identification Numbers

Clean and validate a DataFrame column containing Croatian identification numbers (OIBs).

dataprep.clean.clean_hr_oib.clean_hr_oib(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Croatian identification numbers (OIBs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing OIB data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of OIB, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of OIB data.

>>> df = pd.DataFrame({
...     "oib": [
...         'HR 33392005961',
...         '33392005962']
... })
>>> clean_hr_oib(df, 'oib')
        oib               oib_clean
0       HR 33392005961    33392005961
1       33392005962       NaN
Return type

DataFrame

dataprep.clean.clean_hr_oib.validate_hr_oib(df, column='')[source]

Check whether each data cell is a valid OIB. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
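
For readers curious about what separates the two OIB rows in the example, this is a minimal sketch of an ISO 7064 Mod 11,10 check, which is the scheme generally cited for OIB. The helper name hr_oib_ok is hypothetical, and the code is an illustration under that assumption rather than the library's implementation.

```python
def hr_oib_ok(value: str) -> bool:
    """Sketch: ISO 7064 Mod 11,10 check over an 11-digit OIB."""
    number = value.upper().replace(" ", "").replace("-", "")
    if number.startswith("HR"):
        number = number[2:]  # drop the country prefix
    if len(number) != 11 or not number.isdecimal():
        return False
    check = 5
    for ch in number:
        # A running value of 0 is treated as 10 before doubling.
        check = (((check or 10) * 2) % 11 + int(ch)) % 10
    return check == 1


print(hr_oib_ok("HR 33392005961"))
print(hr_oib_ok("33392005962"))
```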

Hungarian ANUM Numbers

Clean and validate a DataFrame column containing Hungarian ANUM numbers (ANUMs).

dataprep.clean.clean_hu_anum.clean_hu_anum(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Hungarian ANUM numbers (ANUMs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing ANUM data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of ANUM, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of ANUM data.

>>> df = pd.DataFrame({
...     "anum": [
...         'HU-12892312',
...         'HU-12892313']
... })
>>> clean_hu_anum(df, 'anum')
        anum             anum_clean
0       HU-12892312      12892312
1       HU-12892313      NaN
Return type

DataFrame

dataprep.clean.clean_hu_anum.validate_hu_anum(df, column='')[source]

Check whether each data cell is a valid ANUM. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
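
The ANUM example can be reproduced by hand with a weighted digit sum. The sketch below assumes the weights 9, 7, 3, 1 repeated over the eight digits, with a valid number summing to a multiple of 10; hu_anum_ok is a hypothetical name used only for this illustration.

```python
def hu_anum_ok(value: str) -> bool:
    """Sketch: weighted-sum check for an 8-digit Hungarian ANUM."""
    number = value.upper().replace("-", "").replace(" ", "")
    if number.startswith("HU"):
        number = number[2:]  # drop the country prefix
    if len(number) != 8 or not number.isdecimal():
        return False
    weights = (9, 7, 3, 1, 9, 7, 3, 1)
    return sum(w * int(d) for w, d in zip(weights, number)) % 10 == 0


print(hu_anum_ok("HU-12892312"))
print(hu_anum_ok("HU-12892313"))
```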

Indonesian VAT Numbers

Clean and validate a DataFrame column containing Indonesian VAT Numbers (NPWPs).

dataprep.clean.clean_id_npwp.clean_id_npwp(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Indonesian VAT numbers (NPWPs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing NPWP data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NPWP data.

>>> df = pd.DataFrame({
...     "npwp": [
...         "013000666091000",
...         "123456789"]
... })
>>> clean_id_npwp(df, 'npwp')
        npwp                    npwp_clean
0       013000666091000         01.300.066.6-091.000
1       123456789               NaN
Return type

DataFrame

dataprep.clean.clean_id_npwp.validate_id_npwp(df, column='')[source]

Check whether each data cell is a valid NPWP. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
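
The ‘standard’ output in the NPWP example groups the 15 digits with dots and a hyphen. The sketch below shows only that grouping step, inferred from the example output above; npwp_standard is a hypothetical helper and it performs no checksum validation.

```python
def npwp_standard(value: str) -> str:
    """Sketch: format 15 digits as NN.NNN.NNN.N-NNN.NNN."""
    d = "".join(ch for ch in value if ch in "0123456789")
    if len(d) != 15:
        raise ValueError("an NPWP is expected to have 15 digits")
    return f"{d[:2]}.{d[2:5]}.{d[5:8]}.{d[8]}-{d[9:12]}.{d[12:]}"


print(npwp_standard("013000666091000"))
```

Removing the separators again gives what the ‘compact’ format is described as returning.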

Irish Personal Numbers

Clean and validate a DataFrame column containing Irish personal numbers (PPSs).

dataprep.clean.clean_ie_pps.clean_ie_pps(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Irish personal numbers (PPSs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing PPS data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of PPS, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of PPS data.

>>> df = pd.DataFrame({
...     "pps": [
...         '6433435OA',
...         '6433435VH']
... })
>>> clean_ie_pps(df, 'pps')
        pps             pps_clean
0       6433435OA       6433435OA
1       6433435VH       NaN
Return type

DataFrame

dataprep.clean.clean_ie_pps.validate_ie_pps(df, column='')[source]

Check whether each data cell is a valid PPS number. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
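
In the PPS example the eighth character is a check letter. Here is a hedged sketch of how such a letter can be recomputed, assuming the modulo-23 scheme in which an optional ninth letter contributes with weight 9. Both function names are hypothetical, and rules for older number formats are not covered.

```python
import re

ALPHABET = "WABCDEFGHIJKLMNOPQRSTUV"  # position = remainder modulo 23


def ie_pps_check_char(digits7: str, suffix: str = "") -> str:
    """Sketch: weights 8..2 over the seven digits, plus 9 x the suffix letter."""
    total = sum((8 - i) * int(d) for i, d in enumerate(digits7))
    if suffix:
        total += 9 * ALPHABET.index(suffix)
    return ALPHABET[total % 23]


def ie_pps_ok(value: str) -> bool:
    number = value.upper().replace(" ", "")
    # Seven digits, a check letter, and an optional second letter.
    if not re.fullmatch(r"[0-9]{7}[A-W][A-W]?", number):
        return False
    return ie_pps_check_char(number[:7], number[8:]) == number[7]


print(ie_pps_ok("6433435OA"))
print(ie_pps_ok("6433435VH"))
```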

Irish VAT Numbers

Clean and validate a DataFrame column containing Irish VAT numbers (VATs).

dataprep.clean.clean_ie_vat.clean_ie_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Irish VAT numbers (VATs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing VAT data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
...     "vat": [
...         'IE 6433435OA',
...         '6433435E']
... })
>>> clean_ie_vat(df, 'vat')
        vat             vat_clean
0       IE 6433435OA    6433435OA
1       6433435E        NaN
Return type

DataFrame

dataprep.clean.clean_ie_vat.validate_ie_vat(df, column='')[source]

Check whether each data cell is a valid VAT number. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Israeli Company Numbers

Clean and validate a DataFrame column containing Israeli company numbers (HPs).

dataprep.clean.clean_il_hp.clean_il_hp(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Israeli company numbers (HPs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing HP data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of HP, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of HP data.

>>> df = pd.DataFrame({
...     "hp": [
...         ' 5161 79157 ',
...         '516179150']
... })
>>> clean_il_hp(df, 'hp')
        hp              hp_clean
0        5161 79157     516179157
1       516179150       NaN
Return type

DataFrame

dataprep.clean.clean_il_hp.validate_il_hp(df, column='')[source]

Check whether each data cell is a valid HP number. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Israeli Personal Numbers

Clean and validate a DataFrame column containing Israeli personal numbers (IDNRs).

dataprep.clean.clean_il_idnr.clean_il_idnr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Israeli personal numbers (IDNRs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing IDNR data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IDNR data.

>>> df = pd.DataFrame({
...     "idnr": [
...         "39337423",
...         "3933742-2"]
... })
>>> clean_il_idnr(df, 'idnr')
        idnr             idnr_clean
0       39337423         03933742-3
1       3933742-2        NaN
Return type

DataFrame

dataprep.clean.clean_il_idnr.validate_il_idnr(df, column='')[source]

Check whether each data cell is a valid IDNR. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
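
The IDNR example shows an 8-digit input becoming a zero-padded, hyphenated value. A possible sketch of that normalization follows, assuming the number is padded to nine digits, checked with the Luhn algorithm, and written with the check digit after a hyphen. il_idnr_standard is a hypothetical name, and None stands in for the NaN produced by errors='coerce'.

```python
from typing import Optional


def il_idnr_standard(value: str) -> Optional[str]:
    """Sketch: pad to 9 digits, Luhn-check, then format as NNNNNNNN-N."""
    digits = value.replace("-", "").replace(" ", "")
    if not digits.isdecimal() or len(digits) > 9:
        return None
    digits = digits.zfill(9)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch) * (2 if i % 2 else 1)  # double every second digit from the right
        total += d - 9 if d > 9 else d
    if total % 10 != 0:
        return None
    return f"{digits[:-1]}-{digits[-1]}"


print(il_idnr_standard("39337423"))
print(il_idnr_standard("3933742-2"))
```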

Indian Digital Resident Personal Identity Numbers

Clean and validate a DataFrame column containing Indian digital resident personal identity numbers (Aadhaars).

dataprep.clean.clean_in_aadhaar.clean_in_aadhaar(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Indian digital resident personal identity numbers (Aadhaars) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing Aadhaar data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘mask’, mask the first 8 digits as per MeitY guidelines for securing identity information and sensitive personal data.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of Aadhaar data.

>>> df = pd.DataFrame({
...     "aadhaar": [
...         "234123412346",
...         "643343121"]
... })
>>> clean_in_aadhaar(df, 'aadhaar')
        aadhaar                 aadhaar_clean
0       234123412346            2341 2341 2346
1       643343121               NaN
Return type

DataFrame

dataprep.clean.clean_in_aadhaar.validate_in_aadhaar(df, column='')[source]

Check whether each data cell is a valid Aadhaar number. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
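
The ‘standard’ and ‘mask’ output formats for Aadhaar can be pictured with the short sketch below. The grouping into blocks of four follows the example output above; the use of the letter X as the masking character is an assumption, both helper names are hypothetical, and no check-digit validation is attempted.

```python
def aadhaar_standard(value: str) -> str:
    """Sketch: group the 12 digits into three blocks of four."""
    d = "".join(ch for ch in value if ch in "0123456789")
    if len(d) != 12:
        raise ValueError("an Aadhaar number is expected to have 12 digits")
    return f"{d[:4]} {d[4:8]} {d[8:]}"


def aadhaar_mask(value: str) -> str:
    """Sketch: hide the first 8 digits and keep the last 4 (mask character assumed)."""
    return "XXXX XXXX " + aadhaar_standard(value)[-4:]


print(aadhaar_standard("234123412346"))
print(aadhaar_mask("234123412346"))
```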

Indian Permanent Account Numbers

Clean and validate a DataFrame column containing Indian Permanent Account numbers (PANs).

dataprep.clean.clean_in_pan.clean_in_pan(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Indian Permanent Account numbers (PANs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing PAN data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘info’, return a dictionary containing information that can be decoded from the PAN. If output_format = ‘mask’, mask the PAN as per CBDT masking standard. Note: in the case of PAN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of PAN data.

>>> df = pd.DataFrame({
...     "pan": [
...         'ACUPA7085R',
...         '234123412347']
... })
>>> clean_in_pan(df, 'pan')
        pan             pan_clean
0       ACUPA7085R      ACUPA7085R
1       234123412347    NaN
Return type

DataFrame

dataprep.clean.clean_in_pan.validate_in_pan(df, column='')[source]

Check whether each data cell is a valid PAN. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Icelandic Identity Codes

Clean and validate a DataFrame column containing Icelandic identity codes (Kennitalas).

dataprep.clean.clean_is_kennitala.clean_is_kennitala(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Icelandic identity codes (Kennitalas) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing Kennitala data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of Kennitala data.

>>> df = pd.DataFrame({
...     "kennitala": [
...         "1201743399",
...         "320174-3399"]
... })
>>> clean_is_kennitala(df, 'kennitala')
        kennitala          kennitala_clean
0       1201743399         120174-3399
1       320174-3399        NaN
Return type

DataFrame

dataprep.clean.clean_is_kennitala.validate_is_kennitala(df, column='')[source]

Check whether each data cell is a valid Kennitala. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
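
The second Kennitala in the example starts with day 32, which hints that validation looks at the date part as well as a checksum. A rough sketch under those assumptions follows: DDMMYY comes first, organisations add 40 to the day, and a modulo-11 weighted sum covers the digits. is_kennitala_ok is a hypothetical name, and the day/month test is deliberately loose (it does not check month lengths or the century digit).

```python
def is_kennitala_ok(value: str) -> bool:
    """Sketch: loose date check plus modulo-11 weighted sum."""
    number = value.replace("-", "").replace(" ", "")
    if len(number) != 10 or not number.isdecimal():
        return False
    day, month = int(number[:2]), int(number[2:4])
    if day > 40:  # assumption: organisations have 40 added to the day
        day -= 40
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return False
    weights = (3, 2, 7, 6, 5, 4, 3, 2, 1, 0)
    return sum(w * int(d) for w, d in zip(weights, number)) % 11 == 0


print(is_kennitala_ok("1201743399"))
print(is_kennitala_ok("320174-3399"))
```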

Icelandic VSK Numbers

Clean and validate a DataFrame column containing Icelandic VSK numbers (VSKs).

dataprep.clean.clean_is_vsk.clean_is_vsk(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Icelandic VSK numbers (VSKs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing VSK data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VSK, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VSK data.

>>> df = pd.DataFrame({
...     "vsk": [
...         'IS 00621',
...         'IS 0062199']
... })
>>> clean_is_vsk(df, 'vsk')
        vsk             vsk_clean
0       IS 00621        00621
1       IS 0062199      NaN
Return type

DataFrame

dataprep.clean.clean_is_vsk.validate_is_vsk(df, column='')[source]

Check whether each data cell is a valid VSK number. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Italian Code For Identification Of Drugs

Clean and validate a DataFrame column containing Italian code for identification of drugs (AICs).

dataprep.clean.clean_it_aic.clean_it_aic(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Italian codes for identification of drugs (AICs) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing AIC data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘base10’, convert a BASE32 representation to a BASE10 one. If output_format = ‘base32’, convert a BASE10 representation to a BASE32 one. Note: in the case of AIC, the compact format is the same as the standard one, and ‘compact’ output may contain both BASE10 and BASE32 representations.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of AIC data.

>>> df = pd.DataFrame({
...     "aic": [
...         '000307052',
...         '999999']
... })
>>> clean_it_aic(df, 'aic')
        aic             aic_clean
0       000307052       000307052
1       999999          NaN
Return type

DataFrame

dataprep.clean.clean_it_aic.validate_it_aic(df, column='')[source]

Check whether each data cell is a valid AIC. Return True or False for each cell.

Parameters
  • df (Union[str, pd.Series, dd.Series, pd.DataFrame, dd.DataFrame]) – The data to be validated: a string, a pandas or Dask Series, or a pandas or Dask DataFrame.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
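
For the BASE10 form of an AIC, the last digit can be recomputed from the first eight. The sketch below assumes alternating weights 1 and 2 with the decimal digits of each product added up, and assumes a valid code starts with 0. The helper names are hypothetical, and the 6-character BASE32 form is not handled here.

```python
def it_aic_check_digit(first8: str) -> str:
    """Sketch: alternate weights 1, 2 and add the digits of each product."""
    total = 0
    for weight, ch in zip((1, 2) * 4, first8):
        product = weight * int(ch)
        total += product // 10 + product % 10
    return str(total % 10)


def it_aic_base10_ok(number: str) -> bool:
    if len(number) != 9 or not number.isdecimal():
        return False
    if number[0] != "0":  # assumption: BASE10 codes begin with 0
        return False
    return it_aic_check_digit(number[:8]) == number[8]


print(it_aic_base10_ok("000307052"))
print(it_aic_base10_ok("999999"))
```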

Italian Fiscal Codes

Clean and validate a DataFrame column containing Italian fiscal codes (Codice Fiscales).

dataprep.clean.clean_it_codicefiscale.clean_it_codicefiscale(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean and standardize Italian fiscal codes (Codice Fiscales) in a DataFrame column.

Parameters
  • df (Union[pd.DataFrame, dd.DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing Codice Fiscale data.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of Codice Fiscale, the compact format is the same as the standard.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of Codice Fiscale data.

>>> df = pd.DataFrame({
        "codicefiscale": [
        'RCCMNL83S18D969H',
        'RCCMNL83S18D969']
        })
>>> clean_it_codicefiscale(df, 'codicefiscale')
        codicefiscale        codicefiscale_clean
0       RCCMNL83S18D969H     RCCMNL83S18D969H
1       RCCMNL83S18D969      NaN
Return type

DataFrame

dataprep.clean.clean_it_codicefiscale.validate_it_codicefiscale(df, column='')[source]

Validate if a data cell is Codice Fiscale in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
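
The ‘birthdate’ and ‘gender’ output formats rely on information encoded in the code itself. The sketch below shows one way to decode it, assuming the usual Codice Fiscale layout (two-digit year at positions 7-8, a month letter at position 9, and the day at positions 10-11, with 40 added for women). The function name and the pivot_year default are assumptions for illustration, and the check character is not verified here.

```python
from datetime import date

# Month letters used at position 9, January through December.
MONTH_LETTERS = "ABCDEHLMPRST"


def birthdate_and_gender(code, pivot_year=1920):
    """Decode (birthdate, gender) from a 16-character Codice Fiscale.

    Assumption: a year that would fall before pivot_year is moved to the 2000s.
    """
    year = 1900 + int(code[6:8])
    if year < pivot_year:
        year += 100
    month = MONTH_LETTERS.index(code[8]) + 1
    day = int(code[9:11])
    gender = "M"
    if day > 40:  # women have 40 added to the day of birth
        day -= 40
        gender = "F"
    return date(year, month, day), gender


result = birthdate_and_gender("RCCMNL83S18D969H")
```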

Italian IVA Numbers

Clean and validate a DataFrame column containing Italian IVA numbers (IVAs).

dataprep.clean.clean_it_iva.clean_it_iva(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Italian IVA numbers (IVAs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of IVA type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of IVA, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IVA data.

>>> df = pd.DataFrame({
        "iva": [
        'IT 00743110157',
        '00743110158',]
        })
>>> clean_it_iva(df, 'iva')
        iva                 iva_clean
0       IT 00743110157      00743110157
1       00743110158         NaN
Return type

DataFrame

dataprep.clean.clean_it_iva.validate_it_iva(df, column='')[source]

Validate if a data cell is IVA in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
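
To see why the second value in the example above is rejected, the following sketch applies the Luhn checksum, which is assumed here to be the check used for 11-digit IVA numbers. luhn_valid and compact_iva are hypothetical helpers, not part of dataprep, and a real validator may apply further rules.

```python
def luhn_valid(digits):
    """Return True if a digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def compact_iva(value):
    """Strip separators and an optional 'IT' prefix (illustrative only)."""
    value = value.replace(" ", "").replace("-", "").upper()
    return value[2:] if value.startswith("IT") else value


first = compact_iva("IT 00743110157")
first_ok = luhn_valid(first)
second_ok = luhn_valid(compact_iva("00743110158"))
```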

Japanese Corporate Numbers

Clean and validate a DataFrame column containing Japanese Corporate Numbers (CNs).

dataprep.clean.clean_jp_cn.clean_jp_cn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Japanese Corporate Numbers (CNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of CN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CN data.

>>> df = pd.DataFrame({
        "cn": [
        "5835678256246",
        "2-8356-7825-6246",]
        })
>>> clean_jp_cn(df, 'cn')
        cn                 cn_clean
0       5835678256246      5-8356-7825-6246
1       2-8356-7825-6246   NaN
Return type

DataFrame

dataprep.clean.clean_jp_cn.validate_jp_cn(df, column='')[source]

Validate if a data cell is CN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
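
The example output above can be reproduced by hand. This sketch assumes the published check-digit rule for Corporate Numbers (weights 1, 2, 1, 2, ... from the rightmost digit of the 12-digit body, check digit = 9 minus the weighted sum modulo 9) and a 1-4-4-4 grouping for the ‘standard’ format. The function names are hypothetical.

```python
def cn_check_digit(body):
    """Check digit for the 12-digit body of a Japanese Corporate Number."""
    total = 0
    for i, ch in enumerate(reversed(body)):
        weight = 1 if i % 2 == 0 else 2  # weights 1, 2, 1, 2, ... from the right
        total += weight * int(ch)
    return str(9 - total % 9)


def format_cn(value):
    """Return the 1-4-4-4 grouped form, or None if the number is invalid."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return None
    if digits[0] != cn_check_digit(digits[1:]):
        return None
    return "-".join([digits[0], digits[1:5], digits[5:9], digits[9:]])


good = format_cn("5835678256246")
bad = format_cn("2-8356-7825-6246")
```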

South Korea Business Registration Numbers

Clean and validate a DataFrame column containing South Korea Business Registration Numbers (BRNs).

dataprep.clean.clean_kr_brn.clean_kr_brn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean South Korea Business Registration Numbers (BRNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of BRN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of BRN data.

>>> df = pd.DataFrame({
        "brn": [
        "1348672683",
        "123456789",]
        })
>>> clean_kr_brn(df, 'brn')
        brn                 brn_clean
0       1348672683          134-86-72683
1       123456789           NaN
Return type

DataFrame

dataprep.clean.clean_kr_brn.validate_kr_brn(df, column='')[source]

Validate if a data cell is BRN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
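
The difference between the ‘compact’ and ‘standard’ formats for BRNs is only the 3-2-5 grouping seen in the example output. A minimal sketch, with a hypothetical format_brn helper that checks length only and skips any further validation the library may perform:

```python
def format_brn(value, output_format="standard"):
    """Format a 10-digit BRN; return None when the input is not 10 digits."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        return None
    if output_format == "compact":
        return digits  # no separators
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"  # 3-2-5 grouping


standard = format_brn("1348672683")
compact = format_brn("134-86-72683", output_format="compact")
too_short = format_brn("123456789")
```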

South Korean Resident Registration Numbers

Clean and validate a DataFrame column containing South Korean resident registration numbers (RRNs).

dataprep.clean.clean_kr_rrn.clean_kr_rrn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean South Korean resident registration numbers (RRNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of RRN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RRN data.

>>> df = pd.DataFrame({
        "rrn": [
        "971013-9019902",
        "971013-9019903",]
        })
>>> clean_kr_rrn(df, 'rrn')
        rrn                 rrn_clean
0       971013-9019902      971013-9019902
1       971013-9019903      NaN
Return type

DataFrame

dataprep.clean.clean_kr_rrn.validate_kr_rrn(df, column='')[source]

Validate if a data cell is RRN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
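
The two RRNs in the example differ only in their last digit, which is a check digit. The sketch below assumes the commonly documented rule (weights 2 through 9, then 2 through 5, over the first 12 digits, check digit = (11 - sum mod 11) mod 10). The helper names are hypothetical and the embedded birthdate is not checked.

```python
def rrn_check_digit(body):
    """Check digit for the first 12 digits of an RRN."""
    weights = (2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5)
    total = sum(w * int(d) for w, d in zip(weights, body))
    return str((11 - total % 11) % 10)


def rrn_checksum_ok(value):
    """Return True if the 13-digit number ends in the expected check digit."""
    digits = value.replace("-", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    return digits[-1] == rrn_check_digit(digits[:-1])


first_ok = rrn_checksum_ok("971013-9019902")
second_ok = rrn_checksum_ok("971013-9019903")
```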

Liechtenstein Tax Code For Individuals And Entities

Clean and validate a DataFrame column containing Liechtenstein tax code for individuals and entities (PEIDs).

dataprep.clean.clean_li_peid.clean_li_peid(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Liechtenstein tax code for individuals and entities data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of PEID type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of PEID, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of PEID data.

>>> df = pd.DataFrame({
        "peid": [
        '00001234567',
        '00001234568913454545',]
        })
>>> clean_li_peid(df, 'peid')
        peid                    peid_clean
0       00001234567             1234567
1       00001234568913454545    NaN
Return type

DataFrame

dataprep.clean.clean_li_peid.validate_li_peid(df, column='')[source]

Validate if a data cell is PEID in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
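
In the example above the cleaned PEID loses its leading zeros and the over-long value is rejected. A sketch of that behaviour follows; the accepted length range of 4 to 12 digits is an assumption, and compact_peid is a hypothetical helper rather than the library's code.

```python
def compact_peid(value, min_len=4, max_len=12):
    """Strip separators and leading zeros; return None if the length is out of range.

    The 4-12 digit range is an assumed rule for this illustration.
    """
    digits = value.replace(" ", "").replace(".", "").lstrip("0")
    if not digits.isdigit() or not min_len <= len(digits) <= max_len:
        return None
    return digits


cleaned = compact_peid("00001234567")
rejected = compact_peid("00001234568913454545")
```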

Lithuanian Personal Numbers

Clean and validate a DataFrame column containing Lithuanian personal numbers (Asmens kodas).

dataprep.clean.clean_lt_asmens.clean_lt_asmens(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Lithuanian personal numbers (Asmens kodas) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of Asmens kodas type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birthdate of the person. Note: in the case of Asmens kodas, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of Asmens kodas data.

>>> df = pd.DataFrame({
        "asmens": [
        '33309240064',
        '33309240164',]
        })
>>> clean_lt_asmens(df, 'asmens')
        asmens          asmens_clean
0       33309240064     33309240064
1       33309240164     NaN
Return type

DataFrame

dataprep.clean.clean_lt_asmens.validate_lt_asmens(df, column='')[source]

Validate if a data cell is Asmens kodas in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
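
The check digit that separates the valid and invalid values in the example can be computed by hand. This sketch assumes the two-pass modulo 11 scheme commonly described for Asmens kodas (a second set of weights is used when the first remainder is 10). asmens_check_digit is a hypothetical name.

```python
def asmens_check_digit(body):
    """Check digit for the first 10 digits of an Asmens kodas."""
    digits = [int(d) for d in body]
    first = sum(w * d for w, d in zip((1, 2, 3, 4, 5, 6, 7, 8, 9, 1), digits)) % 11
    if first < 10:
        return first
    # First remainder was 10: repeat with the shifted weights.
    second = sum(w * d for w, d in zip((3, 4, 5, 6, 7, 8, 9, 1, 2, 3), digits)) % 11
    return second % 10  # a second remainder of 10 maps to 0


valid_check = asmens_check_digit("3330924006")    # body of 33309240064
invalid_check = asmens_check_digit("3330924016")  # body of 33309240164
```

The first body yields 4, matching the final digit of 33309240064, while the second yields a digit other than 4, so 33309240164 fails.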

Lithuanian PVM Numbers

Clean and validate a DataFrame column containing Lithuanian PVM numbers (PVMs).

dataprep.clean.clean_lt_pvm.clean_lt_pvm(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Lithuanian PVM numbers (PVMs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of PVM type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of PVM, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of PVM data.

>>> df = pd.DataFrame({
        "pvm": [
        '119511515',
        '100001919018',]
        })
>>> clean_lt_pvm(df, 'pvm')
        pvm              pvm_clean
0       119511515        119511515
1       100001919018     NaN
Return type

DataFrame

dataprep.clean.clean_lt_pvm.validate_lt_pvm(df, column='')[source]

Validate if a data cell is PVM in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Luxembourgian TVA Numbers

Clean and validate a DataFrame column containing Luxembourgian TVA numbers (TVAs).

dataprep.clean.clean_lu_tva.clean_lu_tva(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Luxembourgian TVA numbers (TVAs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of TVA type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of TVA, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of TVA data.

>>> df = pd.DataFrame({
        "tva": [
        'LU 150 274 42',
        '150 274 43',]
        })
>>> clean_lu_tva(df, 'tva')
        tva                 tva_clean
0       LU 150 274 42       15027442
1       150 274 43          NaN
Return type

DataFrame

dataprep.clean.clean_lu_tva.validate_lu_tva(df, column='')[source]

Validate if a data cell is TVA in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
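
The example values differ only in their last two digits, which act as a checksum. The sketch assumes the usual rule for Luxembourgian TVA numbers: the first six digits modulo 89 must equal the last two digits. lu_tva_ok is a hypothetical helper.

```python
def lu_tva_ok(value):
    """Return True if an 8-digit TVA number passes the assumed mod 89 check."""
    digits = value.replace(" ", "").upper()
    if digits.startswith("LU"):
        digits = digits[2:]  # drop the optional country prefix
    if len(digits) != 8 or not digits.isdigit():
        return False
    return int(digits[:6]) % 89 == int(digits[6:])


first_ok = lu_tva_ok("LU 150 274 42")
second_ok = lu_tva_ok("150 274 43")
```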

Latvian PVN (VAT) Numbers

Clean and validate a DataFrame column containing Latvian PVN (VAT) numbers (PVNs).

dataprep.clean.clean_lv_pvn.clean_lv_pvn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Latvian PVN (VAT) numbers (PVNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of PVN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birthdate of the person; this format is available only when the PVN refers to a person, not a legal entity. Note: in the case of PVN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of PVN data.

>>> df = pd.DataFrame({
        "pvn": [
        '161175-19997',
        '40003521601',]
        })
>>> clean_lv_pvn(df, 'pvn')
        pvn                     pvn_clean
0       161175-19997            16117519997
1       40003521601             NaN
Return type

DataFrame

dataprep.clean.clean_lv_pvn.validate_lv_pvn(df, column='')[source]

Validate if a data cell is PVN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Monacan TVA Numbers

Clean and validate a DataFrame column containing Monacan TVA numbers (TVAs).

dataprep.clean.clean_mc_tva.clean_mc_tva(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Monacan TVA numbers (TVAs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of TVA type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of TVA, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of TVA data.

>>> df = pd.DataFrame({
        "tva": [
        '53 0000 04605',
        'FR 61 954 506 077',]
        })
>>> clean_mc_tva(df, 'tva')
        tva                     tva_clean
0       53 0000 04605           FR53000004605
1       FR 61 954 506 077       NaN
Return type

DataFrame

dataprep.clean.clean_mc_tva.validate_mc_tva(df, column='')[source]

Validate if a data cell is TVA in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Moldavian Company Identification Numbers

Clean and validate a DataFrame column containing Moldavian company identification numbers (IDNOs).

dataprep.clean.clean_md_idno.clean_md_idno(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Moldavian company identification numbers (IDNOs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of IDNO type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of IDNO, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IDNO data.

>>> df = pd.DataFrame({
        "idno": [
        '1008600038413',
        '1008600038412',]
        })
>>> clean_md_idno(df, 'idno')
        idno                idno_clean
0       1008600038413       1008600038413
1       1008600038412       NaN
Return type

DataFrame

dataprep.clean.clean_md_idno.validate_md_idno(df, column='')[source]

Validate if a data cell is IDNO in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
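
The last digit of an IDNO is a check digit, which is why 1008600038412 is rejected in the example. The sketch assumes the commonly described rule: weights 7, 3, 1 repeated over the first 12 digits, with the check digit equal to the weighted sum modulo 10. idno_check_digit is a hypothetical name.

```python
def idno_check_digit(body):
    """Check digit for the first 12 digits of a Moldavian IDNO."""
    weights = (7, 3, 1) * 4  # 7, 3, 1 repeated across 12 positions
    return str(sum(w * int(d) for w, d in zip(weights, body)) % 10)


def idno_ok(value):
    """Return True if a 13-digit IDNO ends in the expected check digit."""
    return len(value) == 13 and value.isdigit() and value[-1] == idno_check_digit(value[:-1])


first_ok = idno_ok("1008600038413")
second_ok = idno_ok("1008600038412")
```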

Montenegro IBANs

Clean and validate a DataFrame column containing Montenegro IBANs (IBANs).

dataprep.clean.clean_me_iban.clean_me_iban(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Montenegro IBANs (IBANs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of IBAN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IBAN data.

>>> df = pd.DataFrame({
        "iban": [
        "ME25510000000006234133",
        "ME52510000000006234132",]
        })
>>> clean_me_iban(df, 'iban')
        iban                                iban_clean
0       ME25510000000006234133              ME 2551 0000 0000 0623 4133
1       ME52510000000006234132              NaN
Return type

DataFrame

dataprep.clean.clean_me_iban.validate_me_iban(df, column='')[source]

Validate if a data cell is IBAN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
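
Both values in the example above satisfy the generic IBAN modulo 97 test, so the rejection of the second one has to come from a country-specific rule. The sketch below assumes that rule is a national check on the 18-digit account part (the digits after the first four characters must also be 1 modulo 97). Both function names are hypothetical.

```python
def iban_mod97_ok(iban):
    """Generic IBAN test: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and require remainder 1 mod 97."""
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def me_national_check_ok(iban):
    """Assumed Montenegro-specific test on the 18-digit account part."""
    return int(iban[4:]) % 97 == 1


first, second = "ME25510000000006234133", "ME52510000000006234132"
generic = (iban_mod97_ok(first), iban_mod97_ok(second))
national = (me_national_check_ok(first), me_national_check_ok(second))
```

Under these assumptions the generic test passes for both values, and only the national check distinguishes them.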

Maltese VAT Numbers

Clean and validate a DataFrame column containing Maltese VAT numbers (VATs).

dataprep.clean.clean_mt_vat.clean_mt_vat(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Maltese VAT numbers (VATs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of VAT type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of VAT, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of VAT data.

>>> df = pd.DataFrame({
        "vat": [
        'MT 1167-9112',
        '1167-9113',]
        })
>>> clean_mt_vat(df, 'vat')
        vat             vat_clean
0       MT 1167-9112    11679112
1       1167-9113       NaN
Return type

DataFrame

dataprep.clean.clean_mt_vat.validate_mt_vat(df, column='')[source]

Validate if a data cell is VAT in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
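
The two example values can be told apart with a weighted checksum. This sketch assumes the rule usually given for Maltese VAT numbers: weights 3, 4, 6, 7, 8, 9, 10, 1 over the eight digits, with the weighted sum divisible by 37. mt_vat_ok is a hypothetical helper and may omit rules the library applies.

```python
def mt_vat_ok(value):
    """Return True if an 8-digit Maltese VAT number passes the assumed mod 37 check."""
    digits = value.replace(" ", "").replace("-", "").upper()
    if digits.startswith("MT"):
        digits = digits[2:]  # drop the optional country prefix
    if len(digits) != 8 or not digits.isdigit():
        return False
    weights = (3, 4, 6, 7, 8, 9, 10, 1)
    return sum(w * int(d) for w, d in zip(weights, digits)) % 37 == 0


first_ok = mt_vat_ok("MT 1167-9112")
second_ok = mt_vat_ok("1167-9113")
```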

Mauritian National ID Numbers

Clean and validate a DataFrame column containing Mauritian national ID numbers (NIDs).

dataprep.clean.clean_mu_nid.clean_mu_nid(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Mauritian national ID numbers (NIDs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of NID type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the birthdate of the person. Note: in the case of NID, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NID data.

>>> df = pd.DataFrame({
        "nid": [
        'J2906201304089',
        'J2906201304088',]
        })
>>> clean_mu_nid(df, 'nid')
        nid                 nid_clean
0       J2906201304089      J2906201304089
1       J2906201304088      NaN
Return type

DataFrame

dataprep.clean.clean_mu_nid.validate_mu_nid(df, column='')[source]

Validate if a data cell is NID in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Mexican Personal Identifiers

Clean and validate a DataFrame column containing Mexican personal identifiers (CURPs).

dataprep.clean.clean_mx_curp.clean_mx_curp(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Mexican personal identifiers (CURPs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column (str) – The name of the column containing data of CURP type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of CURP, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CURP data.

>>> df = pd.DataFrame({
        "curp": [
        'BOXW310820HNERXN09',
        'BOXW310820HNERXN08']
        })
>>> clean_mx_curp(df, 'curp')
        curp                    curp_clean
0       BOXW310820HNERXN09      BOXW310820HNERXN09
1       BOXW310820HNERXN08      NaN
Return type

DataFrame

dataprep.clean.clean_mx_curp.validate_mx_curp(df, column='')[source]

Validate if a data cell is CURP in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column (str) – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
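
The ‘birthdate’ and ‘gender’ formats draw on fields embedded in the CURP. The sketch assumes the standard layout: characters 5-10 hold the date as YYMMDD, character 11 is H (male) or M (female), and character 17 is a digit for births before 2000 and a letter afterwards. The function name is hypothetical and the final check digit is not verified.

```python
from datetime import date


def curp_birthdate_and_gender(curp):
    """Decode (birthdate, gender) from an 18-character CURP."""
    year = int(curp[4:6])
    month = int(curp[6:8])
    day = int(curp[8:10])
    # Character 17 disambiguates the century (assumed rule, see above).
    century = 1900 if curp[16].isdigit() else 2000
    # Map the Spanish H/M markers to the 'M'/'F' values used in this reference.
    gender = {"H": "M", "M": "F"}[curp[10]]
    return date(century + year, month, day), gender


result = curp_birthdate_and_gender("BOXW310820HNERXN09")
```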

Mexican Tax Numbers

Clean and validate a DataFrame column containing Mexican tax numbers (RFCs).

dataprep.clean.clean_mx_rfc.clean_mx_rfc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Mexican tax numbers (RFCs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of RFC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RFC data.

>>> df = pd.DataFrame({
        "rfc": [
        "GODE561231GR8",
        "BUEI591231GH9",]
        })
>>> clean_mx_rfc(df, 'rfc')
        rfc                 rfc_clean
0       GODE561231GR8       GODE 561231 GR8
1       BUEI591231GH9       NaN
Return type

DataFrame

dataprep.clean.clean_mx_rfc.validate_mx_rfc(df, column='')[source]

Validate if a data cell is RFC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Malaysian National Registration Identity Card Numbers

Clean and validate a DataFrame column containing Malaysian National Registration Identity Card Numbers (NRICs).

dataprep.clean.clean_my_nric.clean_my_nric(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Malaysian National Registration Identity Card Numbers (NRICs) in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of NRIC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, return the registration date or the birth date. If output_format = ‘birthplace’, return a dict containing the birthplace of the person.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NRIC data.

>>> df = pd.DataFrame({
        "nric": [
        "770305021234",
        "771305-02-1234",]
        })
>>> clean_my_nric(df, 'nric')
        nric                 nric_clean
0       770305021234         770305-02-1234
1       771305-02-1234       NaN
Return type

DataFrame

dataprep.clean.clean_my_nric.validate_my_nric(df, column='')[source]

Validate if a data cell is NRIC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
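
In the NRIC example above, the second value fails because its leading six digits (771305) do not form a calendar date. The sketch below shows only that date check and the hyphenated standard layout; the function names are hypothetical, and a full validator would likely also check the two-digit birthplace code, which this sketch does not attempt.

```python
from datetime import datetime


def nric_compact(number):
    """Strip the separators commonly found in NRIC strings."""
    return number.replace("-", "").replace(" ", "")


def nric_date_ok(number):
    """Check length, digits, and that the first six digits form a YYMMDD date."""
    compact = nric_compact(number)
    if len(compact) != 12 or not compact.isdecimal():
        return False
    try:
        datetime.strptime(compact[:6], "%y%m%d")
    except ValueError:  # e.g. month 13
        return False
    return True


def nric_standard(number):
    """Lay out a 12-digit NRIC as YYMMDD-PB-NNNN."""
    compact = nric_compact(number)
    return f"{compact[:6]}-{compact[6:8]}-{compact[8:]}"
```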

BRIN Numbers

Clean and validate a DataFrame column containing Brin numbers (BRINs).

dataprep.clean.clean_nl_brin.clean_nl_brin(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Brin numbers (BRINs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of BRIN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of BRIN, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of BRIN data.

>>> df = pd.DataFrame({
        "brin": [
        '05 KO',
        '30AJ0A',]
        })
>>> clean_nl_brin(df, 'brin')
        brin             brin_clean
0       05 KO            05KO
1       30AJ0A           NaN
Return type

DataFrame

dataprep.clean.clean_nl_brin.validate_nl_brin(df, column='')[source]

Validate if a data cell is BRIN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Dutch BTW Numbers

Clean and validate a DataFrame column containing Dutch BTW numbers (BTWs).

dataprep.clean.clean_nl_btw.clean_nl_btw(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Dutch BTW numbers (BTWs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of BTW type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of BTW, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of BTW data.

>>> df = pd.DataFrame({
        "btw": [
        '004495445B01',
        '123456789B90',]
        })
>>> clean_nl_btw(df, 'btw')
        btw             btw_clean
0       004495445B01    004495445B01
1       123456789B90    NaN
Return type

DataFrame

dataprep.clean.clean_nl_btw.validate_nl_btw(df, column='')[source]

Validate if a data cell is BTW in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Norwegian IBANs

Clean and validate a DataFrame column containing Norwegian IBANs (IBANs).

dataprep.clean.clean_no_iban.clean_no_iban(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Norwegian IBANs (IBANs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of IBAN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘kontonr’, return the Norwegian bank account part of the number.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IBAN data.

>>> df = pd.DataFrame({
        "iban": [
        'NO9386011117947',
        'NO92 8601 1117 947',]
        })
>>> clean_no_iban(df, 'iban')
        iban                    iban_clean
0       NO9386011117947         NO93 8601 1117 947
1       NO92 8601 1117 947      NaN
Return type

DataFrame

dataprep.clean.clean_no_iban.validate_no_iban(df, column='')[source]

Validate if a data cell is IBAN in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
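
The two IBANs in the example above differ only in their check digits (93 versus 92). A minimal sketch of the generic IBAN mod-97 check explains why one passes and the other does not. The function names are hypothetical and this is not dataprep's implementation; a country-specific validator may also check the length and the embedded bank account number.

```python
def iban_checksum_ok(iban):
    """Generic IBAN test: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and require remainder 1 mod 97."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def iban_standard(iban):
    """Group the compact IBAN into blocks of four characters."""
    compact = iban.replace(" ", "").upper()
    return " ".join(compact[i:i + 4] for i in range(0, len(compact), 4))
```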

Norwegian Bank Account Numbers

Clean and validate a DataFrame column containing Norwegian bank account numbers (kontonrs).

dataprep.clean.clean_no_kontonr.clean_no_kontonr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Norwegian bank account numbers (kontonrs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of kontonr type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘iban’, convert the number to an IBAN.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of kontonr data.

>>> df = pd.DataFrame({
        "kontonr": [
        "8601 11 17947",
        "8601 11 17949",]
        })
>>> clean_no_kontonr(df, 'kontonr')
        kontonr               kontonr_clean
0       8601 11 17947         8601.11.17947
1       8601 11 17949         NaN
Return type

DataFrame

dataprep.clean.clean_no_kontonr.validate_no_kontonr(df, column='')[source]

Validate if a data cell is kontonr in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
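
For 11-digit account numbers such as those in the example above, the last digit is a modulus-11 check digit. The following sketch, with hypothetical function names, shows that rule and the dotted standard layout. It is an illustration under the assumption that only 11-digit numbers are considered, not a copy of dataprep's code.

```python
def kontonr_compact(number):
    """Remove the spaces and dots used as separators."""
    return number.replace(" ", "").replace(".", "")


def kontonr_is_valid(number):
    """Modulus-11 check over the first ten digits (11-digit numbers only)."""
    compact = kontonr_compact(number)
    if len(compact) != 11 or not compact.isdecimal():
        return False
    weights = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)
    total = sum(w * int(d) for w, d in zip(weights, compact))
    expected = (11 - total % 11) % 11  # a result of 10 can never match a digit
    return expected == int(compact[10])


def kontonr_standard(number):
    """Lay out the number as NNNN.NN.NNNNN."""
    compact = kontonr_compact(number)
    return f"{compact[:4]}.{compact[4:6]}.{compact[6:]}"
```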

Norwegian VAT Numbers

Clean and validate a DataFrame column containing Norwegian VAT numbers (MVAs).

dataprep.clean.clean_no_mva.clean_no_mva(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Norwegian VAT numbers (MVAs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of MVA type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of MVA data.

>>> df = pd.DataFrame({
        "mva": [
        "995525828MVA",
        "NO 995 525 829 MVA",]
        })
>>> clean_no_mva(df, 'mva')
        mva                     mva_clean
0       995525828MVA            NO 995 525 828 MVA
1       NO 995 525 829 MVA      NaN
Return type

DataFrame

dataprep.clean.clean_no_mva.validate_no_mva(df, column='')[source]

Validate if a data cell is MVA in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Norwegian Organisation Numbers

Clean and validate a DataFrame column containing Norwegian organisation numbers (Orgnrs).

dataprep.clean.clean_no_orgnr.clean_no_orgnr(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Norwegian organisation numbers (Orgnrs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of Orgnr type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of Orgnr data.

>>> df = pd.DataFrame({
        "orgnr": [
        "988077917",
        "988 077 918",]
        })
>>> clean_no_orgnr(df, 'orgnr')
        orgnr               orgnr_clean
0       988077917           988 077 917
1       988 077 918         NaN
Return type

DataFrame

dataprep.clean.clean_no_orgnr.validate_no_orgnr(df, column='')[source]

Validate if a data cell is Orgnr in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
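
The Orgnr example above fails on its second row because the ninth digit is a modulus-11 check digit over the first eight. A short sketch of that rule follows; the function names are hypothetical and the code is an illustration rather than the library's own implementation.

```python
def orgnr_is_valid(number):
    """Modulus-11 check: the ninth digit must match the weighted sum of the first eight."""
    compact = number.replace(" ", "")
    if len(compact) != 9 or not compact.isdecimal():
        return False
    weights = (3, 2, 7, 6, 5, 4, 3, 2)
    total = sum(w * int(d) for w, d in zip(weights, compact))
    expected = (11 - total % 11) % 11  # 10 means no valid check digit exists
    return expected == int(compact[8])


def orgnr_standard(number):
    """Group the nine digits in blocks of three."""
    compact = number.replace(" ", "")
    return f"{compact[:3]} {compact[3:6]} {compact[6:]}"
```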

New Zealand IRD Numbers

Clean and validate a DataFrame column containing New Zealand IRD numbers (IRDs).

dataprep.clean.clean_nz_ird.clean_nz_ird(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean New Zealand IRD numbers (IRDs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of IRD type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of IRD data.

>>> df = pd.DataFrame({
        "ird": [
        "49091850",
        "136410133",]
        })
>>> clean_nz_ird(df, 'ird')
        ird              ird_clean
0       49091850         49-091-850
1       136410133        NaN
Return type

DataFrame

dataprep.clean.clean_nz_ird.validate_nz_ird(df, column='')[source]

Validate if a data cell is IRD in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Peruvian Personal Numbers

Clean and validate a DataFrame column containing Peruvian personal numbers (CUIs).

dataprep.clean.clean_pe_cui.clean_pe_cui(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Peruvian personal numbers (CUIs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CUI type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘ruc’, convert the number to a valid RUC. Note: in the case of CUI, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CUI data.

>>> df = pd.DataFrame({
        "cui": [
        "10117410",
        "10117410-3",]
        })
>>> clean_pe_cui(df, 'cui')
        cui             cui_clean
0       10117410        10117410
1       10117410-3      NaN
Return type

DataFrame

dataprep.clean.clean_pe_cui.validate_pe_cui(df, column='')[source]

Validate if a data cell is CUI in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Peruvian Fiscal Numbers

Clean and validate a DataFrame column containing Peruvian fiscal numbers (RUCs).

dataprep.clean.clean_pe_ruc.clean_pe_ruc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Peruvian fiscal numbers (RUCs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of RUC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘dni’, return the DNI (CUI) part of the number for natural persons.

    If the RUC is not for natural persons, return NaN.

    Note: in the case of RUC, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RUC data.

>>> df = pd.DataFrame({
        "ruc": [
        "20512333797",
        "20512333798",]
        })
>>> clean_pe_ruc(df, 'ruc')
        ruc             ruc_clean
0       20512333797     20512333797
1       20512333798     NaN
Return type

DataFrame

dataprep.clean.clean_pe_ruc.validate_pe_ruc(df, column='')[source]

Validate if a data cell is RUC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Polish VAT Numbers

Clean and validate a DataFrame column containing Polish VAT numbers (NIPs).

dataprep.clean.clean_pl_nip.clean_pl_nip(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Polish VAT numbers (NIPs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of NIP type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIP data.

>>> df = pd.DataFrame({
        "nip": [
        "PL 8567346215",
        "PL 8567346216",]
        })
>>> clean_pl_nip(df, 'nip')
        nip                 nip_clean
0       PL 8567346215       856-734-62-15
1       PL 8567346216       NaN
Return type

DataFrame

dataprep.clean.clean_pl_nip.validate_pl_nip(df, column='')[source]

Validate if a data cell is NIP in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
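
A NIP carries a weighted modulus-11 check digit in its tenth position, which is why the second value in the example above is rejected. The sketch below uses hypothetical function names and is an illustration, not dataprep's implementation. It strips an optional PL prefix, applies the check, and produces the hyphenated standard layout.

```python
def nip_compact(number):
    """Drop separators and an optional 'PL' country prefix."""
    compact = number.replace(" ", "").replace("-", "").upper()
    return compact[2:] if compact.startswith("PL") else compact


def nip_is_valid(number):
    """Weighted sum of the first nine digits, modulo 11, must equal the tenth digit."""
    compact = nip_compact(number)
    if len(compact) != 10 or not compact.isdecimal():
        return False
    weights = (6, 5, 7, 2, 3, 4, 5, 6, 7)
    total = sum(w * int(d) for w, d in zip(weights, compact))
    return total % 11 == int(compact[9])  # a remainder of 10 never matches


def nip_standard(number):
    """Lay out the number as NNN-NNN-NN-NN."""
    c = nip_compact(number)
    return f"{c[:3]}-{c[3:6]}-{c[6:8]}-{c[8:]}"
```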

Polish National Identification Numbers

Clean and validate a DataFrame column containing Polish national identification numbers (PESELs).

dataprep.clean.clean_pl_pesel.clean_pl_pesel(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Polish national identification numbers (PESELs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of PESEL type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. If output_format = ‘gender’, get the person’s birth gender (‘M’ or ‘F’). Note: in the case of PESEL, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of PESEL data.

>>> df = pd.DataFrame({
        "pesel": [
        "44051401359",
        "44051401358",]
        })
>>> clean_pl_pesel(df, 'pesel')
        pesel           pesel_clean
0       44051401359     44051401359
1       44051401358     NaN
Return type

DataFrame

dataprep.clean.clean_pl_pesel.validate_pl_pesel(df, column='')[source]

Validate if a data cell is PESEL in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
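
The ‘birthdate’ and ‘gender’ output formats are possible because a PESEL encodes both, alongside a check digit. The sketch below uses hypothetical function names and shows the commonly documented PESEL layout: a weighted modulus-10 check digit, a month field offset by century, and a tenth digit whose parity gives the gender. It is not taken from dataprep, and the exact form of the library's return values is not shown here.

```python
from datetime import date


def pesel_is_valid(number):
    """Weighted modulus-10 check over the first ten digits."""
    if len(number) != 11 or not number.isdecimal():
        return False
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    total = sum(w * int(d) for w, d in zip(weights, number))
    return (10 - total % 10) % 10 == int(number[10])


def pesel_birthdate(number):
    """Decode YYMMDD, where the month is shifted by 20 per century step."""
    year, month, day = int(number[0:2]), int(number[2:4]), int(number[4:6])
    centuries = {80: 1800, 0: 1900, 20: 2000, 40: 2100, 60: 2200}
    offset = month // 20 * 20
    return date(centuries[offset] + year, month - offset, day)


def pesel_gender(number):
    """The tenth digit is odd for men and even for women."""
    return "M" if int(number[9]) % 2 else "F"
```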

Polish Register Of Economic Units

Clean and validate a DataFrame column containing Polish register of economic units (REGONs).

dataprep.clean.clean_pl_regon.clean_pl_regon(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Polish register of economic units (REGONs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of REGON type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of REGON, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of REGON data.

>>> df = pd.DataFrame({
        "regon": [
        '192598184',
        '192598183',]
        })
>>> clean_pl_regon(df, 'regon')
        regon           regon_clean
0       192598184       192598184
1       192598183       NaN
Return type

DataFrame

dataprep.clean.clean_pl_regon.validate_pl_regon(df, column='')[source]

Validate if a data cell is REGON in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Portuguese NIF Numbers

Clean and validate a DataFrame column containing Portuguese NIF numbers (NIFs).

dataprep.clean.clean_pt_nif.clean_pt_nif(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Portuguese NIF numbers (NIFs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of NIF type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of NIF, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NIF data.

>>> df = pd.DataFrame({
        "nif": [
        'PT 501 964 843',
        'PT 501 964 842',]
        })
>>> clean_pt_nif(df, 'nif')
        nif                 nif_clean
0       PT 501 964 843      501964843
1       PT 501 964 842      NaN
Return type

DataFrame

dataprep.clean.clean_pt_nif.validate_pt_nif(df, column='')[source]

Validate if a data cell is NIF in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
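
The ninth digit of a NIF is a modulus-11 check digit, which is what separates the two values in the example above. The sketch below uses hypothetical function names and is an illustration rather than the library's code. It strips an optional PT prefix and applies the check.

```python
def pt_nif_compact(number):
    """Drop separators and an optional 'PT' country prefix."""
    compact = number.replace(" ", "").replace("-", "").replace(".", "").upper()
    return compact[2:] if compact.startswith("PT") else compact


def pt_nif_is_valid(number):
    """Weights 9..2 over the first eight digits; check digits of 10 or 11 become 0."""
    compact = pt_nif_compact(number)
    if len(compact) != 9 or not compact.isdecimal():
        return False
    total = sum((9 - i) * int(d) for i, d in enumerate(compact[:8]))
    check = 11 - total % 11
    if check >= 10:
        check = 0
    return check == int(compact[8])
```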

Paraguay RUC Numbers

Clean and validate a DataFrame column containing Paraguay RUC numbers (RUCs).

dataprep.clean.clean_py_ruc.clean_py_ruc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Paraguay RUC numbers (RUCs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of RUC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of RUC data.

>>> df = pd.DataFrame({
        "ruc": [
        "800000358",
        "80123456789",]
        })
>>> clean_py_ruc(df, 'ruc')
        ruc                 ruc_clean
0       800000358           80000035-8
1       80123456789         NaN
Return type

DataFrame

dataprep.clean.clean_py_ruc.validate_py_ruc(df, column='')[source]

Validate if a data cell is RUC in a DataFrame column. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]
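
In the example above the standard format separates the final check digit with a hyphen (80000035-8). A sketch of the modulus-11 rule behind that digit follows. The function names are hypothetical, and the accepted length range (2 to 9 digits) is an assumption of this sketch rather than a documented limit.

```python
def py_ruc_compact(number):
    """Remove spaces and hyphens."""
    return number.replace(" ", "").replace("-", "")


def py_ruc_is_valid(number):
    """Weights 2, 3, 4, ... applied from the right of the body; modulus 11."""
    compact = py_ruc_compact(number)
    # Length bounds are an assumption made for this illustration.
    if not compact.isdecimal() or not 2 <= len(compact) <= 9:
        return False
    body, check = compact[:-1], int(compact[-1])
    total = sum((i + 2) * int(d) for i, d in enumerate(reversed(body)))
    return (-total % 11) % 10 == check


def py_ruc_standard(number):
    """Separate the check digit from the body with a hyphen."""
    compact = py_ruc_compact(number)
    return f"{compact[:-1]}-{compact[-1]}"
```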

Romanian CF (VAT) Numbers

Clean and validate a DataFrame column containing Romanian CF (VAT) numbers (CFs).

dataprep.clean.clean_ro_cf.clean_ro_cf(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Romanian CF (VAT) numbers (CFs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CF type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CF, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CF data.

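>>> import pandas as pd
>>> from dataprep.clean.clean_ro_cf import clean_ro_cf
>>> df = pd.DataFrame({
...     "cf": [
...         "RO 185 472 90",
...         "RO 185 472 903333",
...     ]
... })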
>>> clean_ro_cf(df, 'cf')
        cf                      cf_clean
0       RO 185 472 90           RO18547290
1       RO 185 472 903333       NaN
Return type

DataFrame
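The three values of the errors parameter differ only in what happens to a cell that cannot be parsed. The sketch below illustrates that behaviour for a single value in plain Python. It is not DataPrep's implementation: `apply_errors_policy` and the toy normalizer `strip_spaces_max_12` are hypothetical names, and the length limit used by the normalizer is a stand-in for the real CF rules.

```python
import math


def apply_errors_policy(value, normalize, errors="coerce"):
    """Normalize one cell, handling failure according to `errors`.

    `normalize` is a hypothetical callable that returns the cleaned string
    or raises ValueError when the value cannot be parsed.
    """
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors option: {errors!r}")
    try:
        return normalize(value)
    except ValueError:
        if errors == "raise":
            raise
        # 'ignore' hands back the input unchanged; 'coerce' substitutes NaN.
        return value if errors == "ignore" else math.nan


def strip_spaces_max_12(value):
    """Toy normalizer (not the CF rules): drop spaces, reject long results."""
    compact = value.replace(" ", "")
    if len(compact) > 12:
        raise ValueError(f"too long: {value!r}")
    return compact


print(apply_errors_policy("RO 185 472 90", strip_spaces_max_12))
print(apply_errors_policy("RO 185 472 903333", strip_spaces_max_12, errors="ignore"))
```

With the default 'coerce', the second value would come back as NaN instead of the original string.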

dataprep.clean.clean_ro_cf.validate_ro_cf(df, column='')[source]

Check whether each cell in a DataFrame column is a valid CF. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Romanian Numerical Personal Codes

Clean and validate a DataFrame column containing Romanian Numerical Personal Codes (CNPs).

dataprep.clean.clean_ro_cnp.clean_ro_cnp(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Romanian Numerical Personal Code (CNP) data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CNP type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘birthdate’, get the person’s birthdate. Note: in the case of CNP, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CNP data.

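>>> import pandas as pd
>>> from dataprep.clean.clean_ro_cnp import clean_ro_cnp
>>> df = pd.DataFrame({
...     "cnp": [
...         "1630615123457",
...         "0800101221142",
...     ]
... })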
>>> clean_ro_cnp(df, 'cnp')
        cnp                 cnp_clean
0       1630615123457       1630615123457
1       0800101221142       NaN
Return type

DataFrame
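The 'birthdate' output format relies on the date that is embedded in a CNP. The sketch below shows one way that date could be read, as an illustration only and not as DataPrep's code. It assumes the commonly described layout: a leading digit that encodes sex and century, six digits YYMMDD, and a final modulo-11 check digit computed with the weights 279146358279. Leading digits other than 1 to 6 are simply not decoded here, and the names `cnp_check_digit` and `cnp_birthdate` are hypothetical.

```python
from datetime import date

# Assumption: the first digit maps to a century as below; other leading
# digits are not decoded by this sketch.
_CENTURY = {"1": 1900, "2": 1900, "3": 1800, "4": 1800, "5": 2000, "6": 2000}
_WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)


def cnp_check_digit(first12: str) -> str:
    """Modulo-11 check digit over the first 12 digits (a remainder of 10 maps to 1)."""
    remainder = sum(w * int(d) for w, d in zip(_WEIGHTS, first12)) % 11
    return "1" if remainder == 10 else str(remainder)


def cnp_birthdate(cnp: str):
    """Return the embedded birth date, or None if the CNP is rejected."""
    if len(cnp) != 13 or not cnp.isdecimal() or cnp[0] not in _CENTURY:
        return None
    if cnp_check_digit(cnp[:12]) != cnp[12]:
        return None
    year = _CENTURY[cnp[0]] + int(cnp[1:3])
    try:
        return date(year, int(cnp[3:5]), int(cnp[5:7]))
    except ValueError:  # e.g. month 13 or day 32
        return None


print(cnp_birthdate("1630615123457"))  # 1963-06-15
print(cnp_birthdate("0800101221142"))  # None (leading digit 0 is not accepted)
```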

dataprep.clean.clean_ro_cnp.validate_ro_cnp(df, column='')[source]

Check whether each cell in a DataFrame column is a valid CNP. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Romanian Company Identifiers

Clean and validate a DataFrame column containing Romanian company identifiers (CUIs).

dataprep.clean.clean_ro_cui.clean_ro_cui(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Romanian company identifiers (CUIs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of CUI type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of CUI, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of CUI data.

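>>> import pandas as pd
>>> from dataprep.clean.clean_ro_cui import clean_ro_cui
>>> df = pd.DataFrame({
...     "cui": [
...         "185 472 90",
...         "185 472 91",
...     ]
... })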
>>> clean_ro_cui(df, 'cui')
        cui             cui_clean
0       185 472 90      18547290
1       185 472 91      NaN
Return type

DataFrame
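The two example values differ only in their last digit, which is why one is cleaned and the other becomes NaN. A minimal sketch of a check-digit test that reproduces this is shown below. It is an independent illustration, not DataPrep's code, and assumes the commonly described CUI scheme: the digits before the check digit are left-padded to nine places, weighted by 7, 5, 3, 2, 1, 7, 5, 3, 2, and the sum times 10 is reduced modulo 11 and then modulo 10. The function names and the 2 to 10 digit length bound are assumptions.

```python
_CUI_WEIGHTS = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def cui_check_digit(body: str) -> str:
    """Check digit for the CUI digits that precede it (assumed scheme)."""
    padded = body.zfill(9)  # left-pad with zeros to line up with the weights
    total = 10 * sum(w * int(d) for w, d in zip(_CUI_WEIGHTS, padded))
    return str(total % 11 % 10)


def is_valid_cui(value: str) -> bool:
    """True if the value has a plausible length and a matching check digit."""
    digits = value.replace(" ", "")
    # Assumption: between 2 and 10 digits, including the check digit.
    if not digits.isdecimal() or not 2 <= len(digits) <= 10:
        return False
    return cui_check_digit(digits[:-1]) == digits[-1]


print(is_valid_cui("185 472 90"))  # True
print(is_valid_cui("185 472 91"))  # False
```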

dataprep.clean.clean_ro_cui.validate_ro_cui(df, column='')[source]

Check whether each cell in a DataFrame column is a valid CUI. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Romanian Trade Register Identifiers

Clean and validate a DataFrame column containing Romanian Trade Register identifiers (ONRCs).

dataprep.clean.clean_ro_onrc.clean_ro_onrc(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Romanian Trade Register identifiers (ONRCs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of ONRC type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of ONRC, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of ONRC data.

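>>> import pandas as pd
>>> from dataprep.clean.clean_ro_onrc import clean_ro_onrc
>>> df = pd.DataFrame({
...     "onrc": [
...         "J52/750/2012",
...         "X52/750/2012",
...     ]
... })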
>>> clean_ro_onrc(df, 'onrc')
        onrc             onrc_clean
0       J52/750/2012     J52/750/2012
1       X52/750/2012     NaN
Return type

DataFrame

dataprep.clean.clean_ro_onrc.validate_ro_onrc(df, column='')[source]

Check whether each cell in a DataFrame column is a valid ONRC. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

French Company Establishment Identification Numbers

Clean and validate a DataFrame column containing French company establishment identification numbers (SIRETs).

dataprep.clean.clean_fr_siret.clean_fr_siret(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean French Company Establishment Identification Numbers (SIRETs) in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of SIRET type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. If output_format = ‘siren’, convert the SIRET number to a SIREN number. If output_format = ‘tva’, convert the SIRET number to a TVA number.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of SIRET data.

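>>> import pandas as pd
>>> from dataprep.clean.clean_fr_siret import clean_fr_siret
>>> df = pd.DataFrame({
...     "siret": [
...         "73282932000074",
...         "73282932000079",
...     ]
... })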
>>> clean_fr_siret(df, 'siret')
        siret                 siret_clean
0       73282932000074        732 829 320 00074
1       73282932000079        NaN
Return type

DataFrame
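The output formats listed above can be illustrated with a short standalone sketch. It is not DataPrep's implementation: it assumes a SIRET is 14 digits that pass the Luhn check, that the SIREN is its first nine digits, and it ignores any special cases the real rules may carry. The helper names `luhn_ok` and `format_siret` are hypothetical, and how DataPrep spaces the 'siren' output is not shown in this document, so the sketch returns it without separators. The 'tva' conversion is left out.

```python
def luhn_ok(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def format_siret(value: str, output_format: str = "standard"):
    """Return the requested form of a SIRET, or None if it is rejected."""
    digits = value.replace(" ", "")
    if len(digits) != 14 or not digits.isdecimal() or not luhn_ok(digits):
        return None
    if output_format == "compact":
        return digits
    if output_format == "siren":
        return digits[:9]  # assumption: the SIREN is the first nine digits
    # 'standard': groups of 3, 3, 3 and 5 digits separated by spaces
    return f"{digits[:3]} {digits[3:6]} {digits[6:9]} {digits[9:]}"


print(format_siret("73282932000074"))           # 732 829 320 00074
print(format_siret("73282932000074", "siren"))  # 732829320
print(format_siret("73282932000079"))           # None
```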

dataprep.clean.clean_fr_siret.validate_fr_siret(df, column='')[source]

Check whether each cell in a DataFrame column is a valid SIRET. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

United Kingdom National Health Service Patient Identifiers

Clean and validate a DataFrame column containing United Kingdom National Health Service patient identifiers (NHS numbers).

dataprep.clean.clean_gb_nhs.clean_gb_nhs(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean United Kingdom NHS numbers (NHSs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of NHS type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of NHS data.

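>>> import pandas as pd
>>> from dataprep.clean.clean_gb_nhs import clean_gb_nhs
>>> df = pd.DataFrame({
...     "nhs": [
...         "9434765870",
...         "9434765871",
...     ]
... })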
>>> clean_gb_nhs(df, 'nhs')
        nhs                 nhs_clean
0       9434765870          943 476 5870
1       9434765871          NaN
Return type

DataFrame
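An NHS number is ten digits long, and its last digit is a modulus-11 check digit over the first nine. The sketch below applies that check and produces the 3-3-4 grouping shown in the example output. It is a standalone illustration and not DataPrep's code; the names `nhs_check_digit` and `format_nhs` are hypothetical.

```python
def nhs_check_digit(first9: str):
    """Modulus-11 check digit for the first nine digits, or None if none exists."""
    # Weights run 10, 9, ..., 2 across the nine digits.
    total = sum(w * int(d) for w, d in zip(range(10, 1, -1), first9))
    check = 11 - total % 11
    if check == 11:
        return "0"
    if check == 10:
        return None  # no valid check digit for this prefix
    return str(check)


def format_nhs(value: str):
    """Return the 'standard' 3-3-4 grouping, or None if the number is rejected."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not digits.isdecimal():
        return None
    if nhs_check_digit(digits[:9]) != digits[9]:
        return None
    return f"{digits[:3]} {digits[3:6]} {digits[6:]}"


print(format_nhs("9434765870"))  # 943 476 5870
print(format_nhs("9434765871"))  # None
```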

dataprep.clean.clean_gb_nhs.validate_gb_nhs(df, column='')[source]

Check whether each cell in a DataFrame column is a valid NHS number. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Dutch Citizen Identification Numbers

Clean and validate a DataFrame column containing Burgerservicenummer, the Dutch citizen identification numbers (BSNs).

dataprep.clean.clean_nl_bsn.clean_nl_bsn(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Burgerservicenummer (BSNs) type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of BSN type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of BSN data.

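>>> import pandas as pd
>>> from dataprep.clean.clean_nl_bsn import clean_nl_bsn
>>> df = pd.DataFrame({
...     "bsn": [
...         "111222333",
...         "1112223334",
...     ]
... })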
>>> clean_nl_bsn(df, 'bsn')
        bsn                 bsn_clean
0       111222333           1112.22.333
1       1112223334          NaN
Return type

DataFrame
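A BSN is checked with the so-called 11-test: the nine digits are weighted 9, 8, ..., 2 and -1, and the weighted sum must be divisible by 11. The sketch below applies that test and produces the dotted grouping from the example output. It is an illustration rather than DataPrep's code; it assumes exactly nine digits after separators are removed, and the name `format_bsn` is hypothetical.

```python
_BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def format_bsn(value: str):
    """Return the 'standard' 4.2.3 grouping, or None if the BSN is rejected."""
    digits = value.replace(".", "").replace(" ", "")
    # Assumption: exactly nine digits once separators are removed.
    if len(digits) != 9 or not digits.isdecimal() or int(digits) == 0:
        return None
    weighted = sum(w * int(d) for w, d in zip(_BSN_WEIGHTS, digits))
    if weighted % 11 != 0:
        return None
    return f"{digits[:4]}.{digits[4:6]}.{digits[6:]}"


print(format_bsn("111222333"))   # 1112.22.333
print(format_bsn("1112223334"))  # None (ten digits)
```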

dataprep.clean.clean_nl_bsn.validate_nl_bsn(df, column='')[source]

Check whether each cell in a DataFrame column is a valid BSN. For each cell, return True or False.

Parameters
  • df (Union[str, Series, Series, DataFrame, DataFrame]) – A string, or a pandas or Dask Series or DataFrame, containing the data to be validated.

  • column – The name of the column to be validated.

Return type

Union[bool, Series, DataFrame]

Dutch Student Identification Numbers

Clean and validate a DataFrame column containing Onderwijsnummer, the Dutch student identification number.

dataprep.clean.clean_nl_onderwijsnummer.clean_nl_onderwijsnummer(df, column, output_format='standard', inplace=False, errors='coerce', progress=True)[source]

Clean Onderwijsnummer type data in a DataFrame column.

Parameters
  • df (Union[DataFrame, DataFrame]) – A pandas or Dask DataFrame containing the data to be cleaned.

  • column – The name of the column containing data of onderwijsnummer type.

  • output_format (str) –

    The output format of standardized number string. If output_format = ‘compact’, return string without any separators or whitespace. If output_format = ‘standard’, return string with proper separators and whitespace. Note: in the case of onderwijsnummer, the compact format is the same as the standard one.

    (default: “standard”)

  • inplace (bool) –

    If True, delete the column containing the data that was cleaned. Otherwise, keep the original column.

    (default: False)

  • errors (str) –

    How to handle parsing errors.
    • ‘coerce’: invalid parsing will be set to NaN.

    • ‘ignore’: invalid parsing will return the input.

    • ‘raise’: invalid parsing will raise an exception.

    (default: ‘coerce’)

  • progress (bool) –

    If True, display a progress bar.

    (default: True)

Examples

Clean a column of onderwijsnummer data.

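>>> import pandas as pd
>>> from dataprep.clean.clean_nl_onderwijsnummer import clean_nl_onderwijsnummer
>>> df = pd.DataFrame({
...     "onderwijsnummer": [
...         "1012.22.331",
...         "2112.22.337",
...     ]
... })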
>>> clean_nl_onderwijsnummer(df, 'onderwijsnummer')
        onderwijsnummer         onderwijsnummer_clean
0       1012.22.331             101222331
1       2112.22.337             NaN
Return type

DataFrame
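The second example value is rejected even though it has the right length. A sketch of a check that is consistent with this is given below, as an illustration only and not as DataPrep's code. It assumes the onderwijsnummer is nine digits, must start with "10", and uses the same weights as the BSN 11-test but with a required remainder of 5. The name `is_valid_onderwijsnummer` is hypothetical.

```python
_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def is_valid_onderwijsnummer(value: str) -> bool:
    """True if the value passes the assumed prefix and checksum rules."""
    digits = value.replace(".", "").replace(" ", "")
    if len(digits) != 9 or not digits.isdecimal():
        return False
    # Assumption: valid numbers always start with "10".
    if not digits.startswith("10"):
        return False
    weighted = sum(w * int(d) for w, d in zip(_WEIGHTS, digits))
    # Assumption: the weighted sum must leave remainder 5 (a BSN leaves 0).
    return weighted % 11 == 5


print(is_valid_onderwijsnummer("1012.22.331"))  # True
print(is_valid_onderwijsnummer("2112.22.337"))  # False (does not start with "10")
```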

dataprep.clean.clean_nl_onderwijsnummer.validate_nl_onderwijsnummer(df, column='')[source