The function clean_df() performs a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame.
In general, clean_df() is designed to be the first function called when users want to clean a DataFrame. It currently supports the following functionalities via its parameters:
clean_headers: cleans the column headers of a DataFrame and standardizes them into a desired format. It uses the default settings of the function “clean_headers()”, which convert column names to “snake” case and strip accents.
data_type_detection: performs column-wise semantic and/or atomic data type detection for the whole DataFrame.
"semantic": it helps users identify semantic variables (e.g., email) that are supported by DataPrep validating functions. Hence, users can later call the corresponding cleaning functions on these variables (e.g., “clean_email()”).
"semantic"
"atomic": it helps users identify the basic data types that are built into Python (e.g., interger, floating, string).
"atomic"
"none": no results will be returned.
"none"
standardize_missing_values: standardizes all missing entries into a uniform format.
"fill": it sets all detected missing values to be “np.nan” or “pd.NaT”.
"fill"
"remove": it removes all rows with any missing values (hence, return a complete DataFrame).
"remove"
"ignore": no actions will be taken.
"ignore"
remove_duplicate_entries: removes duplicated rows in the DataFrame.
downcast_memory: downcasts the column data types to the smallest possible types, reducing the DataFrame's memory footprint.
After cleaning, a report is printed that provides the following information:

* The number and percentage of header values cleaned by clean_headers.
* A list of semantic data types supported by DataPrep, from data_type_detection: “semantic”.
* How many data entries were removed by standardize_missing_values: “remove” and/or remove_duplicate_entries.
* How much memory was saved by downcast_memory.
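For intuition, the default snake-case conversion and accent stripping performed on headers can be approximated in plain Python. This sketch is our own illustration (the helper name to_snake_case is not part of DataPrep):

```python
import re
import unicodedata

def to_snake_case(name: str) -> str:
    """Approximate the default header cleaning: strip accents,
    collapse runs of non-alphanumeric characters into underscores,
    and lowercase the result."""
    # Decompose accented characters, then drop the combining marks
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Replace separators and punctuation with single underscores
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    return name.lower()

for header in ["Name", "AGE", "weight__", "Admission Date", "Contact (Numbers)"]:
    print(to_snake_case(header))
# name, age, weight, admission_date, contact_numbers
```

The real clean_headers() supports further case styles through its own parameters; see its documentation for details.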
The following sections demonstrate the functionality of clean_df().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Abby", "Scott", "Scott", "Scott2", np.nan, "NULL"],
    "AGE": [12, 33, 33, 56, np.nan, "NULL"],
    "weight__": [32.5, 47.1, 47.1, 55.2, np.nan, "NULL"],
    "Admission Date": ["2020-01-01", "2020-01-15", "2020-01-15", "2020-09-01", pd.NaT, "NULL"],
    "email_address": ["abby@gmail.com", "scott@gmail.com", "scott@gmail.com", "test@abc.com", np.nan, "NULL"],
    "Country of Birth": ["CA", "Canada", "Canada", "NULL", np.nan, "NULL"],
    "Contact (Numbers)": ["1-789-456-0123", "1-123-456-7890", "1-123-456-7890", "1-456-123-7890", np.nan, "NULL"],
})
df
By default, clean_df() always returns a tuple of two DataFrames: one for the detected data types and one for the cleaned DataFrame. The default parameters are:

* clean_headers: True
* data_type_detection: “semantic”
* standardize_missing_values: “fill”
* remove_duplicate_entries: False
* downcast_memory: True
[2]:
from dataprep.clean import clean_df

inferred_dtypes, cleaned_df = clean_df(df)
Data Type Detection Report:
These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Downcast Memory Report:
Memory reducted from 1939 to 1873. New size: (96.6%)
[3]:
inferred_dtypes
The first returned DataFrame presents the semantic data types for each variable. Note that when using “semantic”, the corresponding atomic results will also be returned.
[4]:
cleaned_df
The second returned DataFrame presents the cleaned DataFrame.
Note that the floating-point columns (e.g., “weight”) may appear to lose some information after downcasting. In fact, no precision is lost; this is due to how the numbers are displayed in Python (more explanation is available in section 5).
To stay efficient on medium or large datasets (>= 10,000 data entries), the function randomly samples a subset of valid data entries to best infer the semantic data types. Currently, clean_df() supports detecting the following semantic data types, each of which has a corresponding clean_$() function for users to call:

* Country Names (country)
* Email Addresses (email)
* Geographic Coordinates (coordinate)
* IP Addresses (ip)
* Phone Numbers (phone)
* URLs (url)
* US Street Addresses (address)
* …more types will be supported as development continues
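The sampling idea can be sketched as follows. The regex validators and the infer_semantic_type helper below are simplified stand-ins of our own, not DataPrep's actual validate_*() functions, and the sample size and 80% match threshold are assumptions:

```python
import re

import pandas as pd

# Stand-in validators (simplified; DataPrep uses its own validate_* functions)
VALIDATORS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").match,
    "phone": re.compile(r"^\+?[\d\-\s().]{7,}$").match,
}

def infer_semantic_type(series: pd.Series, sample_size: int = 10_000,
                        threshold: float = 0.8):
    """Sample valid (non-missing) entries and return the first semantic
    type matched by at least `threshold` of the sample, else None."""
    valid = series.dropna().astype(str)
    if len(valid) > sample_size:
        valid = valid.sample(sample_size, random_state=0)
    if valid.empty:
        return None
    for name, matches in VALIDATORS.items():
        hits = sum(bool(matches(v)) for v in valid)
        if hits / len(valid) >= threshold:
            return name
    return None

print(infer_semantic_type(pd.Series(["abby@gmail.com", "test@abc.com", None])))
# email
```

Sampling only valid entries keeps the inference robust to missing values while bounding the cost on large columns.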
Note that “semantic” also generates a report showing the semantic data types that DataPrep supports for further cleaning.
[5]:
inferred_dtypes, _ = clean_df(df, data_type_detection="semantic")
inferred_dtypes
The “atomic” option gives users an overview of the basic data types. Note that “atomic” does not generate a data type detection report.
[6]:
inferred_dtypes, _ = clean_df(df, data_type_detection="atomic")
inferred_dtypes
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Downcast Memory Report:
Memory reducted from 1939 to 1873. New size: (96.6%)
If “none”, no data type detection is performed and clean_df() returns only the cleaned DataFrame.
[7]:
cleaned_df = clean_df(df, data_type_detection="none")
cleaned_df
This section demonstrates how the “remove” option works. It removes all rows with any missing values and returns a complete DataFrame.
[8]:
_, cleaned_df = clean_df(df, standardize_missing_values="remove")
cleaned_df
Data Type Detection Report:
These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Number of Entries Cleaning Report:
3 entries dropped (50.0%)
Downcast Memory Report:
Memory reducted from 1074 to 1041. New size: (96.93%)
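Conceptually, the “fill” and “remove” options behave like the plain-pandas sketch below; the list of missing-value spellings is an assumed example, not DataPrep's actual detection list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Abby", "NULL", None],
    "age": [12, "N/A", np.nan],
})

# "fill": map common missing-value spellings to a single representation
missing_tokens = ["NULL", "N/A", "na", "?"]
filled = df.replace(missing_tokens, np.nan)

# "remove": drop every row that still contains a missing value
removed = filled.dropna(how="any").reset_index(drop=True)
print(len(removed))  # 1 complete row ("Abby") remains
```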
[9]:
_, cleaned_df = clean_df(df, remove_duplicate_entries=True)
cleaned_df
Data Type Detection Report:
These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Number of Entries Cleaning Report:
2 entries dropped (33.33%)
Downcast Memory Report:
Memory reducted from 1274 to 1230. New size: (96.55%)
Note that if multiple rows are standardized in step 3 into completely missing rows, remove_duplicate_entries = True will also treat them as duplicated entries and keep only one of them.
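A minimal demonstration of this interaction (our own toy example, not DataPrep's internals):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Abby", "NULL", np.nan],
    "age": [12, "NULL", np.nan],
})

# Standardizing missing values turns the "NULL" row into an all-NaN row,
# identical to the row that was already entirely missing ...
standardized = df.replace("NULL", np.nan)

# ... so drop_duplicates treats the two missing rows as duplicates
deduped = standardized.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2: the "Abby" row plus a single all-NaN row
```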
By default, and depending on the environment, Pandas reads numerical data and commonly stores it using the “int64/float64” types. Memory can usually be saved by assigning smaller subtypes (e.g., int8) when the range of values in a numerical column is not extreme. Pandas also stores categorical columns as object; this can be optimized by using the category type for columns with low cardinality.
[10]:
_, cleaned_df = clean_df(df, downcast_memory=True, report=False)
cleaned_df.dtypes
name                 string
age                    Int8
weight              float32
admission_date       string
email_address        string
country_of_birth     string
contact_numbers      string
dtype: object
We can see that the numeric columns of cleaned_df use the downcasted Int8 and float32 types.
However, the floating-point column “weight” may appear to lose some precision. This is due to how NumPy displays decimals.
[11]:
cleaned_df['weight']
0    32.500000
1    47.099998
2    47.099998
3    55.200001
4          NaN
5          NaN
Name: weight, dtype: float32
The actual values are not affected. For safety, float32 is used as the downcasted floating-point data type.
[12]:
cleaned_df['weight'][1]
47.1
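The display-versus-storage distinction can be checked directly with NumPy: the stored float32 value is the closest representable number to 47.1, and only the number of printed digits changes:

```python
import numpy as np

x = np.float32(47.1)

# str() shows the shortest decimal string that round-trips to the stored value
print(str(x))        # 47.1

# Requesting more digits exposes the underlying float32 rounding
print(f"{x:.6f}")    # 47.099998
```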