The function clean_df() performs a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame.
In general, clean_df() is designed to be the first function called when users want to clean a DataFrame. It currently supports the following functionalities via its parameters:
clean_headers: cleans the column headers of a DataFrame and standardizes them into a desired format. It uses the default settings of the function “clean_headers()”, which convert column names to “snake” case and strip accents.
data_type_detection: performs column-wise semantic and/or atomic data type detection for the whole DataFrame.
"semantic": it helps users identify semantic variables (e.g., email) that are supported by DataPrep validating functions. Hence, users can later call the corresponding cleaning functions on these variables (e.g., “clean_email()”).
"semantic"
"atomic": it helps users identify the basic data types that are built into Python (e.g., interger, floating, string).
"atomic"
"none": no results will be returned.
"none"
standardize_missing_values: standardizes all missing entries into a uniform format.
"fill": it sets all detected missing values to be “np.nan” or “pd.NaT”.
"fill"
"remove": it removes all rows with any missing values (hence, return a complete DataFrame).
"remove"
"ignore": no actions will be taken.
"ignore"
remove_duplicate_entries: removes duplicated rows in the DataFrame.
downcast_memory: downcasts the column data types to the smallest possible types, reducing the DataFrame's memory footprint.
After cleaning, a report is printed that provides the following information:

* The number and percentage of header values cleaned by clean_headers.
* A list of semantic data types supported by DataPrep, from data_type_detection: “semantic”.
* How many data entries were removed by standardize_missing_values: “remove” and/or remove_duplicate_entries.
* How much memory was saved by downcast_memory.
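For intuition, the default snake-case conversion and accent stripping performed on headers can be approximated in plain Python. This sketch is our own illustration (the helper name to_snake_case is not part of DataPrep):

```python
import re
import unicodedata

def to_snake_case(name: str) -> str:
    """Approximate the default header cleaning: strip accents,
    collapse runs of non-alphanumeric characters into underscores,
    and lowercase the result."""
    # Decompose accented characters, then drop the combining marks
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Replace separators and punctuation with single underscores
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    return name.lower()

for header in ["Name", "AGE", "weight__", "Admission Date", "Contact (Numbers)"]:
    print(to_snake_case(header))
# name, age, weight, admission_date, contact_numbers
```

The real clean_headers() supports further case styles through its own parameters; see its documentation for details.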
The following sections demonstrate the functionality of clean_df().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Abby", "Scott", "Scott", "Scott2", np.nan, "NULL"],
    "AGE": [12, 33, 33, 56, np.nan, "NULL"],
    "weight__": [32.5, 47.1, 47.1, 55.2, np.nan, "NULL"],
    "Admission Date": ["2020-01-01", "2020-01-15", "2020-01-15", "2020-09-01", pd.NaT, "NULL"],
    "email_address": ["abby@gmail.com", "scott@gmail.com", "scott@gmail.com", "test@abc.com", np.nan, "NULL"],
    "Country of Birth": ["CA", "Canada", "Canada", "NULL", np.nan, "NULL"],
    "Contact (Numbers)": ["1-789-456-0123", "1-123-456-7890", "1-123-456-7890", "1-456-123-7890", np.nan, "NULL"],
})
df
By default, clean_df() always returns a tuple of two DataFrames: one for the detected data types and one for the cleaned DataFrame. The default parameters are:

* clean_headers: True
* data_type_detection: “semantic”
* standardize_missing_values: “fill”
* remove_duplicate_entries: False
* downcast_memory: True
[2]:
from dataprep.clean import clean_df

inferred_dtypes, cleaned_df = clean_df(df)
Data Type Detection Report:
These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Downcast Memory Report:
Memory reducted from 1939 to 1873. New size: (96.6%)
[3]:
inferred_dtypes
The first returned DataFrame presents the semantic data types for each variable. Note that when using “semantic”, the corresponding atomic results will also be returned.
[4]:
cleaned_df
The second returned DataFrame presents the cleaned DataFrame.
Note that the floating-point columns (e.g., “weight”) may appear to lose some information after downcasting. In fact, no precision is lost; this is due to how the numbers are displayed in Python (more explanation is available in section 5).
To stay efficient on medium or large datasets (>= 10,000 data entries), the function randomly samples a subset of valid data entries to best infer the semantic data types. Currently, clean_df() supports detecting the following semantic data types, each of which has a corresponding clean_$() function for users to call:

* Country Names (country)
* Email Addresses (email)
* Geographic Coordinates (coordinate)
* IP Addresses (ip)
* Phone Numbers (phone)
* URLs (url)
* US Street Addresses (address)
* …more types will be supported as development continues
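The sampling idea can be sketched as follows. The regex validators and the infer_semantic_type helper below are simplified stand-ins of our own, not DataPrep's actual validate_*() functions, and the sample size and 80% match threshold are assumptions:

```python
import re

import pandas as pd

# Stand-in validators (simplified; DataPrep uses its own validate_* functions)
VALIDATORS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").match,
    "phone": re.compile(r"^\+?[\d\-\s().]{7,}$").match,
}

def infer_semantic_type(series: pd.Series, sample_size: int = 10_000,
                        threshold: float = 0.8):
    """Sample valid (non-missing) entries and return the first semantic
    type matched by at least `threshold` of the sample, else None."""
    valid = series.dropna().astype(str)
    if len(valid) > sample_size:
        valid = valid.sample(sample_size, random_state=0)
    if valid.empty:
        return None
    for name, matches in VALIDATORS.items():
        hits = sum(bool(matches(v)) for v in valid)
        if hits / len(valid) >= threshold:
            return name
    return None

print(infer_semantic_type(pd.Series(["abby@gmail.com", "test@abc.com", None])))
# email
```

Sampling only valid entries keeps the inference robust to missing values while bounding the cost on large columns.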
Note that “semantic” also generates a report showing the semantic data types that DataPrep supports for further cleaning.
[5]:
inferred_dtypes, _ = clean_df(df, data_type_detection="semantic")
inferred_dtypes
The “atomic” option gives users an overview of the basic data types. Note that “atomic” does not generate a data type detection report.
[6]:
inferred_dtypes, _ = clean_df(df, data_type_detection="atomic")
inferred_dtypes
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Downcast Memory Report:
Memory reducted from 1939 to 1873. New size: (96.6%)
If “none”, no data type detection is performed and clean_df() returns only the cleaned DataFrame.
[7]:
cleaned_df = clean_df(df, data_type_detection="none")
cleaned_df
This section demonstrates how the “remove” option works. It removes all rows with any missing values and returns a complete DataFrame.
[8]:
_, cleaned_df = clean_df(df, standardize_missing_values="remove")
cleaned_df
Data Type Detection Report:
These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Number of Entries Cleaning Report:
3 entries dropped (50.0%)
Downcast Memory Report:
Memory reducted from 1074 to 1041. New size: (96.93%)
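Conceptually, the “fill” and “remove” options behave like the plain-pandas sketch below; the list of missing-value spellings is an assumed example, not DataPrep's actual detection list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Abby", "NULL", None],
    "age": [12, "N/A", np.nan],
})

# "fill": map common missing-value spellings to a single representation
missing_tokens = ["NULL", "N/A", "na", "?"]
filled = df.replace(missing_tokens, np.nan)

# "remove": drop every row that still contains a missing value
removed = filled.dropna(how="any").reset_index(drop=True)
print(len(removed))  # 1 complete row ("Abby") remains
```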
[9]:
_, cleaned_df = clean_df(df, remove_duplicate_entries=True)
cleaned_df
Data Type Detection Report:
These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
6 values cleaned (85.71%)
Number of Entries Cleaning Report:
2 entries dropped (33.33%)
Downcast Memory Report:
Memory reducted from 1274 to 1230. New size: (96.55%)
Note that if multiple rows are standardized in step 3 into completely missing rows, remove_duplicate_entries = True will also treat them as duplicated entries and keep only one of them.
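A minimal demonstration of this interaction (our own toy example, not DataPrep's internals):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Abby", "NULL", np.nan],
    "age": [12, "NULL", np.nan],
})

# Standardizing missing values turns the "NULL" row into an all-NaN row,
# identical to the row that was already entirely missing ...
standardized = df.replace("NULL", np.nan)

# ... so drop_duplicates treats the two missing rows as duplicates
deduped = standardized.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2: the "Abby" row plus a single all-NaN row
```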
By default, and depending on the environment, Pandas reads numerical data and commonly stores it using the “int64/float64” types. Memory can usually be saved by assigning smaller subtypes (e.g., int8) when the range of values in a numerical column is not extreme. Pandas also stores categorical columns as object; this can be optimized by using the category type for columns with low cardinality.
[10]:
_, cleaned_df = clean_df(df, downcast_memory=True, report=False)
cleaned_df.dtypes
name                 string
age                    Int8
weight              float32
admission_date       string
email_address        string
country_of_birth     string
contact_numbers      string
dtype: object
We can see that the numeric columns of cleaned_df use the downcasted Int8 and float32 types.
However, the floating-point column “weight” may appear to lose some precision. This is due to how NumPy displays decimals.
[11]:
cleaned_df['weight']
0    32.500000
1    47.099998
2    47.099998
3    55.200001
4          NaN
5          NaN
Name: weight, dtype: float32
The actual values are not affected. For safety, float32 is used as the downcasted floating-point data type.
[12]:
cleaned_df['weight'][1]
47.1
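The display-versus-storage distinction can be checked directly with NumPy: the stored float32 value is the closest representable number to 47.1, and only the number of printed digits changes:

```python
import numpy as np

x = np.float32(47.1)

# str() shows the shortest decimal string that round-trips to the stored value
print(str(x))        # 47.1

# Requesting more digits exposes the underlying float32 rounding
print(f"{x:.6f}")    # 47.099998
```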