Pandas DataFrame

Introduction

The function clean_df() performs a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame.

In general, clean_df() is designed as a function to be called initially when users want to clean a DataFrame. It currently supports the following functionalities via parameters:

  1. clean_headers: aims to clean column headers in a DataFrame and standardizes them in a desired format. It uses the default settings from the function “clean_headers()”, which set column names to “snake” case and strip accents.

  2. data_type_detection: aims to do a column-wise semantic and/or atomic data type detection for the whole DataFrame.

    • "semantic": it helps users identify semantic variables (e.g., email) that are supported by DataPrep validating functions. Hence, users can later call the corresponding cleaning functions on these variables (e.g., “clean_email()”).

    • "atomic": it helps users identify the basic data types that are built into Python (e.g., interger, floating, string).

    • "none": no results will be returned.

  3. standardize_missing_values: aims to standardize all missing entries to be a uniform format.

    • "fill": it sets all detected missing values to be “np.nan” or “pd.NaT”.

    • "remove": it removes all rows with any missing values (hence, return a complete DataFrame).

    • "ignore": no actions will be taken.

  4. remove_duplicate_entries: aims to remove the repetitive rows in the DataFrame.

  5. downcast_memory: aims to downcast the column data types to be the smallest possible, which saves storage memory for the DataFrame.

After cleaning, a report is printed that provides the following information: * The number and percentage of header values that were cleaned from clean_headers. * A list of supported semantic data types by DataPrep from data_type_detection: “semantic”. * How many data entries are removed from standardize_missing_values: “remove” and/or remove_duplicate_entries. * How much memory is saved only from downcast_memory.

The following sections demonstrate the functionality of clean_df().

An example dirty dataset

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name":
                   ["Abby", "Scott", "Scott", "Scott2", np.nan, "NULL"],
                   "AGE":
                   [12, 33, 33, 56,  np.nan, "NULL"],
                   "weight__":
                   [32.5, 47.1, 47.1, 55.2, np.nan, "NULL"],
                   "Admission Date":
                   ["2020-01-01", "2020-01-15", "2020-01-15",
                    "2020-09-01", pd.NaT, "NULL"],
                   "email_address":
                   ["abby@gmail.com","scott@gmail.com", "scott@gmail.com", "test@abc.com", np.nan, "NULL"],
                   "Country of Birth":
                   ["CA","Canada", "Canada", "NULL", np.nan, "NULL"],
                   "Contact (Numbers)":
                   ["1-789-456-0123","1-123-456-7890","1-123-456-7890","1-456-123-7890", np.nan, "NULL" ],

})
df
[1]:
Name AGE weight__ Admission Date email_address Country of Birth Contact (Numbers)
0 Abby 12 32.5 2020-01-01 abby@gmail.com CA 1-789-456-0123
1 Scott 33 47.1 2020-01-15 scott@gmail.com Canada 1-123-456-7890
2 Scott 33 47.1 2020-01-15 scott@gmail.com Canada 1-123-456-7890
3 Scott2 56 55.2 2020-09-01 test@abc.com NULL 1-456-123-7890
4 NaN NaN NaN NaT NaN NaN NaN
5 NULL NULL NULL NULL NULL NULL NULL

1. Default clean_df()

By default, clean_df() will always return a tuple of two DataFrames, with one for data types and another one for the cleaned DataFrame. The default parameters are set to be: * clean_headers: True * data_type_detection: “semantic” * standardize_missing_values: “fill” * remove_duplicate_entries: False * downcast_memory: True

[2]:
from dataprep.clean import clean_df
inferred_dtypes, cleaned_df = clean_df(df)
Data Type Detection Report:
        These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
        6 values cleaned (85.71%)
Downcast Memory Report:
        Memory reducted from 1939 to 1873. New size: (96.6%)
[3]:
inferred_dtypes
[3]:
semantic_data_type atomic_data_type
Name string string
AGE integer integer
weight__ floating floating
Admission Date string string
email_address email string
Country of Birth country string
Contact (Numbers) phone string

The first returned DataFrame presents the semantic data types for each variable. Note that when using “semantic”, the corresponding atomic results will also be returned.

[4]:
cleaned_df
[4]:
name age weight admission_date email_address country_of_birth contact_numbers
0 Abby 12 32.500000 2020-01-01 abby@gmail.com CA 1-789-456-0123
1 Scott 33 47.099998 2020-01-15 scott@gmail.com Canada 1-123-456-7890
2 Scott 33 47.099998 2020-01-15 scott@gmail.com Canada 1-123-456-7890
3 Scott2 56 55.200001 2020-09-01 test@abc.com <NA> 1-456-123-7890
4 <NA> <NA> NaN <NA> <NA> <NA> <NA>
5 <NA> <NA> NaN <NA> <NA> <NA> <NA>

The second returned DataFrame presents the cleaned DataFrame.

Note that the floating columns (e.g., “weight”) may look to lose some information by downcasting memory. Actually nowhere is precision lost, it is due to how the numbers get displayed in Python (more explanations are available in section 5).

2. data_type_detection parameter

semantic

Consider the efficiency on medium or large datasets (>=10,000 data entries), the function will randomly query a subset of valid data entries to best infer the semantic data types. Currently, clean_df() supports to detect the following semantic data types, which all have the corresponding clean_$() function for users to call. * Country Names (country) * Email Addresses (email) * Geographic Coordinates (coordinate) * IP Address (ip) * Phone Numbers (phone) * URLs (url) * US Street Addresses (address) * …more types will be supported with the development

Note that “semantic” will also generate a report to show the semantic data types that are supported by DataPrep to further clean.

[5]:
inferred_dtypes, _ = clean_df(df, data_type_detection = "semantic")
inferred_dtypes
Data Type Detection Report:
        These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
        6 values cleaned (85.71%)
Downcast Memory Report:
        Memory reducted from 1939 to 1873. New size: (96.6%)
[5]:
semantic_data_type atomic_data_type
Name string string
AGE integer integer
weight__ floating floating
Admission Date string string
email_address email string
Country of Birth country string
Contact (Numbers) phone string

atomic

The “atomic” option can give users an overview of the data types. Note that “atomic” does not generate a report.

[6]:
inferred_dtypes, _ = clean_df(df, data_type_detection = "atomic")
inferred_dtypes
Column Headers Cleaning Report:
        6 values cleaned (85.71%)
Downcast Memory Report:
        Memory reducted from 1939 to 1873. New size: (96.6%)
[6]:
atomic_data_type
Name string
AGE integer
weight__ floating
Admission Date string
email_address string
Country of Birth string
Contact (Numbers) string

none

If “none”, then no data type detection will be performed and clean_df() will return a single cleaned DataFrame only.

[7]:
cleaned_df = clean_df(df, data_type_detection = "none")
cleaned_df
Column Headers Cleaning Report:
        6 values cleaned (85.71%)
Downcast Memory Report:
        Memory reducted from 1939 to 1873. New size: (96.6%)
[7]:
name age weight admission_date email_address country_of_birth contact_numbers
0 Abby 12 32.500000 2020-01-01 abby@gmail.com CA 1-789-456-0123
1 Scott 33 47.099998 2020-01-15 scott@gmail.com Canada 1-123-456-7890
2 Scott 33 47.099998 2020-01-15 scott@gmail.com Canada 1-123-456-7890
3 Scott2 56 55.200001 2020-09-01 test@abc.com <NA> 1-456-123-7890
4 <NA> <NA> NaN <NA> <NA> <NA> <NA>
5 <NA> <NA> NaN <NA> <NA> <NA> <NA>

3. standardize_missing_values: “remove”

This section demonstrates how the “remove” option works. It removes all rows with any missing values and returns a complete DataFrame.

[8]:
_, cleaned_df = clean_df(df, standardize_missing_values = "remove")
cleaned_df
Data Type Detection Report:
        These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
        6 values cleaned (85.71%)
Number of Entries Cleaning Report:
        3 entries dropped (50.0%)
Downcast Memory Report:
        Memory reducted from 1074 to 1041. New size: (96.93%)
[8]:
name age weight admission_date email_address country_of_birth contact_numbers
0 Abby 12 32.500000 2020-01-01 abby@gmail.com CA 1-789-456-0123
1 Scott 33 47.099998 2020-01-15 scott@gmail.com Canada 1-123-456-7890
2 Scott 33 47.099998 2020-01-15 scott@gmail.com Canada 1-123-456-7890

4. remove_duplicate_entries parameter

[9]:
_, cleaned_df = clean_df(df, remove_duplicate_entries = True)
cleaned_df
Data Type Detection Report:
        These data types are supported by DataPrep to clean: ['email', 'phone', 'country']
Column Headers Cleaning Report:
        6 values cleaned (85.71%)
Number of Entries Cleaning Report:
        2 entries dropped (33.33%)
Downcast Memory Report:
        Memory reducted from 1274 to 1230. New size: (96.55%)
[9]:
name age weight admission_date email_address country_of_birth contact_numbers
0 Abby 12 32.500000 2020-01-01 abby@gmail.com CA 1-789-456-0123
1 Scott 33 47.099998 2020-01-15 scott@gmail.com Canada 1-123-456-7890
2 Scott2 56 55.200001 2020-09-01 test@abc.com <NA> 1-456-123-7890
3 <NA> <NA> NaN <NA> <NA> <NA> <NA>

Note that if multiple rows are all standardized in step 3 to be completely missing rows, this remove_duplicate_entries = True will also treat them as duplicated entries and only keep one.

5. downcast_memory parameter

By default and depending on the environment, Pandas will read numerical data and commonly store them using “int64/float64” types. The storage memory can usually be saved by assigning subtypes (e.g., int8), if the range of all data in a numerical column is not extreme. Also, Pandas stores categorical columns as object. This can be optimized by using the category type for columns with low cardinality.

[10]:
_, cleaned_df = clean_df(df, downcast_memory = True, report = False)
cleaned_df.dtypes
[10]:
name                 string
age                    Int8
weight              float32
admission_date       string
email_address        string
country_of_birth     string
contact_numbers      string
dtype: object

We can see that the numeric columns for cleaned_df use downcasted int8 and float32 types.

However, the floating column “weight” may look to lose some precision. This is due to how Python Numpy displays decimals.

[11]:
cleaned_df['weight']
[11]:
0    32.500000
1    47.099998
2    47.099998
3    55.200001
4          NaN
5          NaN
Name: weight, dtype: float32

The actual values are not influenced. For safety, Numpy tends to use float32 as the downcasted data type.

[12]:
cleaned_df['weight'][1]
[12]:
47.1