clean_ml()
The function clean_ml() cleans a dataset for downstream machine learning tasks using commonly used operators. It handles categorical columns and numerical columns separately. The default cleaning pipeline follows the conventions of existing tools.
Currently, the supported components and operators are listed below:

* cat_encoding: encoding categorical columns
    * no_encoding
    * one_hot
* cat_imputation: imputing missing values in categorical columns
    * constant
    * most_frequent
    * drop
* num_imputation: imputing missing values in numerical columns
    * mean
    * median
    * most_frequent
    * drop
* num_scaling: scaling numerical columns
    * standardize
    * minmax
    * maxabs
* variance_threshold: dropping numerical columns with low variance
Users can also specify include_operators and exclude_operators to include or exclude specific operators from the list above. Users can also customize the pipeline with user-defined operators.
The example dataset is the classic adult dataset. It has 48842 rows and 15 columns. In this dataset, '?' denotes a missing value.
[4]:
import pandas as pd

pd.set_option('display.min_rows', 30)
df = pd.read_csv('adult.csv')
df
48842 rows × 15 columns
[5]:
training_rate = 0.7
index = df.index
number_of_rows = len(index)
training_df = df.iloc[:int(training_rate * number_of_rows), :]
test_df = df.iloc[int(training_rate * number_of_rows):, :]
[6]:
training_df
34189 rows × 15 columns
[7]:
test_df
14653 rows × 15 columns
By default, the cleaning pipeline of the clean_ml() function is:

* For categorical columns: constant imputation -> one-hot encoding
* For numerical columns: mean imputation -> standardization
The default NULL values are: {np.nan, float("NaN"), "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null", "", None}
The default filling value for categorical columns is 'missing_value'.
[8]:
from dataprep.clean import clean_ml

cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class")
[9]:
cleaned_training_df
[10]:
cleaned_test_df
cat_imputation and cat_null_value
There are three choices for the cat_imputation parameter:

* constant: fill missing values with a constant value. The default is 'missing_value'.
* most_frequent: fill missing values with the most frequent value of the column.
* drop: drop a column if it contains missing values.
The cat_null_value parameter is a list of user-specified null values. The elements of this list can be of any type. For example:

* ['?']
* ['abc', np.nan, '?', 1265]
By default, the specified missing values are replaced by "missing_value".
[18]:
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    cat_imputation="constant",
    cat_encoding="no_encoding",
    cat_null_value=['?'],
)
[19]:
cleaned_training_df
[20]:
cleaned_test_df
[21]:
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    cat_imputation="most_frequent",
    cat_encoding="no_encoding",
    cat_null_value=['?'],
)
[22]:
cleaned_training_df
[23]:
cleaned_test_df
[24]:
# with cat_imputation="drop", categorical columns containing null values are removed
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    cat_imputation="drop",
    cat_encoding="no_encoding",
    cat_null_value=['?'],
)
[25]:
cleaned_training_df
34189 rows × 12 columns
[26]:
cleaned_test_df
14653 rows × 12 columns
fill_val
By default, the filling value for categorical missing values is "missing value". However, users can specify any string they like, such as "missing", "NaN", "I'm a cat.", or "Fyodor Dostoyevsky".
[30]:
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    cat_null_value=['?'],
    cat_encoding="no_encoding",
    fill_val="AHAHAHAHAHA!!!",
)
[31]:
cleaned_training_df
[32]:
cleaned_test_df
num_imputation and num_null_value
There are four choices for the num_imputation parameter:

* mean: fill missing values with the mean value of the column.
* median: fill missing values with the median value of the column.
* most_frequent: fill missing values with the most frequent value of the column.
* drop: drop a column if it contains missing values.
The default null values are the same as the null values mentioned in the cat_imputation parameter section. The imputation process is also quite similar to that of the cat_imputation parameter, so we don't repeat those examples here.
The num_null_value parameter is a list of user-specified null values. The elements of this list can be of any type. For example:

* ['?']
* ['abc', np.nan, '?', 1265]

The usage of the num_null_value parameter is the same as that of cat_null_value; a minimal sketch is shown below.
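A minimal sketch, by analogy with the cat_imputation examples above (the parameter values here are illustrative, not taken from the original guide):

# hypothetical example: median imputation with a custom null token
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    num_imputation="median",
    num_null_value=['?'],
    cat_encoding="no_encoding",
)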
cat_encoding
There are two choices for the cat_encoding parameter:

* no_encoding: do not apply any encoding to categorical columns.
* one_hot: apply one-hot encoding to categorical columns.

The default value is one_hot.
[36]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", cat_encoding="no_encoding")
[37]:
cleaned_training_df
[38]:
cleaned_test_df
[39]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", cat_encoding="one_hot")
[40]:
cleaned_training_df
[41]:
cleaned_test_df
variance_threshold and variance
There are two choices for the variance_threshold parameter:

* True: filter out numerical columns whose variance is less than the variance value.
* False: do nothing.
The default variance_threshold is False.
The default variance is 0.0.
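Conceptually, this step drops every numerical column whose variance falls below the variance value. A minimal pandas sketch of the idea (an illustration only, not the actual clean_ml implementation):

# select the numerical columns, then keep those whose variance is at least 6.0
numeric = training_df.select_dtypes("number")
kept = numeric.loc[:, numeric.var() >= 6.0]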
[42]:
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    variance_threshold=True,
    variance=6.0,
)
[43]:
cleaned_training_df
34189 rows × 11 columns
[44]:
cleaned_test_df
14653 rows × 11 columns
num_scaling
There are three choices for the num_scaling parameter:

* standardize: standardize each numerical column using the mean and standard deviation of the column. The transformation is (x - mean) / std.
* minmax: scale each numerical column using the min and max values of the column. The transformation is (x - min) / (max - min).
* maxabs: scale each numerical column using the maximum absolute value of the column. The transformation is x / maxabs.
The default num_scaling is standardize.
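To make the three formulas concrete, here is a small standalone pandas sketch (independent of clean_ml) that applies each transformation to a toy column:

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

standardized = (s - s.mean()) / s.std()       # (x - mean) / std
minmax = (s - s.min()) / (s.max() - s.min())  # (x - min) / (max - min)
maxabs = s / s.abs().max()                    # x / maxabs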
[55]:
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    cat_encoding='no_encoding',
    num_scaling="minmax",
)
[56]:
cleaned_training_df
[57]:
cleaned_test_df
include_operators and exclude_operators
The include_operators parameter indicates which operators must be included in the cleaning pipeline. It is a list. For example:

* ['one_hot', 'minmax', 'median', 'most_frequent']
The exclude_operators parameter indicates which operators must be excluded from the cleaning pipeline. It has the same format as include_operators.
The valid choices for include_operators and exclude_operators are:

* one_hot
* constant
* most_frequent
* drop
* mean
* median
* standardize
* minmax
* maxabs
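A rough sketch of how these parameters might be passed, following the call pattern of the earlier examples (the operator list here is illustrative):

# hypothetical example: restrict the pipeline to these operators
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    target="class",
    include_operators=['one_hot', 'minmax', 'median', 'most_frequent'],
)

exclude_operators takes a list in the same format.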
customized_cat_pipeline and customized_num_pipeline
Experienced users can specify their own customized_cat_pipeline and customized_num_pipeline. The two parameters are lists of dictionaries, one per component. Each component is a dictionary containing the name of the specified operator and its related parameters. For example:

[
    {"cat_imputation": {"operator": 'constant', "cat_null_value": ['?'], "fill_val": "Hahahaha!!!!!"}},
]
Users can also specify their own operators. They just need to define a class with an __init__ function and fit, transform, and fit_transform methods. The name of the class can then be put in the operator's position.
[58]:
from typing import Any

import pandas as pd

class MaxAbsScaler:
    """Scale a column by its maximum absolute value."""

    def __init__(self) -> None:
        self.name = "maxabsScaler"

    def fit(self, df: pd.Series) -> Any:
        # remember the maximum absolute value of the column
        self.maxabs = df.abs().max()
        return self

    def transform(self, df: pd.Series) -> pd.Series:
        return df.map(self.compute_val)

    def fit_transform(self, df: pd.Series) -> pd.Series:
        return self.fit(df).transform(df)

    def compute_val(self, val: float) -> float:
        return val / self.maxabs

customized_cat_pipeline = [
    {"cat_imputation": {"operator": 'constant', "cat_null_value": ['?'], "fill_val": "Hahahaha!!!!!"}},
]
customized_num_pipeline = [
    {"num_scaling": {"operator": MaxAbsScaler}},
]
cleaned_training_df, cleaned_test_df = clean_ml(
    training_df,
    test_df,
    customized_cat_pipeline=customized_cat_pipeline,
    customized_num_pipeline=customized_num_pipeline,
)
[59]:
cleaned_training_df
[60]:
cleaned_test_df