clean_ml(): Clean a dataset for downstream machine learning tasks

Introduction

The function clean_ml() cleans a dataset for downstream machine learning tasks using commonly used operators. It handles categorical columns and numerical columns separately. The default cleaning pipeline follows the conventions of existing tools.

Currently, the supported components and operators are listed below:

  • cat_encoding: encoding categorical columns

    • no_encoding

    • one_hot

  • cat_imputation: imputing missing values in categorical columns

    • constant

    • most_frequent

    • drop

  • num_imputation: imputing missing values in numerical columns

    • mean

    • median

    • most_frequent

    • drop

  • num_scaling: scaling numerical columns

    • standardize

    • minmax

    • maxabs

  • variance_threshold: dropping numerical columns with low variance

Users can also specify include_operators and exclude_operators to include or exclude any of the operators listed above. Users can also customize the pipeline with user-defined operators.
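As a rough illustration of the three num_scaling operators listed above, the sketch below writes out the conventional definitions of standardization, min-max scaling, and max-abs scaling in plain Python. It assumes clean_ml() follows these textbook formulas; the function bodies are illustrative and are not taken from the library, and they do not guard against constant columns (which would divide by zero).

```python
from statistics import fmean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def minmax(values):
    """Map the smallest value to 0 and the largest value to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def maxabs(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]


column = [1.0, 2.0, 3.0]
scaled_standard = standardize(column)
scaled_minmax = minmax(column)
scaled_maxabs = maxabs([-4.0, 2.0])
```

Min-max and max-abs scaling keep values in a bounded range, while standardization does not bound the range, which may matter when a column contains outliers.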

An example dataset

The example dataset is the classic adult dataset. It has 48842 rows and 15 columns. In this dataset, ‘?’ denotes a missing value.

[4]:
import pandas as pd
pd.set_option('display.min_rows', 30)
df = pd.read_csv('adult.csv')
df
[4]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 2 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 1 0 2 United-States <=50K
1 3 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 0 United-States <=50K
2 2 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 2 United-States <=50K
3 3 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 2 United-States <=50K
4 1 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 2 Cuba <=50K
5 2 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 2 United-States <=50K
6 3 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 0 Jamaica <=50K
7 3 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 2 United-States >50K
8 1 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 4 0 3 United-States >50K
9 2 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 2 0 2 United-States >50K
10 2 Private 280464 Some-college 10 Married-civ-spouse Exec-managerial Husband Black Male 0 0 4 United-States >50K
11 1 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 2 India >50K
12 0 Private 122272 Bachelors 13 Never-married Adm-clerical Own-child White Female 0 0 1 United-States <=50K
13 1 Private 205019 Assoc-acdm 12 Never-married Sales Not-in-family Black Male 0 0 3 United-States <=50K
14 2 Private 121772 Assoc-voc 11 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 0 0 2 ? >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 3 Private 224655 HS-grad 9 Separated Priv-house-serv Not-in-family White Female 0 0 1 United-States <=50K
48828 2 Private 247547 Assoc-voc 11 Never-married Adm-clerical Unmarried Black Female 0 0 2 United-States <=50K
48829 4 Private 292710 Assoc-acdm 12 Divorced Prof-specialty Not-in-family White Male 0 0 2 United-States <=50K
48830 1 Private 173449 HS-grad 9 Married-civ-spouse Handlers-cleaners Husband White Male 0 0 2 United-States <=50K
48831 3 Private 285570 HS-grad 9 Married-civ-spouse Adm-clerical Husband White Male 0 0 2 United-States <=50K
48832 4 Private 89686 HS-grad 9 Married-civ-spouse Sales Husband White Male 0 0 3 United-States <=50K
48833 1 Private 440129 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 2 United-States <=50K
48834 0 Private 350977 HS-grad 9 Never-married Other-service Own-child White Female 0 0 2 United-States <=50K
48835 3 Local-gov 349230 Masters 14 Divorced Other-service Not-in-family White Male 0 0 2 United-States <=50K
48836 1 Private 245211 Bachelors 13 Never-married Prof-specialty Own-child White Male 0 0 2 United-States <=50K
48837 2 Private 215419 Bachelors 13 Divorced Prof-specialty Not-in-family White Female 0 0 2 United-States <=50K
48838 4 ? 321403 HS-grad 9 Widowed ? Other-relative Black Male 0 0 2 United-States <=50K
48839 2 Private 374983 Bachelors 13 Married-civ-spouse Prof-specialty Husband White Male 0 0 3 United-States <=50K
48840 2 Private 83891 Bachelors 13 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 2 0 2 United-States <=50K
48841 1 Self-emp-inc 182148 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 3 United-States >50K

48842 rows × 15 columns

Split the dataset into a training dataframe and a test dataframe

[5]:
# Use the first 70% of the rows for training and the remaining rows for testing.
training_rate = 0.7
split_at = int(training_rate * len(df))
training_df = df.iloc[:split_at, :]
test_df = df.iloc[split_at:, :]
[6]:
training_df
[6]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 2 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 1 0 2 United-States <=50K
1 3 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 0 United-States <=50K
2 2 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 2 United-States <=50K
3 3 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 2 United-States <=50K
4 1 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 2 Cuba <=50K
5 2 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 2 United-States <=50K
6 3 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 0 Jamaica <=50K
7 3 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 2 United-States >50K
8 1 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 4 0 3 United-States >50K
9 2 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 2 0 2 United-States >50K
10 2 Private 280464 Some-college 10 Married-civ-spouse Exec-managerial Husband Black Male 0 0 4 United-States >50K
11 1 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 2 India >50K
12 0 Private 122272 Bachelors 13 Never-married Adm-clerical Own-child White Female 0 0 1 United-States <=50K
13 1 Private 205019 Assoc-acdm 12 Never-married Sales Not-in-family Black Male 0 0 3 United-States <=50K
14 2 Private 121772 Assoc-voc 11 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 0 0 2 ? >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 2 Private 173651 Some-college 10 Married-civ-spouse Prof-specialty Husband White Male 0 0 2 United-States >50K
34175 3 Private 149337 HS-grad 9 Separated Handlers-cleaners Not-in-family White Male 0 0 2 United-States <=50K
34176 4 Private 146674 HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 2 ? >50K
34177 4 Private 173483 Bachelors 13 Divorced Prof-specialty Not-in-family White Female 0 0 1 United-States <=50K
34178 0 Private 223669 11th 7 Never-married Other-service Own-child White Male 0 0 0 United-States <=50K
34179 3 Private 182177 Some-college 10 Divorced Protective-serv Unmarried White Female 0 0 1 United-States <=50K
34180 0 Private 109414 Some-college 10 Never-married Sales Other-relative Asian-Pac-Islander Male 0 0 0 India <=50K
34181 3 Self-emp-inc 150917 Some-college 10 Married-civ-spouse Exec-managerial Husband White Male 0 3 2 United-States >50K
34182 4 Self-emp-not-inc 39128 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 1 United-States <=50K
34183 3 Local-gov 103540 Bachelors 13 Married-civ-spouse Prof-specialty Husband White Male 0 0 3 United-States >50K
34184 4 Private 110212 HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 2 United-States <=50K
34185 2 Private 222450 HS-grad 9 Never-married Sales Not-in-family White Male 0 4 2 El-Salvador <=50K
34186 0 ? 113760 HS-grad 9 Never-married ? Own-child White Female 0 0 2 United-States <=50K
34187 2 ? 253717 11th 7 Married-civ-spouse ? Wife White Female 0 0 0 United-States <=50K
34188 0 Private 306908 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 2 United-States <=50K

34189 rows × 15 columns

[7]:
test_df
[7]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 2 Self-emp-not-inc 263871 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 3 United-States <=50K
34190 2 State-gov 55294 Bachelors 13 Married-civ-spouse Prof-specialty Husband White Male 0 0 2 United-States >50K
34191 0 Private 174063 Assoc-voc 11 Never-married Other-service Own-child White Female 0 0 0 United-States <=50K
34192 3 State-gov 258735 Some-college 10 Married-civ-spouse Exec-managerial Husband White Male 0 0 2 United-States >50K
34193 3 Private 275867 HS-grad 9 Married-civ-spouse Prof-specialty Husband White Male 0 0 2 United-States <=50K
34194 0 Private 154235 Some-college 10 Never-married Sales Own-child White Female 0 0 1 United-States <=50K
34195 1 Local-gov 210448 Some-college 10 Married-civ-spouse Craft-repair Other-relative White Male 0 0 2 United-States <=50K
34196 1 Private 337908 Some-college 10 Divorced Adm-clerical Unmarried Black Female 0 0 1 United-States <=50K
34197 1 State-gov 205333 Bachelors 13 Never-married Prof-specialty Not-in-family White Female 0 0 0 United-States <=50K
34198 0 Private 187447 Some-college 10 Separated Other-service Own-child White Male 0 0 2 United-States <=50K
34199 1 Private 153589 9th 5 Separated Craft-repair Not-in-family White Male 0 0 2 United-States <=50K
34200 1 Local-gov 149988 Some-college 10 Divorced Adm-clerical Unmarried White Female 0 0 2 United-States <=50K
34201 2 Private 398959 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 2 United-States <=50K
34202 0 ? 194096 Some-college 10 Never-married ? Own-child White Female 0 0 1 United-States <=50K
34203 2 Private 44041 Assoc-acdm 12 Married-spouse-absent Adm-clerical Other-relative White Male 0 0 3 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 3 Private 224655 HS-grad 9 Separated Priv-house-serv Not-in-family White Female 0 0 1 United-States <=50K
48828 2 Private 247547 Assoc-voc 11 Never-married Adm-clerical Unmarried Black Female 0 0 2 United-States <=50K
48829 4 Private 292710 Assoc-acdm 12 Divorced Prof-specialty Not-in-family White Male 0 0 2 United-States <=50K
48830 1 Private 173449 HS-grad 9 Married-civ-spouse Handlers-cleaners Husband White Male 0 0 2 United-States <=50K
48831 3 Private 285570 HS-grad 9 Married-civ-spouse Adm-clerical Husband White Male 0 0 2 United-States <=50K
48832 4 Private 89686 HS-grad 9 Married-civ-spouse Sales Husband White Male 0 0 3 United-States <=50K
48833 1 Private 440129 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 2 United-States <=50K
48834 0 Private 350977 HS-grad 9 Never-married Other-service Own-child White Female 0 0 2 United-States <=50K
48835 3 Local-gov 349230 Masters 14 Divorced Other-service Not-in-family White Male 0 0 2 United-States <=50K
48836 1 Private 245211 Bachelors 13 Never-married Prof-specialty Own-child White Male 0 0 2 United-States <=50K
48837 2 Private 215419 Bachelors 13 Divorced Prof-specialty Not-in-family White Female 0 0 2 United-States <=50K
48838 4 ? 321403 HS-grad 9 Widowed ? Other-relative Black Male 0 0 2 United-States <=50K
48839 2 Private 374983 Bachelors 13 Married-civ-spouse Prof-specialty Husband White Male 0 0 3 United-States <=50K
48840 2 Private 83891 Bachelors 13 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 2 0 2 United-States <=50K
48841 1 Self-emp-inc 182148 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 3 United-States >50K

14653 rows × 15 columns
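The row counts of the two dataframes follow directly from the split arithmetic. The sketch below reproduces it on a plain list of row indices; split_rows is a hypothetical helper introduced only for illustration. Note that the split is positional: rows are not shuffled before being divided.

```python
def split_rows(rows, training_rate):
    """Split a sequence positionally: the first part for training, the rest for testing."""
    cut = int(training_rate * len(rows))  # int() truncates toward zero
    return rows[:cut], rows[cut:]


row_index = list(range(48842))  # same number of rows as the example dataset
train_rows, test_rows = split_rows(row_index, 0.7)

print(len(train_rows), len(test_rows))
```

Because 0.7 × 48842 = 34189.4 is truncated to 34189, the training part ends at index 34188 and the test part starts at index 34189.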

1. Default clean_ml()

By default, the cleaning pipeline of the clean_ml() function is:

  • For categorical columns: constant imputation -> one-hot encoding

  • For numerical columns: mean imputation -> standardization

The default NULL values are: {np.nan, float("NaN"), "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null", "", None}
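To make the default NULL set concrete, the following sketch checks whether a single value belongs to it. is_default_null is a hypothetical helper; how clean_ml() performs the matching internally is not shown in this section, so treat this as one plausible reading of the list above. NaN is tested with math.isnan because NaN never compares equal to itself.

```python
import math

# String markers copied from the default NULL list above.
DEFAULT_NULL_STRINGS = {
    "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a",
    "nan", "null", "",
}


def is_default_null(value):
    """Return True if value matches one of the default NULL markers."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value in DEFAULT_NULL_STRINGS


checks = {repr(v): is_default_null(v) for v in ["NA", "", None, float("nan"), "?", "Private"]}
```

Under this reading, the ‘?’ marker used by the adult dataset is not one of the default NULL values, so it would be handled like any other category unless it is declared as a NULL value separately.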

The default filling value for categorical columns is ‘missing_value’.
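The default pipeline described above can be sketched in plain Python for a single categorical column and a single numerical column. This is a minimal illustration, not the library's implementation: missing entries are represented as None, all statistics and categories are learned from the training column only, categories are ordered by first appearance, and unseen test categories are encoded as all zeros. Each of these choices is an assumption of the sketch.

```python
from statistics import fmean, pstdev

FILL_VALUE = "missing_value"  # default filling value for categorical columns


def clean_categorical(train_col, test_col):
    """Constant imputation followed by one-hot encoding."""
    train = [FILL_VALUE if v is None else v for v in train_col]
    test = [FILL_VALUE if v is None else v for v in test_col]
    # Categories come from the training column, in order of first appearance.
    categories = list(dict.fromkeys(train))

    def encode(value):
        return [1.0 if value == c else 0.0 for c in categories]

    return [encode(v) for v in train], [encode(v) for v in test]


def clean_numerical(train_col, test_col):
    """Mean imputation followed by standardization."""
    mean = fmean(v for v in train_col if v is not None)
    train = [mean if v is None else v for v in train_col]
    test = [mean if v is None else v for v in test_col]
    std = pstdev(train)  # population standard deviation of the imputed column
    return [(v - mean) / std for v in train], [(v - mean) / std for v in test]


cat_train, cat_test = clean_categorical(["a", None, "b"], ["b", "c"])
num_train, num_test = clean_numerical([1.0, None, 3.0], [2.0])
```

Fitting on the training column and reusing the learned values on the test column keeps information from the test data out of the preprocessing step.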

[8]:
from dataprep.clean import clean_ml
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class")
[9]:
cleaned_training_df
[9]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.181564 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.064247 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] 1.054765 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
1 0.955953 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.009237 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
2 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.246964 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
3 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.428035 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
4 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.412302 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
5 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.901345 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
6 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.279485 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... -1.968313 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
7 0.955953 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.189970 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
8 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.365494 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] 5.032415 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
9 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.286491 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] 2.380648 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
10 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.862254 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 2.292380 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
11 -0.592825 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.458800 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
12 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.639397 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
13 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.146086 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
14 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.644143 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.151677 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34175 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.382480 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34176 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.407759 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34177 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.153272 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34178 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.323123 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34179 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.070743 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34180 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.761452 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -2.185441 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34181 0.955953 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] -0.367482 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 5.199568 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34182 1.730342 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.428648 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34183 0.955953 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] -0.817212 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34184 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.753877 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34185 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.311551 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 7.001429 0.053470 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34186 -1.367214 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] -0.720198 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34187 0.181564 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 0.608356 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34188 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.113276 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K

34189 rows × 15 columns
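The standardized age values in the output above can be checked by hand. Assuming standardization means z = (x - mean) / std, two (raw, scaled) pairs are enough to recover the mean and standard deviation that were applied, and the remaining pairs should then be reproduced. The pairs below are copied from the original and cleaned training dataframes; the variable names are illustrative only.

```python
# Raw age code -> standardized value, as printed in the cleaned training output.
scaled_age = {0: -1.367214, 1: -0.592825, 2: 0.181564, 3: 0.955953, 4: 1.730342}

# With z = (x - mean) / std, one unit of raw age corresponds to a step of 1 / std.
step = scaled_age[3] - scaled_age[2]
std = 1.0 / step
mean = 2 - scaled_age[2] * std


def standardize(x):
    """Apply the recovered mean and standard deviation to a raw value."""
    return (x - mean) / std


# Largest gap between the recomputed values and the printed ones.
max_error = max(abs(standardize(x) - z) for x, z in scaled_age.items())
print(round(mean, 4), round(std, 4), max_error)
```

The recovered values are roughly mean ≈ 1.7655 and std ≈ 1.2913 for the age column. Whether the library uses the population or the sample standard deviation cannot be determined from these rounded values alone.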

[10]:
cleaned_test_df
[10]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.181564 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.704744 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34190 0.181564 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.275191 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34191 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.147766 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34192 0.955953 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.655990 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34193 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.818617 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34194 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.335985 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34195 -0.592825 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] 0.197621 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34196 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.407546 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34197 -0.592825 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.149067 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34198 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.020718 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34199 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.342117 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... -1.968313 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34200 -0.592825 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] -0.376300 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34201 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.987078 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34202 -1.367214 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 0.042399 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34203 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.382011 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.332483 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48828 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.549787 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48829 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.978500 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48830 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.153595 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48831 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.910723 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48832 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.948722 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48833 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 2.377888 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48834 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.531605 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48835 0.955953 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] 1.515021 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48836 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.527612 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48837 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.244809 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48838 1.730342 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 1.250871 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48839 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.759484 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48840 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.003732 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] 2.380648 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48841 -0.592825 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] -0.071019 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K

14653 rows × 15 columns

2. cat_imputation and cat_null_value parameters

There are three choices for the cat_imputation parameter:

  • constant: fill missing values with a constant value. The default is ‘missing_value’.

  • most_frequent: fill missing values with the most frequent value of the column.

  • drop: drop the column if it contains missing values.
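As a rough illustration of what these three strategies mean, the sketch below reproduces them with plain pandas on a toy frame. It is not the code clean_ml() runs internally; the toy data and variable names are made up for this example, and missing values are assumed to be encoded as NaN already.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column with a missing entry.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "sex": ["Male", "Female", "Female", "Male"],
})

# constant: replace each missing entry with a fixed string.
constant = toy.fillna("missing_value")

# most_frequent: replace each missing entry with the mode of its column.
# DataFrame.mode() returns one row per tied mode, so the first row is used.
most_frequent = toy.fillna(toy.mode().iloc[0])

# drop: remove every column that contains at least one missing entry.
dropped = toy.dropna(axis="columns")

print(constant["workclass"].tolist())
print(most_frequent["workclass"].tolist())
print(list(dropped.columns))
```

With "drop", the whole column is removed rather than the affected rows, which is why the outputs for that setting later in this section have fewer columns.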

The cat_null_value parameter is a list of user-specified null values. Its elements can be of any type. For example:

  • [‘?’]

  • [‘abc’, np.nan, ‘?’, 1265]
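A list of mixed-type null markers like the second example can be pictured as a membership test on each cell. The following pandas sketch shows the idea on a hypothetical column; it is an assumption about the general approach, not a description of how clean_ml() handles the list internally.

```python
import numpy as np
import pandas as pd

# Mixed-type null markers, as in the second example above.
null_values = ["abc", np.nan, "?", 1265]

# Hypothetical column containing some of those markers.
col = pd.Series(["Private", "?", 1265, "abc", None, "State-gov"], dtype="object")

# A cell counts as missing if it equals a listed marker or is already null.
is_missing = col.isin(null_values) | col.isna()

# Replace every missing cell with NaN so later imputation treats them alike.
normalized = col.mask(is_missing)

print(is_missing.tolist())
print(normalized.dropna().tolist())
```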

By default, the specified missing values are replaced by “missing_value”.

[18]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class",
                                                cat_imputation="constant",
                                                cat_encoding="no_encoding", cat_null_value=['?'])
[19]:
cleaned_training_df
[19]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.181564 State-gov -1.064247 Bachelors 1.132573 Never-married Adm-clerical Not-in-family White Male 1.054765 -0.206016 0.053470 United-States <=50K
1 0.955953 Self-emp-not-inc -1.009237 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 -2.185441 United-States <=50K
2 0.181564 Private 0.246964 HS-grad -0.417870 Divorced Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
3 0.955953 Private 0.428035 11th -1.193092 Married-civ-spouse Handlers-cleaners Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
4 -0.592825 Private 1.412302 Bachelors 1.132573 Married-civ-spouse Prof-specialty Wife Black Female -0.271118 -0.206016 0.053470 Cuba <=50K
5 0.181564 Private 0.901345 Masters 1.520184 Married-civ-spouse Exec-managerial Wife White Female -0.271118 -0.206016 0.053470 United-States <=50K
6 0.955953 Private -0.279485 9th -1.968313 Married-spouse-absent Other-service Not-in-family Black Female -0.271118 -0.206016 -2.185441 Jamaica <=50K
7 0.955953 Self-emp-not-inc 0.189970 HS-grad -0.417870 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
8 -0.592825 Private -1.365494 Masters 1.520184 Never-married Prof-specialty Not-in-family White Female 5.032415 -0.206016 1.172925 United-States >50K
9 0.181564 Private -0.286491 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male 2.380648 -0.206016 0.053470 United-States >50K
10 0.181564 Private 0.862254 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband Black Male -0.271118 -0.206016 2.292380 United-States >50K
11 -0.592825 State-gov -0.458800 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 India >50K
12 -1.367214 Private -0.639397 Bachelors 1.132573 Never-married Adm-clerical Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
13 -0.592825 Private 0.146086 Assoc-acdm 0.744962 Never-married Sales Not-in-family Black Male -0.271118 -0.206016 1.172925 United-States <=50K
14 0.181564 Private -0.644143 Assoc-voc 0.357352 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 missing_value >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.181564 Private -0.151677 Some-college -0.030259 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34175 0.955953 Private -0.382480 HS-grad -0.417870 Separated Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34176 1.730342 Private -0.407759 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 missing_value >50K
34177 1.730342 Private -0.153272 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34178 -1.367214 Private 0.323123 11th -1.193092 Never-married Other-service Own-child White Male -0.271118 -0.206016 -2.185441 United-States <=50K
34179 0.955953 Private -0.070743 Some-college -0.030259 Divorced Protective-serv Unmarried White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34180 -1.367214 Private -0.761452 Some-college -0.030259 Never-married Sales Other-relative Asian-Pac-Islander Male -0.271118 -0.206016 -2.185441 India <=50K
34181 0.955953 Self-emp-inc -0.367482 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 5.199568 0.053470 United-States >50K
34182 1.730342 Self-emp-not-inc -1.428648 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 -1.065986 United-States <=50K
34183 0.955953 Local-gov -0.817212 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States >50K
34184 1.730342 Private -0.753877 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
34185 0.181564 Private 0.311551 HS-grad -0.417870 Never-married Sales Not-in-family White Male -0.271118 7.001429 0.053470 El-Salvador <=50K
34186 -1.367214 missing_value -0.720198 HS-grad -0.417870 Never-married missing_value Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
34187 0.181564 missing_value 0.608356 11th -1.193092 Married-civ-spouse missing_value Wife White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34188 -1.367214 Private 1.113276 HS-grad -0.417870 Married-civ-spouse Machine-op-inspct Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K

34189 rows × 15 columns

[20]:
cleaned_test_df
[20]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.181564 Self-emp-not-inc 0.704744 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
34190 0.181564 State-gov -1.275191 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34191 -1.367214 Private -0.147766 Assoc-voc 0.357352 Never-married Other-service Own-child White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34192 0.955953 State-gov 0.655990 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34193 0.955953 Private 0.818617 HS-grad -0.417870 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34194 -1.367214 Private -0.335985 Some-college -0.030259 Never-married Sales Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34195 -0.592825 Local-gov 0.197621 Some-college -0.030259 Married-civ-spouse Craft-repair Other-relative White Male -0.271118 -0.206016 0.053470 United-States <=50K
34196 -0.592825 Private 1.407546 Some-college -0.030259 Divorced Adm-clerical Unmarried Black Female -0.271118 -0.206016 -1.065986 United-States <=50K
34197 -0.592825 State-gov 0.149067 Bachelors 1.132573 Never-married Prof-specialty Not-in-family White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34198 -1.367214 Private -0.020718 Some-college -0.030259 Separated Other-service Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
34199 -0.592825 Private -0.342117 9th -1.968313 Separated Craft-repair Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34200 -0.592825 Local-gov -0.376300 Some-college -0.030259 Divorced Adm-clerical Unmarried White Female -0.271118 -0.206016 0.053470 United-States <=50K
34201 0.181564 Private 1.987078 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34202 -1.367214 missing_value 0.042399 Some-college -0.030259 Never-married missing_value Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34203 0.181564 Private -1.382011 Assoc-acdm 0.744962 Married-spouse-absent Adm-clerical Other-relative White Male -0.271118 -0.206016 1.172925 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.955953 Private 0.332483 HS-grad -0.417870 Separated Priv-house-serv Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
48828 0.181564 Private 0.549787 Assoc-voc 0.357352 Never-married Adm-clerical Unmarried Black Female -0.271118 -0.206016 0.053470 United-States <=50K
48829 1.730342 Private 0.978500 Assoc-acdm 0.744962 Divorced Prof-specialty Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48830 -0.592825 Private -0.153595 HS-grad -0.417870 Married-civ-spouse Handlers-cleaners Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48831 0.955953 Private 0.910723 HS-grad -0.417870 Married-civ-spouse Adm-clerical Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48832 1.730342 Private -0.948722 HS-grad -0.417870 Married-civ-spouse Sales Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48833 -0.592825 Private 2.377888 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48834 -1.367214 Private 1.531605 HS-grad -0.417870 Never-married Other-service Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
48835 0.955953 Local-gov 1.515021 Masters 1.520184 Divorced Other-service Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48836 -0.592825 Private 0.527612 Bachelors 1.132573 Never-married Prof-specialty Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
48837 0.181564 Private 0.244809 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 0.053470 United-States <=50K
48838 1.730342 missing_value 1.250871 HS-grad -0.417870 Widowed missing_value Other-relative Black Male -0.271118 -0.206016 0.053470 United-States <=50K
48839 0.181564 Private 1.759484 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48840 0.181564 Private -1.003732 Bachelors 1.132573 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 2.380648 -0.206016 0.053470 United-States <=50K
48841 -0.592825 Self-emp-inc -0.071019 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 1.172925 United-States >50K

14653 rows × 15 columns

[21]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class",
                                                cat_imputation="most_frequent",
                                                cat_encoding="no_encoding", cat_null_value=['?'])
[22]:
cleaned_training_df
[22]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.181564 State-gov -1.064247 Bachelors 1.132573 Never-married Adm-clerical Not-in-family White Male 1.054765 -0.206016 0.053470 United-States <=50K
1 0.955953 Self-emp-not-inc -1.009237 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 -2.185441 United-States <=50K
2 0.181564 Private 0.246964 HS-grad -0.417870 Divorced Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
3 0.955953 Private 0.428035 11th -1.193092 Married-civ-spouse Handlers-cleaners Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
4 -0.592825 Private 1.412302 Bachelors 1.132573 Married-civ-spouse Prof-specialty Wife Black Female -0.271118 -0.206016 0.053470 Cuba <=50K
5 0.181564 Private 0.901345 Masters 1.520184 Married-civ-spouse Exec-managerial Wife White Female -0.271118 -0.206016 0.053470 United-States <=50K
6 0.955953 Private -0.279485 9th -1.968313 Married-spouse-absent Other-service Not-in-family Black Female -0.271118 -0.206016 -2.185441 Jamaica <=50K
7 0.955953 Self-emp-not-inc 0.189970 HS-grad -0.417870 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
8 -0.592825 Private -1.365494 Masters 1.520184 Never-married Prof-specialty Not-in-family White Female 5.032415 -0.206016 1.172925 United-States >50K
9 0.181564 Private -0.286491 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male 2.380648 -0.206016 0.053470 United-States >50K
10 0.181564 Private 0.862254 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband Black Male -0.271118 -0.206016 2.292380 United-States >50K
11 -0.592825 State-gov -0.458800 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 India >50K
12 -1.367214 Private -0.639397 Bachelors 1.132573 Never-married Adm-clerical Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
13 -0.592825 Private 0.146086 Assoc-acdm 0.744962 Never-married Sales Not-in-family Black Male -0.271118 -0.206016 1.172925 United-States <=50K
14 0.181564 Private -0.644143 Assoc-voc 0.357352 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 United-States >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.181564 Private -0.151677 Some-college -0.030259 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34175 0.955953 Private -0.382480 HS-grad -0.417870 Separated Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34176 1.730342 Private -0.407759 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 United-States >50K
34177 1.730342 Private -0.153272 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34178 -1.367214 Private 0.323123 11th -1.193092 Never-married Other-service Own-child White Male -0.271118 -0.206016 -2.185441 United-States <=50K
34179 0.955953 Private -0.070743 Some-college -0.030259 Divorced Protective-serv Unmarried White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34180 -1.367214 Private -0.761452 Some-college -0.030259 Never-married Sales Other-relative Asian-Pac-Islander Male -0.271118 -0.206016 -2.185441 India <=50K
34181 0.955953 Self-emp-inc -0.367482 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 5.199568 0.053470 United-States >50K
34182 1.730342 Self-emp-not-inc -1.428648 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 -1.065986 United-States <=50K
34183 0.955953 Local-gov -0.817212 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States >50K
34184 1.730342 Private -0.753877 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
34185 0.181564 Private 0.311551 HS-grad -0.417870 Never-married Sales Not-in-family White Male -0.271118 7.001429 0.053470 El-Salvador <=50K
34186 -1.367214 Private -0.720198 HS-grad -0.417870 Never-married Prof-specialty Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
34187 0.181564 Private 0.608356 11th -1.193092 Married-civ-spouse Prof-specialty Wife White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34188 -1.367214 Private 1.113276 HS-grad -0.417870 Married-civ-spouse Machine-op-inspct Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K

34189 rows × 15 columns

[23]:
cleaned_test_df
[23]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.181564 Self-emp-not-inc 0.704744 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
34190 0.181564 State-gov -1.275191 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34191 -1.367214 Private -0.147766 Assoc-voc 0.357352 Never-married Other-service Own-child White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34192 0.955953 State-gov 0.655990 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34193 0.955953 Private 0.818617 HS-grad -0.417870 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34194 -1.367214 Private -0.335985 Some-college -0.030259 Never-married Sales Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34195 -0.592825 Local-gov 0.197621 Some-college -0.030259 Married-civ-spouse Craft-repair Other-relative White Male -0.271118 -0.206016 0.053470 United-States <=50K
34196 -0.592825 Private 1.407546 Some-college -0.030259 Divorced Adm-clerical Unmarried Black Female -0.271118 -0.206016 -1.065986 United-States <=50K
34197 -0.592825 State-gov 0.149067 Bachelors 1.132573 Never-married Prof-specialty Not-in-family White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34198 -1.367214 Private -0.020718 Some-college -0.030259 Separated Other-service Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
34199 -0.592825 Private -0.342117 9th -1.968313 Separated Craft-repair Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34200 -0.592825 Local-gov -0.376300 Some-college -0.030259 Divorced Adm-clerical Unmarried White Female -0.271118 -0.206016 0.053470 United-States <=50K
34201 0.181564 Private 1.987078 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34202 -1.367214 Private 0.042399 Some-college -0.030259 Never-married Prof-specialty Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34203 0.181564 Private -1.382011 Assoc-acdm 0.744962 Married-spouse-absent Adm-clerical Other-relative White Male -0.271118 -0.206016 1.172925 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.955953 Private 0.332483 HS-grad -0.417870 Separated Priv-house-serv Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
48828 0.181564 Private 0.549787 Assoc-voc 0.357352 Never-married Adm-clerical Unmarried Black Female -0.271118 -0.206016 0.053470 United-States <=50K
48829 1.730342 Private 0.978500 Assoc-acdm 0.744962 Divorced Prof-specialty Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48830 -0.592825 Private -0.153595 HS-grad -0.417870 Married-civ-spouse Handlers-cleaners Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48831 0.955953 Private 0.910723 HS-grad -0.417870 Married-civ-spouse Adm-clerical Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48832 1.730342 Private -0.948722 HS-grad -0.417870 Married-civ-spouse Sales Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48833 -0.592825 Private 2.377888 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48834 -1.367214 Private 1.531605 HS-grad -0.417870 Never-married Other-service Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
48835 0.955953 Local-gov 1.515021 Masters 1.520184 Divorced Other-service Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48836 -0.592825 Private 0.527612 Bachelors 1.132573 Never-married Prof-specialty Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
48837 0.181564 Private 0.244809 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 0.053470 United-States <=50K
48838 1.730342 Private 1.250871 HS-grad -0.417870 Widowed Prof-specialty Other-relative Black Male -0.271118 -0.206016 0.053470 United-States <=50K
48839 0.181564 Private 1.759484 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48840 0.181564 Private -1.003732 Bachelors 1.132573 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 2.380648 -0.206016 0.053470 United-States <=50K
48841 -0.592825 Self-emp-inc -0.071019 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 1.172925 United-States >50K

14653 rows × 15 columns

[24]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class",
                                                cat_imputation="drop",
                                                cat_encoding="no_encoding", cat_null_value=['?'])
[25]:
cleaned_training_df
[25]:
age fnlwgt education education-num marital-status relationship race sex capitalgain capitalloss hoursperweek class
0 0.181564 -1.064247 Bachelors 1.132573 Never-married Not-in-family White Male 1.054765 -0.206016 0.053470 <=50K
1 0.955953 -1.009237 Bachelors 1.132573 Married-civ-spouse Husband White Male -0.271118 -0.206016 -2.185441 <=50K
2 0.181564 0.246964 HS-grad -0.417870 Divorced Not-in-family White Male -0.271118 -0.206016 0.053470 <=50K
3 0.955953 0.428035 11th -1.193092 Married-civ-spouse Husband Black Male -0.271118 -0.206016 0.053470 <=50K
4 -0.592825 1.412302 Bachelors 1.132573 Married-civ-spouse Wife Black Female -0.271118 -0.206016 0.053470 <=50K
5 0.181564 0.901345 Masters 1.520184 Married-civ-spouse Wife White Female -0.271118 -0.206016 0.053470 <=50K
6 0.955953 -0.279485 9th -1.968313 Married-spouse-absent Not-in-family Black Female -0.271118 -0.206016 -2.185441 <=50K
7 0.955953 0.189970 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 >50K
8 -0.592825 -1.365494 Masters 1.520184 Never-married Not-in-family White Female 5.032415 -0.206016 1.172925 >50K
9 0.181564 -0.286491 Bachelors 1.132573 Married-civ-spouse Husband White Male 2.380648 -0.206016 0.053470 >50K
10 0.181564 0.862254 Some-college -0.030259 Married-civ-spouse Husband Black Male -0.271118 -0.206016 2.292380 >50K
11 -0.592825 -0.458800 Bachelors 1.132573 Married-civ-spouse Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 >50K
12 -1.367214 -0.639397 Bachelors 1.132573 Never-married Own-child White Female -0.271118 -0.206016 -1.065986 <=50K
13 -0.592825 0.146086 Assoc-acdm 0.744962 Never-married Not-in-family Black Male -0.271118 -0.206016 1.172925 <=50K
14 0.181564 -0.644143 Assoc-voc 0.357352 Married-civ-spouse Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 >50K
... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.181564 -0.151677 Some-college -0.030259 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 >50K
34175 0.955953 -0.382480 HS-grad -0.417870 Separated Not-in-family White Male -0.271118 -0.206016 0.053470 <=50K
34176 1.730342 -0.407759 HS-grad -0.417870 Married-civ-spouse Husband Black Male -0.271118 -0.206016 0.053470 >50K
34177 1.730342 -0.153272 Bachelors 1.132573 Divorced Not-in-family White Female -0.271118 -0.206016 -1.065986 <=50K
34178 -1.367214 0.323123 11th -1.193092 Never-married Own-child White Male -0.271118 -0.206016 -2.185441 <=50K
34179 0.955953 -0.070743 Some-college -0.030259 Divorced Unmarried White Female -0.271118 -0.206016 -1.065986 <=50K
34180 -1.367214 -0.761452 Some-college -0.030259 Never-married Other-relative Asian-Pac-Islander Male -0.271118 -0.206016 -2.185441 <=50K
34181 0.955953 -0.367482 Some-college -0.030259 Married-civ-spouse Husband White Male -0.271118 5.199568 0.053470 >50K
34182 1.730342 -1.428648 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 -1.065986 <=50K
34183 0.955953 -0.817212 Bachelors 1.132573 Married-civ-spouse Husband White Male -0.271118 -0.206016 1.172925 >50K
34184 1.730342 -0.753877 HS-grad -0.417870 Married-civ-spouse Husband Black Male -0.271118 -0.206016 0.053470 <=50K
34185 0.181564 0.311551 HS-grad -0.417870 Never-married Not-in-family White Male -0.271118 7.001429 0.053470 <=50K
34186 -1.367214 -0.720198 HS-grad -0.417870 Never-married Own-child White Female -0.271118 -0.206016 0.053470 <=50K
34187 0.181564 0.608356 11th -1.193092 Married-civ-spouse Wife White Female -0.271118 -0.206016 -2.185441 <=50K
34188 -1.367214 1.113276 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 <=50K

34189 rows × 12 columns

[26]:
cleaned_test_df
[26]:
age fnlwgt education education-num marital-status relationship race sex capitalgain capitalloss hoursperweek class
34189 0.181564 0.704744 Some-college -0.030259 Married-civ-spouse Husband White Male -0.271118 -0.206016 1.172925 <=50K
34190 0.181564 -1.275191 Bachelors 1.132573 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 >50K
34191 -1.367214 -0.147766 Assoc-voc 0.357352 Never-married Own-child White Female -0.271118 -0.206016 -2.185441 <=50K
34192 0.955953 0.655990 Some-college -0.030259 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 >50K
34193 0.955953 0.818617 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 <=50K
34194 -1.367214 -0.335985 Some-college -0.030259 Never-married Own-child White Female -0.271118 -0.206016 -1.065986 <=50K
34195 -0.592825 0.197621 Some-college -0.030259 Married-civ-spouse Other-relative White Male -0.271118 -0.206016 0.053470 <=50K
34196 -0.592825 1.407546 Some-college -0.030259 Divorced Unmarried Black Female -0.271118 -0.206016 -1.065986 <=50K
34197 -0.592825 0.149067 Bachelors 1.132573 Never-married Not-in-family White Female -0.271118 -0.206016 -2.185441 <=50K
34198 -1.367214 -0.020718 Some-college -0.030259 Separated Own-child White Male -0.271118 -0.206016 0.053470 <=50K
34199 -0.592825 -0.342117 9th -1.968313 Separated Not-in-family White Male -0.271118 -0.206016 0.053470 <=50K
34200 -0.592825 -0.376300 Some-college -0.030259 Divorced Unmarried White Female -0.271118 -0.206016 0.053470 <=50K
34201 0.181564 1.987078 Some-college -0.030259 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 <=50K
34202 -1.367214 0.042399 Some-college -0.030259 Never-married Own-child White Female -0.271118 -0.206016 -1.065986 <=50K
34203 0.181564 -1.382011 Assoc-acdm 0.744962 Married-spouse-absent Other-relative White Male -0.271118 -0.206016 1.172925 <=50K
... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.955953 0.332483 HS-grad -0.417870 Separated Not-in-family White Female -0.271118 -0.206016 -1.065986 <=50K
48828 0.181564 0.549787 Assoc-voc 0.357352 Never-married Unmarried Black Female -0.271118 -0.206016 0.053470 <=50K
48829 1.730342 0.978500 Assoc-acdm 0.744962 Divorced Not-in-family White Male -0.271118 -0.206016 0.053470 <=50K
48830 -0.592825 -0.153595 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 <=50K
48831 0.955953 0.910723 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 <=50K
48832 1.730342 -0.948722 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 1.172925 <=50K
48833 -0.592825 2.377888 HS-grad -0.417870 Married-civ-spouse Husband White Male -0.271118 -0.206016 0.053470 <=50K
48834 -1.367214 1.531605 HS-grad -0.417870 Never-married Own-child White Female -0.271118 -0.206016 0.053470 <=50K
48835 0.955953 1.515021 Masters 1.520184 Divorced Not-in-family White Male -0.271118 -0.206016 0.053470 <=50K
48836 -0.592825 0.527612 Bachelors 1.132573 Never-married Own-child White Male -0.271118 -0.206016 0.053470 <=50K
48837 0.181564 0.244809 Bachelors 1.132573 Divorced Not-in-family White Female -0.271118 -0.206016 0.053470 <=50K
48838 1.730342 1.250871 HS-grad -0.417870 Widowed Other-relative Black Male -0.271118 -0.206016 0.053470 <=50K
48839 0.181564 1.759484 Bachelors 1.132573 Married-civ-spouse Husband White Male -0.271118 -0.206016 1.172925 <=50K
48840 0.181564 -1.003732 Bachelors 1.132573 Divorced Own-child Asian-Pac-Islander Male 2.380648 -0.206016 0.053470 <=50K
48841 -0.592825 -0.071019 Bachelors 1.132573 Married-civ-spouse Husband White Male -0.271118 -0.206016 1.172925 >50K

14653 rows × 12 columns

3. fill_val parameter

By default, missing values in categorical columns are filled with “missing_value”. However, users can replace this with any string they like, such as "missing", "NaN", "I'm a cat.", or "Fyodor Dostoyevsky".

[30]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class",
                                                cat_null_value=['?'], cat_encoding="no_encoding",
                                                fill_val="AHAHAHAHAHA!!!")
[31]:
cleaned_training_df
[31]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.181564 State-gov -1.064247 Bachelors 1.132573 Never-married Adm-clerical Not-in-family White Male 1.054765 -0.206016 0.053470 United-States <=50K
1 0.955953 Self-emp-not-inc -1.009237 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 -2.185441 United-States <=50K
2 0.181564 Private 0.246964 HS-grad -0.417870 Divorced Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
3 0.955953 Private 0.428035 11th -1.193092 Married-civ-spouse Handlers-cleaners Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
4 -0.592825 Private 1.412302 Bachelors 1.132573 Married-civ-spouse Prof-specialty Wife Black Female -0.271118 -0.206016 0.053470 Cuba <=50K
5 0.181564 Private 0.901345 Masters 1.520184 Married-civ-spouse Exec-managerial Wife White Female -0.271118 -0.206016 0.053470 United-States <=50K
6 0.955953 Private -0.279485 9th -1.968313 Married-spouse-absent Other-service Not-in-family Black Female -0.271118 -0.206016 -2.185441 Jamaica <=50K
7 0.955953 Self-emp-not-inc 0.189970 HS-grad -0.417870 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
8 -0.592825 Private -1.365494 Masters 1.520184 Never-married Prof-specialty Not-in-family White Female 5.032415 -0.206016 1.172925 United-States >50K
9 0.181564 Private -0.286491 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male 2.380648 -0.206016 0.053470 United-States >50K
10 0.181564 Private 0.862254 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband Black Male -0.271118 -0.206016 2.292380 United-States >50K
11 -0.592825 State-gov -0.458800 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 India >50K
12 -1.367214 Private -0.639397 Bachelors 1.132573 Never-married Adm-clerical Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
13 -0.592825 Private 0.146086 Assoc-acdm 0.744962 Never-married Sales Not-in-family Black Male -0.271118 -0.206016 1.172925 United-States <=50K
14 0.181564 Private -0.644143 Assoc-voc 0.357352 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 AHAHAHAHAHA!!! >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.181564 Private -0.151677 Some-college -0.030259 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34175 0.955953 Private -0.382480 HS-grad -0.417870 Separated Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34176 1.730342 Private -0.407759 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 AHAHAHAHAHA!!! >50K
34177 1.730342 Private -0.153272 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34178 -1.367214 Private 0.323123 11th -1.193092 Never-married Other-service Own-child White Male -0.271118 -0.206016 -2.185441 United-States <=50K
34179 0.955953 Private -0.070743 Some-college -0.030259 Divorced Protective-serv Unmarried White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34180 -1.367214 Private -0.761452 Some-college -0.030259 Never-married Sales Other-relative Asian-Pac-Islander Male -0.271118 -0.206016 -2.185441 India <=50K
34181 0.955953 Self-emp-inc -0.367482 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 5.199568 0.053470 United-States >50K
34182 1.730342 Self-emp-not-inc -1.428648 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 -1.065986 United-States <=50K
34183 0.955953 Local-gov -0.817212 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States >50K
34184 1.730342 Private -0.753877 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
34185 0.181564 Private 0.311551 HS-grad -0.417870 Never-married Sales Not-in-family White Male -0.271118 7.001429 0.053470 El-Salvador <=50K
34186 -1.367214 AHAHAHAHAHA!!! -0.720198 HS-grad -0.417870 Never-married AHAHAHAHAHA!!! Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
34187 0.181564 AHAHAHAHAHA!!! 0.608356 11th -1.193092 Married-civ-spouse AHAHAHAHAHA!!! Wife White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34188 -1.367214 Private 1.113276 HS-grad -0.417870 Married-civ-spouse Machine-op-inspct Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K

34189 rows × 15 columns

[32]:
cleaned_test_df
[32]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.181564 Self-emp-not-inc 0.704744 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
34190 0.181564 State-gov -1.275191 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34191 -1.367214 Private -0.147766 Assoc-voc 0.357352 Never-married Other-service Own-child White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34192 0.955953 State-gov 0.655990 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34193 0.955953 Private 0.818617 HS-grad -0.417870 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34194 -1.367214 Private -0.335985 Some-college -0.030259 Never-married Sales Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34195 -0.592825 Local-gov 0.197621 Some-college -0.030259 Married-civ-spouse Craft-repair Other-relative White Male -0.271118 -0.206016 0.053470 United-States <=50K
34196 -0.592825 Private 1.407546 Some-college -0.030259 Divorced Adm-clerical Unmarried Black Female -0.271118 -0.206016 -1.065986 United-States <=50K
34197 -0.592825 State-gov 0.149067 Bachelors 1.132573 Never-married Prof-specialty Not-in-family White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34198 -1.367214 Private -0.020718 Some-college -0.030259 Separated Other-service Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
34199 -0.592825 Private -0.342117 9th -1.968313 Separated Craft-repair Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34200 -0.592825 Local-gov -0.376300 Some-college -0.030259 Divorced Adm-clerical Unmarried White Female -0.271118 -0.206016 0.053470 United-States <=50K
34201 0.181564 Private 1.987078 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34202 -1.367214 AHAHAHAHAHA!!! 0.042399 Some-college -0.030259 Never-married AHAHAHAHAHA!!! Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34203 0.181564 Private -1.382011 Assoc-acdm 0.744962 Married-spouse-absent Adm-clerical Other-relative White Male -0.271118 -0.206016 1.172925 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.955953 Private 0.332483 HS-grad -0.417870 Separated Priv-house-serv Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
48828 0.181564 Private 0.549787 Assoc-voc 0.357352 Never-married Adm-clerical Unmarried Black Female -0.271118 -0.206016 0.053470 United-States <=50K
48829 1.730342 Private 0.978500 Assoc-acdm 0.744962 Divorced Prof-specialty Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48830 -0.592825 Private -0.153595 HS-grad -0.417870 Married-civ-spouse Handlers-cleaners Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48831 0.955953 Private 0.910723 HS-grad -0.417870 Married-civ-spouse Adm-clerical Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48832 1.730342 Private -0.948722 HS-grad -0.417870 Married-civ-spouse Sales Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48833 -0.592825 Private 2.377888 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48834 -1.367214 Private 1.531605 HS-grad -0.417870 Never-married Other-service Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
48835 0.955953 Local-gov 1.515021 Masters 1.520184 Divorced Other-service Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48836 -0.592825 Private 0.527612 Bachelors 1.132573 Never-married Prof-specialty Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
48837 0.181564 Private 0.244809 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 0.053470 United-States <=50K
48838 1.730342 AHAHAHAHAHA!!! 1.250871 HS-grad -0.417870 Widowed AHAHAHAHAHA!!! Other-relative Black Male -0.271118 -0.206016 0.053470 United-States <=50K
48839 0.181564 Private 1.759484 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48840 0.181564 Private -1.003732 Bachelors 1.132573 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 2.380648 -0.206016 0.053470 United-States <=50K
48841 -0.592825 Self-emp-inc -0.071019 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 1.172925 United-States >50K

14653 rows × 15 columns
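
To check where the custom fill value ended up, one option is to count the cells equal to it in each column. The snippet below is a minimal sketch in plain pandas; the helper name count_fill_values and the toy frame are hypothetical and only mimic a few rows of the output above.

```python
import pandas as pd


def count_fill_values(df: pd.DataFrame, fill_val: str) -> pd.Series:
    """Count, per column, how many cells equal fill_val (hypothetical helper)."""
    counts = df.isin([fill_val]).sum()
    # Keep only the columns that actually contain the fill value.
    return counts[counts > 0]


# Toy frame that imitates the shape of the cleaned output; not the real data.
toy = pd.DataFrame({
    "age": [0.18, -1.37, 0.18],
    "workclass": ["Private", "AHAHAHAHAHA!!!", "AHAHAHAHAHA!!!"],
    "native-country": ["AHAHAHAHAHA!!!", "United-States", "United-States"],
})

print(count_fill_values(toy, "AHAHAHAHAHA!!!"))
```

The same call should work on cleaned_training_df and cleaned_test_df, since it relies only on DataFrame.isin.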

4. num_imputation and num_null_value parameter

There are four choices for the num_imputation parameter:

  • mean: fill missing values with the mean of the column.

  • median: fill missing values with the median of the column.

  • most_frequent: fill missing values with the most frequent value of the column.

  • drop: drop the column if it contains missing values.

The default null values are the same as those described in the cat_imputation section.

The imputation process works much like the one shown in the cat_imputation section, so the clean_ml calls are not repeated here.
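
Although the clean_ml calls are not repeated, the four strategies themselves can be pictured with plain pandas. The sketch below is only an illustration of what each option means for a single column. It is not how clean_ml is implemented, and the helper name impute_numeric and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def impute_numeric(df: pd.DataFrame, column: str, strategy: str) -> pd.DataFrame:
    """Apply one num_imputation-style strategy to a single column (illustrative only)."""
    out = df.copy()
    if strategy == "mean":
        out[column] = out[column].fillna(out[column].mean())
    elif strategy == "median":
        out[column] = out[column].fillna(out[column].median())
    elif strategy == "most_frequent":
        # mode() ignores NaN; take the first value if several are tied.
        out[column] = out[column].fillna(out[column].mode().iloc[0])
    elif strategy == "drop":
        # Remove the whole column when it contains any missing value.
        if out[column].isna().any():
            out = out.drop(columns=[column])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return out


# Toy data: one missing value in "age", none in "hoursperweek".
toy = pd.DataFrame({
    "age": [20.0, 30.0, 30.0, np.nan, 60.0],
    "hoursperweek": [40.0, 40.0, 40.0, 40.0, 40.0],
})

for strategy in ["mean", "median", "most_frequent", "drop"]:
    print(strategy, list(impute_numeric(toy, "age", strategy).columns))
```

With this toy column, mean fills the gap with 35.0, while median and most_frequent both fill it with 30.0.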

The num_null_value parameter is a list of user-specified null values. Its elements can be of any type. For example:

  • ['?']

  • ['abc', np.nan, '?', 1265]

The num_null_value parameter is used in the same way as the cat_null_value parameter, so the clean_ml calls are not repeated here.
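
To make the idea of user-specified null values concrete, the sketch below marks every listed value as missing in a raw column, using plain pandas. It is an illustration under stated assumptions rather than the library's own code: the helper name mark_nulls and the toy column are hypothetical.

```python
import numpy as np
import pandas as pd


def mark_nulls(col: pd.Series, null_values: list) -> pd.Series:
    """Turn every value listed in null_values into NaN (illustrative only)."""
    # mask() replaces the cells where the condition is True with NaN.
    marked = col.mask(col.isin(null_values))
    # The remaining entries are numeric, so the column can become float.
    return pd.to_numeric(marked)


# Toy raw column mixing numbers with several null markers.
raw = pd.Series([39, "?", 50, "abc", 1265, 28], dtype=object)
clean = mark_nulls(raw, ["abc", np.nan, "?", 1265])

print(clean)
```

Once the markers are NaN, any of the num_imputation strategies can treat them like ordinary missing values.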

5. cat_encoding parameter

There are two choices for the cat_encoding parameter:

  • no_encoding: leave categorical columns unencoded.

  • one_hot: apply one-hot encoding to categorical columns.

The default value is one_hot.
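
In the one_hot outputs further down, each categorical cell holds a list of 0.0 and 1.0 values. The sketch below shows one way such list-valued cells could be produced with plain pandas. It is not clean_ml's implementation: the helper names, ordering categories by first appearance in the training column, and encoding unseen test values as all zeros are assumptions made for this example.

```python
import pandas as pd


def fit_categories(train_col: pd.Series) -> list:
    """Collect categories in order of first appearance (assumed ordering)."""
    return list(pd.unique(train_col))


def one_hot(col: pd.Series, categories: list) -> pd.Series:
    """Encode each cell as a list with a single 1.0 at its category position."""
    position = {cat: i for i, cat in enumerate(categories)}

    def encode(value):
        vec = [0.0] * len(categories)
        # Values never seen during fitting stay all zeros (assumption).
        if value in position:
            vec[position[value]] = 1.0
        return vec

    return col.map(encode)


# Toy columns; the categories are learned from the training column only.
train = pd.Series(["State-gov", "Self-emp-not-inc", "Private", "Private"])
test = pd.Series(["Private", "Local-gov"])

categories = fit_categories(train)
print(one_hot(train, categories).tolist())
print(one_hot(test, categories).tolist())
```

Fitting on the training column and reusing the same categories for the test column keeps the vector length identical in both frames.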

[36]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", cat_encoding="no_encoding")
[37]:
cleaned_training_df
[37]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.181564 State-gov -1.064247 Bachelors 1.132573 Never-married Adm-clerical Not-in-family White Male 1.054765 -0.206016 0.053470 United-States <=50K
1 0.955953 Self-emp-not-inc -1.009237 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 -2.185441 United-States <=50K
2 0.181564 Private 0.246964 HS-grad -0.417870 Divorced Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
3 0.955953 Private 0.428035 11th -1.193092 Married-civ-spouse Handlers-cleaners Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
4 -0.592825 Private 1.412302 Bachelors 1.132573 Married-civ-spouse Prof-specialty Wife Black Female -0.271118 -0.206016 0.053470 Cuba <=50K
5 0.181564 Private 0.901345 Masters 1.520184 Married-civ-spouse Exec-managerial Wife White Female -0.271118 -0.206016 0.053470 United-States <=50K
6 0.955953 Private -0.279485 9th -1.968313 Married-spouse-absent Other-service Not-in-family Black Female -0.271118 -0.206016 -2.185441 Jamaica <=50K
7 0.955953 Self-emp-not-inc 0.189970 HS-grad -0.417870 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
8 -0.592825 Private -1.365494 Masters 1.520184 Never-married Prof-specialty Not-in-family White Female 5.032415 -0.206016 1.172925 United-States >50K
9 0.181564 Private -0.286491 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male 2.380648 -0.206016 0.053470 United-States >50K
10 0.181564 Private 0.862254 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband Black Male -0.271118 -0.206016 2.292380 United-States >50K
11 -0.592825 State-gov -0.458800 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 India >50K
12 -1.367214 Private -0.639397 Bachelors 1.132573 Never-married Adm-clerical Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
13 -0.592825 Private 0.146086 Assoc-acdm 0.744962 Never-married Sales Not-in-family Black Male -0.271118 -0.206016 1.172925 United-States <=50K
14 0.181564 Private -0.644143 Assoc-voc 0.357352 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male -0.271118 -0.206016 0.053470 ? >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.181564 Private -0.151677 Some-college -0.030259 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34175 0.955953 Private -0.382480 HS-grad -0.417870 Separated Handlers-cleaners Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34176 1.730342 Private -0.407759 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 ? >50K
34177 1.730342 Private -0.153272 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34178 -1.367214 Private 0.323123 11th -1.193092 Never-married Other-service Own-child White Male -0.271118 -0.206016 -2.185441 United-States <=50K
34179 0.955953 Private -0.070743 Some-college -0.030259 Divorced Protective-serv Unmarried White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34180 -1.367214 Private -0.761452 Some-college -0.030259 Never-married Sales Other-relative Asian-Pac-Islander Male -0.271118 -0.206016 -2.185441 India <=50K
34181 0.955953 Self-emp-inc -0.367482 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 5.199568 0.053470 United-States >50K
34182 1.730342 Self-emp-not-inc -1.428648 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 -1.065986 United-States <=50K
34183 0.955953 Local-gov -0.817212 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States >50K
34184 1.730342 Private -0.753877 HS-grad -0.417870 Married-civ-spouse Other-service Husband Black Male -0.271118 -0.206016 0.053470 United-States <=50K
34185 0.181564 Private 0.311551 HS-grad -0.417870 Never-married Sales Not-in-family White Male -0.271118 7.001429 0.053470 El-Salvador <=50K
34186 -1.367214 ? -0.720198 HS-grad -0.417870 Never-married ? Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
34187 0.181564 ? 0.608356 11th -1.193092 Married-civ-spouse ? Wife White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34188 -1.367214 Private 1.113276 HS-grad -0.417870 Married-civ-spouse Machine-op-inspct Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K

34189 rows × 15 columns

[38]:
cleaned_test_df
[38]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.181564 Self-emp-not-inc 0.704744 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
34190 0.181564 State-gov -1.275191 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34191 -1.367214 Private -0.147766 Assoc-voc 0.357352 Never-married Other-service Own-child White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34192 0.955953 State-gov 0.655990 Some-college -0.030259 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 0.053470 United-States >50K
34193 0.955953 Private 0.818617 HS-grad -0.417870 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34194 -1.367214 Private -0.335985 Some-college -0.030259 Never-married Sales Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34195 -0.592825 Local-gov 0.197621 Some-college -0.030259 Married-civ-spouse Craft-repair Other-relative White Male -0.271118 -0.206016 0.053470 United-States <=50K
34196 -0.592825 Private 1.407546 Some-college -0.030259 Divorced Adm-clerical Unmarried Black Female -0.271118 -0.206016 -1.065986 United-States <=50K
34197 -0.592825 State-gov 0.149067 Bachelors 1.132573 Never-married Prof-specialty Not-in-family White Female -0.271118 -0.206016 -2.185441 United-States <=50K
34198 -1.367214 Private -0.020718 Some-college -0.030259 Separated Other-service Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
34199 -0.592825 Private -0.342117 9th -1.968313 Separated Craft-repair Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
34200 -0.592825 Local-gov -0.376300 Some-college -0.030259 Divorced Adm-clerical Unmarried White Female -0.271118 -0.206016 0.053470 United-States <=50K
34201 0.181564 Private 1.987078 Some-college -0.030259 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
34202 -1.367214 ? 0.042399 Some-college -0.030259 Never-married ? Own-child White Female -0.271118 -0.206016 -1.065986 United-States <=50K
34203 0.181564 Private -1.382011 Assoc-acdm 0.744962 Married-spouse-absent Adm-clerical Other-relative White Male -0.271118 -0.206016 1.172925 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.955953 Private 0.332483 HS-grad -0.417870 Separated Priv-house-serv Not-in-family White Female -0.271118 -0.206016 -1.065986 United-States <=50K
48828 0.181564 Private 0.549787 Assoc-voc 0.357352 Never-married Adm-clerical Unmarried Black Female -0.271118 -0.206016 0.053470 United-States <=50K
48829 1.730342 Private 0.978500 Assoc-acdm 0.744962 Divorced Prof-specialty Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48830 -0.592825 Private -0.153595 HS-grad -0.417870 Married-civ-spouse Handlers-cleaners Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48831 0.955953 Private 0.910723 HS-grad -0.417870 Married-civ-spouse Adm-clerical Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48832 1.730342 Private -0.948722 HS-grad -0.417870 Married-civ-spouse Sales Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48833 -0.592825 Private 2.377888 HS-grad -0.417870 Married-civ-spouse Craft-repair Husband White Male -0.271118 -0.206016 0.053470 United-States <=50K
48834 -1.367214 Private 1.531605 HS-grad -0.417870 Never-married Other-service Own-child White Female -0.271118 -0.206016 0.053470 United-States <=50K
48835 0.955953 Local-gov 1.515021 Masters 1.520184 Divorced Other-service Not-in-family White Male -0.271118 -0.206016 0.053470 United-States <=50K
48836 -0.592825 Private 0.527612 Bachelors 1.132573 Never-married Prof-specialty Own-child White Male -0.271118 -0.206016 0.053470 United-States <=50K
48837 0.181564 Private 0.244809 Bachelors 1.132573 Divorced Prof-specialty Not-in-family White Female -0.271118 -0.206016 0.053470 United-States <=50K
48838 1.730342 ? 1.250871 HS-grad -0.417870 Widowed ? Other-relative Black Male -0.271118 -0.206016 0.053470 United-States <=50K
48839 0.181564 Private 1.759484 Bachelors 1.132573 Married-civ-spouse Prof-specialty Husband White Male -0.271118 -0.206016 1.172925 United-States <=50K
48840 0.181564 Private -1.003732 Bachelors 1.132573 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 2.380648 -0.206016 0.053470 United-States <=50K
48841 -0.592825 Self-emp-inc -0.071019 Bachelors 1.132573 Married-civ-spouse Exec-managerial Husband White Male -0.271118 -0.206016 1.172925 United-States >50K

14653 rows × 15 columns

[39]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", cat_encoding="one_hot")
[40]:
cleaned_training_df
[40]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.181564 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.064247 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] 1.054765 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
1 0.955953 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.009237 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
2 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.246964 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
3 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.428035 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
4 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.412302 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
5 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.901345 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
6 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.279485 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... -1.968313 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
7 0.955953 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.189970 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
8 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.365494 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] 5.032415 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
9 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.286491 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] 2.380648 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
10 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.862254 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 2.292380 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
11 -0.592825 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.458800 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
12 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.639397 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
13 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.146086 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
14 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.644143 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.151677 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34175 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.382480 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34176 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.407759 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34177 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.153272 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34178 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.323123 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34179 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.070743 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34180 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.761452 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -2.185441 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34181 0.955953 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] -0.367482 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 5.199568 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34182 1.730342 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.428648 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34183 0.955953 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] -0.817212 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34184 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.753877 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34185 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.311551 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 7.001429 0.053470 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34186 -1.367214 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] -0.720198 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34187 0.181564 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 0.608356 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34188 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.113276 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K

34189 rows × 15 columns

[41]:
cleaned_test_df
[41]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.181564 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.704744 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34190 0.181564 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.275191 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34191 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.147766 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34192 0.955953 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.655990 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34193 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.818617 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34194 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.335985 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34195 -0.592825 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] 0.197621 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34196 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.407546 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34197 -0.592825 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.149067 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -2.185441 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34198 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.020718 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34199 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.342117 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... -1.968313 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34200 -0.592825 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] -0.376300 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34201 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.987078 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34202 -1.367214 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 0.042399 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34203 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.382011 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.332483 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 -1.065986 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48828 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.549787 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48829 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.978500 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48830 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.153595 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48831 0.955953 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.910723 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48832 1.730342 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.948722 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48833 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 2.377888 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48834 -1.367214 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.531605 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48835 0.955953 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] 1.515021 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48836 -0.592825 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.527612 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48837 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.244809 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48838 1.730342 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 1.250871 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48839 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.759484 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48840 0.181564 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.003732 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] 2.380648 -0.206016 0.053470 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48841 -0.592825 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] -0.071019 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] -0.271118 -0.206016 1.172925 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K

14653 rows × 15 columns

6. variance_threshold and variance parameters

There are two choices for the variance_threshold parameter:

  • True: drop numerical columns whose variance is less than the value of the variance parameter.

  • False: do nothing.

The default variance_threshold is False.

The default variance is 0.0.
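The filtering rule described above can be sketched in plain pandas. This is only an illustration, not the code clean_ml() runs: the helper name drop_low_variance is hypothetical, and computing the population variance (ddof=0) on the training set only, then dropping the same columns from the test set, are assumptions.

```python
import pandas as pd


def drop_low_variance(train, test, threshold):
    """Drop numerical columns whose training-set variance is below threshold.

    Hypothetical helper for illustration; not part of clean_ml().
    """
    # Only numerical columns are candidates for removal.
    num_cols = train.select_dtypes(include="number").columns
    # Assumption: population variance (ddof=0), computed on the training set.
    variances = train[num_cols].var(ddof=0)
    to_drop = variances[variances < threshold].index.tolist()
    # Assumption: the same columns are removed from both sets.
    return train.drop(columns=to_drop), test.drop(columns=to_drop), to_drop


train = pd.DataFrame({
    "a": [1.0, 1.0, 1.0, 2.0],     # variance 0.1875, below the threshold
    "b": [0.0, 10.0, 20.0, 30.0],  # variance 125.0, above the threshold
    "c": ["x", "y", "x", "y"],     # categorical, never filtered
})
test = pd.DataFrame({"a": [1.0, 2.0], "b": [5.0, 15.0], "c": ["y", "x"]})

new_train, new_test, dropped = drop_low_variance(train, test, threshold=6.0)
print(dropped)
print(list(new_train.columns))
```

With a threshold of 6.0, only column "a" falls below the cutoff in this toy data, so it is removed from both frames while the categorical column is left untouched.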

[42]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class",
                                                variance_threshold=True, variance=6.0)
[43]:
cleaned_training_df
[43]:
workclass fnlwgt education education-num marital-status occupation relationship race sex native-country class
0 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.064247 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
1 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.009237 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
2 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.246964 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
3 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.428035 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
4 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.412302 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
5 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.901345 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
6 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.279485 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... -1.968313 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
7 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.189970 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
8 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.365494 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
9 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.286491 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
10 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.862254 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
11 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.458800 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
12 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.639397 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
13 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.146086 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
14 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.644143 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... >50K
... ... ... ... ... ... ... ... ... ... ... ...
34174 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.151677 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34175 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.382480 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34176 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.407759 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34177 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.153272 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34178 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.323123 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34179 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.070743 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34180 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.761452 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34181 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] -0.367482 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34182 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.428648 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34183 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] -0.817212 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34184 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.753877 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34185 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.311551 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34186 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] -0.720198 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34187 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 0.608356 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -1.193092 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34188 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.113276 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K

34189 rows × 11 columns

[44]:
cleaned_test_df
[44]:
workclass fnlwgt education education-num marital-status occupation relationship race sex native-country class
34189 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.704744 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34190 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.275191 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34191 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.147766 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34192 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.655990 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K
34193 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.818617 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34194 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.335985 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34195 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] 0.197621 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34196 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.407546 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34197 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.149067 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34198 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.020718 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34199 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.342117 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... -1.968313 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34200 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] -0.376300 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34201 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.987078 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34202 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 0.042399 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... -0.030259 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
34203 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.382011 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
... ... ... ... ... ... ... ... ... ... ... ...
48827 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.332483 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48828 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.549787 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... 0.357352 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 1.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48829 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.978500 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... 0.744962 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48830 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.153595 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48831 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.910723 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48832 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -0.948722 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48833 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 2.377888 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48834 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.531605 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48835 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] 1.515021 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.520184 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48836 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.527612 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48837 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.244809 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48838 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0] 1.250871 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... -0.417870 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] [0.0, 1.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48839 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 1.759484 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48840 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -1.003732 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... <=50K
48841 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0] -0.071019 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 1.132573 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0] [1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... >50K

14653 rows × 11 columns

7. num_scaling parameter

There are three choices for the num_scaling parameter:

  • standardize: standardizes each numerical column using the mean and standard deviation of that column. The transformation is (x - mean) / std.

  • minmax: scales each numerical column using the minimum and maximum of that column. The transformation is (x - min) / (max - min).

  • maxabs: scales each numerical column by the maximum absolute value of that column. The transformation is x / maxabs.

The default num_scaling is standardize.
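The three transformations can be checked by hand on a small column. The following is a minimal sketch in plain pandas, not the code clean_ml() runs internally. The toy values are made up, and the population standard deviation (ddof=0) is an assumption; clean_ml() may use a different convention.

```python
import pandas as pd

# Toy numerical column; the values are made up for illustration.
col = pd.Series([-4.0, 0.0, 2.0, 6.0])

# standardize: (x - mean) / std
# Assumption: population standard deviation (ddof=0).
standardized = (col - col.mean()) / col.std(ddof=0)

# minmax: (x - min) / (max - min), which maps the column onto [0, 1]
minmax = (col - col.min()) / (col.max() - col.min())

# maxabs: x / max(|x|), which maps the column onto [-1, 1] and keeps signs
maxabs = col / col.abs().max()

print(pd.DataFrame({"standardize": standardized,
                    "minmax": minmax,
                    "maxabs": maxabs}))
```

Unlike minmax, maxabs does not shift the data, so zeros stay at zero and negative values stay negative.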

[55]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class",
                                                cat_encoding='no_encoding',
                                                num_scaling="minmax")
[56]:
cleaned_training_df
[56]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.50 State-gov 0.044302 Bachelors 0.800000 Never-married Adm-clerical Not-in-family White Male 0.25 0.00 0.50 United-States <=50K
1 0.75 Self-emp-not-inc 0.048238 Bachelors 0.800000 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.00 0.00 United-States <=50K
2 0.50 Private 0.138113 HS-grad 0.533333 Divorced Handlers-cleaners Not-in-family White Male 0.00 0.00 0.50 United-States <=50K
3 0.75 Private 0.151068 11th 0.400000 Married-civ-spouse Handlers-cleaners Husband Black Male 0.00 0.00 0.50 United-States <=50K
4 0.25 Private 0.221488 Bachelors 0.800000 Married-civ-spouse Prof-specialty Wife Black Female 0.00 0.00 0.50 Cuba <=50K
5 0.50 Private 0.184932 Masters 0.866667 Married-civ-spouse Exec-managerial Wife White Female 0.00 0.00 0.50 United-States <=50K
6 0.75 Private 0.100448 9th 0.266667 Married-spouse-absent Other-service Not-in-family Black Female 0.00 0.00 0.00 Jamaica <=50K
7 0.75 Self-emp-not-inc 0.134036 HS-grad 0.533333 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.00 0.50 United-States >50K
8 0.25 Private 0.022749 Masters 0.866667 Never-married Prof-specialty Not-in-family White Female 1.00 0.00 0.75 United-States >50K
9 0.50 Private 0.099947 Bachelors 0.800000 Married-civ-spouse Exec-managerial Husband White Male 0.50 0.00 0.50 United-States >50K
10 0.50 Private 0.182135 Some-college 0.600000 Married-civ-spouse Exec-managerial Husband Black Male 0.00 0.00 1.00 United-States >50K
11 0.25 State-gov 0.087619 Bachelors 0.800000 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0.00 0.00 0.50 India >50K
12 0.00 Private 0.074698 Bachelors 0.800000 Never-married Adm-clerical Own-child White Female 0.00 0.00 0.25 United-States <=50K
13 0.25 Private 0.130896 Assoc-acdm 0.733333 Never-married Sales Not-in-family Black Male 0.00 0.00 0.75 United-States <=50K
14 0.50 Private 0.074359 Assoc-voc 0.666667 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 0.00 0.00 0.50 ? >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.50 Private 0.109592 Some-college 0.600000 Married-civ-spouse Prof-specialty Husband White Male 0.00 0.00 0.50 United-States >50K
34175 0.75 Private 0.093079 HS-grad 0.533333 Separated Handlers-cleaners Not-in-family White Male 0.00 0.00 0.50 United-States <=50K
34176 1.00 Private 0.091271 HS-grad 0.533333 Married-civ-spouse Other-service Husband Black Male 0.00 0.00 0.50 ? >50K
34177 1.00 Private 0.109478 Bachelors 0.800000 Divorced Prof-specialty Not-in-family White Female 0.00 0.00 0.25 United-States <=50K
34178 0.00 Private 0.143562 11th 0.400000 Never-married Other-service Own-child White Male 0.00 0.00 0.00 United-States <=50K
34179 0.75 Private 0.115383 Some-college 0.600000 Divorced Protective-serv Unmarried White Female 0.00 0.00 0.25 United-States <=50K
34180 0.00 Private 0.065966 Some-college 0.600000 Never-married Sales Other-relative Asian-Pac-Islander Male 0.00 0.00 0.00 India <=50K
34181 0.75 Self-emp-inc 0.094152 Some-college 0.600000 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.75 0.50 United-States >50K
34182 1.00 Self-emp-not-inc 0.018231 HS-grad 0.533333 Married-civ-spouse Craft-repair Husband White Male 0.00 0.00 0.25 United-States <=50K
34183 0.75 Local-gov 0.061976 Bachelors 0.800000 Married-civ-spouse Prof-specialty Husband White Male 0.00 0.00 0.75 United-States >50K
34184 1.00 Private 0.066508 HS-grad 0.533333 Married-civ-spouse Other-service Husband Black Male 0.00 0.00 0.50 United-States <=50K
34185 0.50 Private 0.142734 HS-grad 0.533333 Never-married Sales Not-in-family White Male 0.00 1.00 0.50 El-Salvador <=50K
34186 0.00 ? 0.068917 HS-grad 0.533333 Never-married ? Own-child White Female 0.00 0.00 0.50 United-States <=50K
34187 0.50 ? 0.163970 11th 0.400000 Married-civ-spouse ? Wife White Female 0.00 0.00 0.00 United-States <=50K
34188 0.00 Private 0.200094 HS-grad 0.533333 Married-civ-spouse Machine-op-inspct Husband White Male 0.00 0.00 0.50 United-States <=50K

34189 rows × 15 columns

[57]:
cleaned_test_df
[57]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.50 Self-emp-not-inc 0.170866 Some-college 0.600000 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 0.75 United-States <=50K
34190 0.50 State-gov 0.029210 Bachelors 0.800000 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 0.50 United-States >50K
34191 0.00 Private 0.109872 Assoc-voc 0.666667 Never-married Other-service Own-child White Female 0.0 0.0 0.00 United-States <=50K
34192 0.75 State-gov 0.167378 Some-college 0.600000 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 0.50 United-States >50K
34193 0.75 Private 0.179013 HS-grad 0.533333 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 0.50 United-States <=50K
34194 0.00 Private 0.096406 Some-college 0.600000 Never-married Sales Own-child White Female 0.0 0.0 0.25 United-States <=50K
34195 0.25 Local-gov 0.134583 Some-college 0.600000 Married-civ-spouse Craft-repair Other-relative White Male 0.0 0.0 0.50 United-States <=50K
34196 0.25 Private 0.221148 Some-college 0.600000 Divorced Adm-clerical Unmarried Black Female 0.0 0.0 0.25 United-States <=50K
34197 0.25 State-gov 0.131109 Bachelors 0.800000 Never-married Prof-specialty Not-in-family White Female 0.0 0.0 0.00 United-States <=50K
34198 0.00 Private 0.118962 Some-college 0.600000 Separated Other-service Own-child White Male 0.0 0.0 0.50 United-States <=50K
34199 0.25 Private 0.095967 9th 0.266667 Separated Craft-repair Not-in-family White Male 0.0 0.0 0.50 United-States <=50K
34200 0.25 Local-gov 0.093522 Some-college 0.600000 Divorced Adm-clerical Unmarried White Female 0.0 0.0 0.50 United-States <=50K
34201 0.50 Private 0.262611 Some-college 0.600000 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 0.50 United-States <=50K
34202 0.00 ? 0.123478 Some-college 0.600000 Never-married ? Own-child White Female 0.0 0.0 0.25 United-States <=50K
34203 0.50 Private 0.021567 Assoc-acdm 0.733333 Married-spouse-absent Adm-clerical Other-relative White Male 0.0 0.0 0.75 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.75 Private 0.144232 HS-grad 0.533333 Separated Priv-house-serv Not-in-family White Female 0.0 0.0 0.25 United-States <=50K
48828 0.50 Private 0.159779 Assoc-voc 0.666667 Never-married Adm-clerical Unmarried Black Female 0.0 0.0 0.50 United-States <=50K
48829 1.00 Private 0.190452 Assoc-acdm 0.733333 Divorced Prof-specialty Not-in-family White Male 0.0 0.0 0.50 United-States <=50K
48830 0.25 Private 0.109455 HS-grad 0.533333 Married-civ-spouse Handlers-cleaners Husband White Male 0.0 0.0 0.50 United-States <=50K
48831 0.75 Private 0.185603 HS-grad 0.533333 Married-civ-spouse Adm-clerical Husband White Male 0.0 0.0 0.50 United-States <=50K
48832 1.00 Private 0.052567 HS-grad 0.533333 Married-civ-spouse Sales Husband White Male 0.0 0.0 0.75 United-States <=50K
48833 0.25 Private 0.290572 HS-grad 0.533333 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 0.50 United-States <=50K
48834 0.00 Private 0.230024 HS-grad 0.533333 Never-married Other-service Own-child White Female 0.0 0.0 0.50 United-States <=50K
48835 0.75 Local-gov 0.228838 Masters 0.866667 Divorced Other-service Not-in-family White Male 0.0 0.0 0.50 United-States <=50K
48836 0.25 Private 0.158193 Bachelors 0.800000 Never-married Prof-specialty Own-child White Male 0.0 0.0 0.50 United-States <=50K
48837 0.50 Private 0.137959 Bachelors 0.800000 Divorced Prof-specialty Not-in-family White Female 0.0 0.0 0.50 United-States <=50K
48838 1.00 ? 0.209939 HS-grad 0.533333 Widowed ? Other-relative Black Male 0.0 0.0 0.50 United-States <=50K
48839 0.50 Private 0.246328 Bachelors 0.800000 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 0.75 United-States <=50K
48840 0.50 Private 0.048632 Bachelors 0.800000 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 0.5 0.0 0.50 United-States <=50K
48841 0.25 Self-emp-inc 0.115363 Bachelors 0.800000 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 0.75 United-States >50K

14653 rows × 15 columns

8. include_operators and exclude_operators parameters

The include_operators parameter specifies which operators must be included in the cleaning pipeline. It is a list of operator names. For example:

  • ['one_hot', 'minmax', 'median', 'most_frequent']

The exclude_operators parameter specifies which operators must be excluded from the cleaning pipeline. It has the same format as include_operators.

The valid choices for include_operators and exclude_operators are:

  • one_hot

  • constant

  • most_frequent

  • drop

  • mean

  • median

  • standardize

  • minmax

  • maxabs
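Because both parameters are plain lists of strings, a misspelled operator name is easy to miss. The sketch below shows one way to check the lists before passing them on. Both `check_operators` and `VALID_OPERATORS` are hypothetical names introduced here for illustration; they are not part of clean_ml(), and whether clean_ml() itself rejects unknown or conflicting names is not stated in this guide.

```python
# Operator names taken from the list of valid choices above.
VALID_OPERATORS = {
    "one_hot", "constant", "most_frequent", "drop",
    "mean", "median", "standardize", "minmax", "maxabs",
}

def check_operators(include_operators=None, exclude_operators=None):
    """Hypothetical helper: reject unknown names and names listed in both."""
    include = list(include_operators or [])
    exclude = list(exclude_operators or [])

    unknown = sorted(set(include + exclude) - VALID_OPERATORS)
    if unknown:
        raise ValueError(f"Unknown operators: {unknown}")

    conflict = sorted(set(include) & set(exclude))
    if conflict:
        raise ValueError(f"Operators both included and excluded: {conflict}")

    return include, exclude

include_ops, exclude_ops = check_operators(
    include_operators=["one_hot", "minmax", "median", "most_frequent"],
    exclude_operators=["drop"],
)
print(include_ops, exclude_ops)
```

The checked lists can then be passed as include_operators and exclude_operators.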

9. customized_cat_pipeline and customized_num_pipeline parameters

Experienced users can specify their own customized_cat_pipeline and customized_num_pipeline. Each parameter is a list of dictionaries, one per component. Each component is itself a dictionary that gives the name of the chosen operator and its related parameters. For example:

  • [{"cat_imputation": {"operator": 'constant', "cat_null_value": ['?'], "fill_val": "Hahahaha!!!!!"}}]

Users can also specify their own operators. An operator is an ordinary class that defines __init__, fit, transform, and fit_transform. To use it, put the class name in the operator's position.

[58]:
import pandas as pd

class MaxAbsScaler:
    """User-defined operator: scales a column by its maximum absolute value."""

    def __init__(self) -> None:
        self.name = "maxabsScaler"

    def fit(self,
            df: pd.Series) -> "MaxAbsScaler":
        # Learn the maximum absolute value from the column passed to fit
        self.maxabs = df.abs().max()
        return self

    def transform(self,
            df: pd.Series) -> pd.Series:
        # Apply the scaling learned in fit to every value
        result = df.map(self.compute_val)
        return result

    def fit_transform(self,
            df: pd.Series) -> pd.Series:
        return self.fit(df).transform(df)

    def compute_val(self, val):
        return val / self.maxabs

customized_cat_pipeline = [
    {"cat_imputation": {"operator": 'constant', "cat_null_value": ['?'], "fill_val": "Hahahaha!!!!!"}},
]
customized_num_pipeline = [
    {"num_scaling": {"operator": MaxAbsScaler}},
]
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df,
                                                customized_cat_pipeline=customized_cat_pipeline,
                                                customized_num_pipeline=customized_num_pipeline)
[59]:
cleaned_training_df
[59]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 0.50 State-gov 0.052210 Bachelors 0.8125 Never-married Adm-clerical Not-in-family White Male 0.25 0.00 0.50 United-States <=50K
1 0.75 Self-emp-not-inc 0.056113 Bachelors 0.8125 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.00 0.00 United-States <=50K
2 0.50 Private 0.145245 HS-grad 0.5625 Divorced Handlers-cleaners Not-in-family White Male 0.00 0.00 0.50 United-States <=50K
3 0.75 Private 0.158093 11th 0.4375 Married-civ-spouse Handlers-cleaners Husband Black Male 0.00 0.00 0.50 United-States <=50K
4 0.25 Private 0.227930 Bachelors 0.8125 Married-civ-spouse Prof-specialty Wife Black Female 0.00 0.00 0.50 Cuba <=50K
5 0.50 Private 0.191676 Masters 0.8750 Married-civ-spouse Exec-managerial Wife White Female 0.00 0.00 0.50 United-States <=50K
6 0.75 Private 0.107891 9th 0.3125 Married-spouse-absent Other-service Not-in-family Black Female 0.00 0.00 0.00 Jamaica <=50K
7 0.75 Self-emp-not-inc 0.141201 HS-grad 0.5625 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.00 0.50 United-States >50K
8 0.25 Private 0.030835 Masters 0.8750 Never-married Prof-specialty Not-in-family White Female 1.00 0.00 0.75 United-States >50K
9 0.50 Private 0.107394 Bachelors 0.8125 Married-civ-spouse Exec-managerial Husband White Male 0.50 0.00 0.50 United-States >50K
10 0.50 Private 0.188902 Some-college 0.6250 Married-civ-spouse Exec-managerial Husband Black Male 0.00 0.00 1.00 United-States >50K
11 0.25 State-gov 0.095168 Bachelors 0.8125 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0.00 0.00 0.50 India >50K
12 0.00 Private 0.082354 Bachelors 0.8125 Never-married Adm-clerical Own-child White Female 0.00 0.00 0.25 United-States <=50K
13 0.25 Private 0.138087 Assoc-acdm 0.7500 Never-married Sales Not-in-family Black Male 0.00 0.00 0.75 United-States <=50K
14 0.50 Private 0.082018 Assoc-voc 0.6875 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 0.00 0.00 0.50 Hahahaha!!!!! >50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34174 0.50 Private 0.116960 Some-college 0.6250 Married-civ-spouse Prof-specialty Husband White Male 0.00 0.00 0.50 United-States >50K
34175 0.75 Private 0.100584 HS-grad 0.5625 Separated Handlers-cleaners Not-in-family White Male 0.00 0.00 0.50 United-States <=50K
34176 1.00 Private 0.098790 HS-grad 0.5625 Married-civ-spouse Other-service Husband Black Male 0.00 0.00 0.50 Hahahaha!!!!! >50K
34177 1.00 Private 0.116847 Bachelors 0.8125 Divorced Prof-specialty Not-in-family White Female 0.00 0.00 0.25 United-States <=50K
34178 0.00 Private 0.150649 11th 0.4375 Never-married Other-service Own-child White Male 0.00 0.00 0.00 United-States <=50K
34179 0.75 Private 0.122702 Some-college 0.6250 Divorced Protective-serv Unmarried White Female 0.00 0.00 0.25 United-States <=50K
34180 0.00 Private 0.073694 Some-college 0.6250 Never-married Sales Other-relative Asian-Pac-Islander Male 0.00 0.00 0.00 India <=50K
34181 0.75 Self-emp-inc 0.101648 Some-college 0.6250 Married-civ-spouse Exec-managerial Husband White Male 0.00 0.75 0.50 United-States >50K
34182 1.00 Self-emp-not-inc 0.026354 HS-grad 0.5625 Married-civ-spouse Craft-repair Husband White Male 0.00 0.00 0.25 United-States <=50K
34183 0.75 Local-gov 0.069738 Bachelors 0.8125 Married-civ-spouse Prof-specialty Husband White Male 0.00 0.00 0.75 United-States >50K
34184 1.00 Private 0.074232 HS-grad 0.5625 Married-civ-spouse Other-service Husband Black Male 0.00 0.00 0.50 United-States <=50K
34185 0.50 Private 0.149828 HS-grad 0.5625 Never-married Sales Not-in-family White Male 0.00 1.00 0.50 El-Salvador <=50K
34186 0.00 Hahahaha!!!!! 0.076621 HS-grad 0.5625 Never-married Hahahaha!!!!! Own-child White Female 0.00 0.00 0.50 United-States <=50K
34187 0.50 Hahahaha!!!!! 0.170887 11th 0.4375 Married-civ-spouse Hahahaha!!!!! Wife White Female 0.00 0.00 0.00 United-States <=50K
34188 0.00 Private 0.206713 HS-grad 0.5625 Married-civ-spouse Machine-op-inspct Husband White Male 0.00 0.00 0.50 United-States <=50K

34189 rows × 15 columns

[60]:
cleaned_test_df
[60]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
34189 0.50 Self-emp-not-inc 0.177726 Some-college 0.6250 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 0.75 United-States <=50K
34190 0.50 State-gov 0.037242 Bachelors 0.8125 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 0.50 United-States >50K
34191 0.00 Private 0.117237 Assoc-voc 0.6875 Never-married Other-service Own-child White Female 0.0 0.0 0.00 United-States <=50K
34192 0.75 State-gov 0.174267 Some-college 0.6250 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 0.50 United-States >50K
34193 0.75 Private 0.185806 HS-grad 0.5625 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 0.50 United-States <=50K
34194 0.00 Private 0.103883 Some-college 0.6250 Never-married Sales Own-child White Female 0.0 0.0 0.25 United-States <=50K
34195 0.25 Local-gov 0.141744 Some-college 0.6250 Married-civ-spouse Craft-repair Other-relative White Male 0.0 0.0 0.50 United-States <=50K
34196 0.25 Private 0.227593 Some-college 0.6250 Divorced Adm-clerical Unmarried Black Female 0.0 0.0 0.25 United-States <=50K
34197 0.25 State-gov 0.138299 Bachelors 0.8125 Never-married Prof-specialty Not-in-family White Female 0.0 0.0 0.00 United-States <=50K
34198 0.00 Private 0.126252 Some-college 0.6250 Separated Other-service Own-child White Male 0.0 0.0 0.50 United-States <=50K
34199 0.25 Private 0.103447 9th 0.3125 Separated Craft-repair Not-in-family White Male 0.0 0.0 0.50 United-States <=50K
34200 0.25 Local-gov 0.101022 Some-college 0.6250 Divorced Adm-clerical Unmarried White Female 0.0 0.0 0.50 United-States <=50K
34201 0.50 Private 0.268713 Some-college 0.6250 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 0.50 United-States <=50K
34202 0.00 Hahahaha!!!!! 0.130730 Some-college 0.6250 Never-married Hahahaha!!!!! Own-child White Female 0.0 0.0 0.25 United-States <=50K
34203 0.50 Private 0.029663 Assoc-acdm 0.7500 Married-spouse-absent Adm-clerical Other-relative White Male 0.0 0.0 0.75 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48827 0.75 Private 0.151313 HS-grad 0.5625 Separated Priv-house-serv Not-in-family White Female 0.0 0.0 0.25 United-States <=50K
48828 0.50 Private 0.166731 Assoc-voc 0.6875 Never-married Adm-clerical Unmarried Black Female 0.0 0.0 0.50 United-States <=50K
48829 1.00 Private 0.197150 Assoc-acdm 0.7500 Divorced Prof-specialty Not-in-family White Male 0.0 0.0 0.50 United-States <=50K
48830 0.25 Private 0.116824 HS-grad 0.5625 Married-civ-spouse Handlers-cleaners Husband White Male 0.0 0.0 0.50 United-States <=50K
48831 0.75 Private 0.192341 HS-grad 0.5625 Married-civ-spouse Adm-clerical Husband White Male 0.0 0.0 0.50 United-States <=50K
48832 1.00 Private 0.060407 HS-grad 0.5625 Married-civ-spouse Sales Husband White Male 0.0 0.0 0.75 United-States <=50K
48833 0.25 Private 0.296442 HS-grad 0.5625 Married-civ-spouse Craft-repair Husband White Male 0.0 0.0 0.50 United-States <=50K
48834 0.00 Private 0.236395 HS-grad 0.5625 Never-married Other-service Own-child White Female 0.0 0.0 0.50 United-States <=50K
48835 0.75 Local-gov 0.235218 Masters 0.8750 Divorced Other-service Not-in-family White Male 0.0 0.0 0.50 United-States <=50K
48836 0.25 Private 0.165158 Bachelors 0.8125 Never-married Prof-specialty Own-child White Male 0.0 0.0 0.50 United-States <=50K
48837 0.50 Private 0.145092 Bachelors 0.8125 Divorced Prof-specialty Not-in-family White Female 0.0 0.0 0.50 United-States <=50K
48838 1.00 Hahahaha!!!!! 0.216476 HS-grad 0.5625 Widowed Hahahaha!!!!! Other-relative Black Male 0.0 0.0 0.50 United-States <=50K
48839 0.50 Private 0.252564 Bachelors 0.8125 Married-civ-spouse Prof-specialty Husband White Male 0.0 0.0 0.75 United-States <=50K
48840 0.50 Private 0.056503 Bachelors 0.8125 Divorced Adm-clerical Own-child Asian-Pac-Islander Male 0.5 0.0 0.50 United-States <=50K
48841 0.25 Self-emp-inc 0.122683 Bachelors 0.8125 Married-civ-spouse Exec-managerial Husband White Male 0.0 0.0 0.75 United-States >50K

14653 rows × 15 columns
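The operator interface described in this section (__init__, fit, transform, fit_transform) can also be exercised on its own, outside clean_ml(). The sketch below uses a hypothetical `MedianCenterer` class and made-up values. It assumes the operator is constructed without arguments, as MaxAbsScaler is, and that statistics learned in fit on a training column are reused when transforming a test column.

```python
import pandas as pd

class MedianCenterer:
    """Hypothetical operator: subtracts the fitted median from a column."""

    def __init__(self) -> None:
        self.median = None

    def fit(self, col: pd.Series) -> "MedianCenterer":
        # Learn the statistic from the column passed to fit
        self.median = col.median()
        return self

    def transform(self, col: pd.Series) -> pd.Series:
        # Reuse the fitted median; nothing is recomputed here
        return col - self.median

    def fit_transform(self, col: pd.Series) -> pd.Series:
        return self.fit(col).transform(col)

# Made-up columns standing in for one numerical feature
train_col = pd.Series([1.0, 2.0, 9.0])
test_col = pd.Series([2.0, 4.0])

op = MedianCenterer()
train_out = op.fit_transform(train_col)  # centered on the training median
test_out = op.transform(test_col)        # centered on the same median

print(train_out.tolist(), test_out.tolist())
```

Following the pattern of the MaxAbsScaler cell, such a class would be referenced as {"num_scaling": {"operator": MedianCenterer}} in customized_num_pipeline.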
