The function clean_headers() cleans the column headers of a DataFrame and standardizes them in a chosen case style.
Column names can be converted to the following case styles via the case parameter:
snake: “column_name”
kebab: “column-name”
camel: “columnName”
pascal: “ColumnName”
const: “COLUMN_NAME”
sentence: “Column name”
title: “Column Name”
lower: “column name”
upper: “COLUMN NAME”
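The case styles listed above can be illustrated with a small standalone sketch. It is not the dataprep implementation: the helpers `split_words` and `to_case` are hypothetical names, and the word-splitting rule (split on non-alphanumeric characters and on lower-to-upper boundaries) is an assumption chosen for illustration.

```python
import re


def split_words(name):
    """Split a header into lowercase words (hypothetical helper)."""
    # Mark lower-to-upper boundaries so "bookTitle" becomes "book Title".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Any run of non-alphanumeric characters acts as a separator.
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def to_case(name, case="snake"):
    """Join the words of `name` in the requested case style."""
    words = split_words(name)
    if not words:
        return ""
    styles = {
        "snake": "_".join(words),
        "kebab": "-".join(words),
        "camel": words[0] + "".join(w.capitalize() for w in words[1:]),
        "pascal": "".join(w.capitalize() for w in words),
        "const": "_".join(words).upper(),
        "sentence": " ".join(words).capitalize(),
        "title": " ".join(w.capitalize() for w in words),
        "lower": " ".join(words),
        "upper": " ".join(words).upper(),
    }
    return styles[case]


# Print every style for one sample header.
for style in ("snake", "kebab", "camel", "pascal", "const",
              "sentence", "title", "lower", "upper"):
    print(style, "->", to_case("Column Name", style))
```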
After cleaning, a report is printed that gives the number and percentage of column headers that were cleaned. A header counts as cleaned only if its value was changed by the transformation.
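The figures in the report can be reproduced with a short sketch. It assumes that a header counts as cleaned exactly when its final name differs from the original one; `cleaning_report` is a hypothetical helper, not part of dataprep.

```python
def cleaning_report(original, cleaned):
    """Build a report line from the headers before and after cleaning."""
    # Assumption: a header is "cleaned" when its name actually changed.
    changed = sum(1 for old, new in zip(original, cleaned) if old != new)
    percent = round(100 * changed / len(original), 1) if original else 0.0
    return f"Column Headers Cleaning Report: {changed} values cleaned ({percent}%)"


# One of the two headers is already in camel case, so only one is cleaned.
print(cleaning_report(["bookTitle", "Number_Of_Pages"],
                      ["bookTitle", "numberOfPages"]))
```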
The following sections demonstrate the functionality of clean_headers().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "ISBN": [9781455582341],
        "isbn": [1455582328],
        "bookTitle": ["How Google Works"],
        "__Author": ["Eric Schmidt, Jonathan Rosenberg"],
        "Publication (year)": [2014],
        "éditeur": ["Grand Central Publishing"],
        "Number_Of_Pages": [305],
        "★ Rating": [4.06],
    }
)
df
By default, the case parameter is set to “snake” and the remove_accents parameter is set to True (strip accents and symbols from the column name).
[2]:
from dataprep.clean import clean_headers

clean_headers(df)
Column Headers Cleaning Report: 8 values cleaned (100.0%)
Note that “_1” is appended to the second instance of the column name “isbn” to distinguish it from the first instance after the transformation. Consequently, all column names are considered to have been cleaned in this example.
Column names that become duplicates as a result of calling clean_headers() are automatically renamed by appending a number. The separator placed before the number is inferred from the case parameter (for example, “_” for snake case).
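The renaming of duplicates can be sketched as follows. This is an illustration, not the library's code: `rename_duplicates` is a hypothetical helper, and only the “_” separator for snake case is confirmed by the example above. Any separator passed for other case styles is an assumption.

```python
def rename_duplicates(names, sep="_"):
    """Append a running number to repeated names, keeping the first as-is."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        # The first occurrence is unchanged; later ones get sep + number.
        result.append(name if count == 0 else f"{name}{sep}{count}")
        seen[name] = count + 1
    # Note: this sketch does not check whether a generated name such as
    # "isbn_1" collides with a header that already exists.
    return result


# "ISBN" and "isbn" both become "isbn" in snake case.
print(rename_duplicates(["isbn", "isbn", "book_title"]))
```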
This section demonstrates the supported case styles.
[3]:
clean_headers(df, case="kebab")
[4]:
clean_headers(df, case="camel")
Column Headers Cleaning Report: 7 values cleaned (87.5%)
[5]:
clean_headers(df, case="pascal")
[6]:
clean_headers(df, case="const")
[7]:
clean_headers(df, case="sentence")
[8]:
clean_headers(df, case="title")
[9]:
clean_headers(df, case="lower")
[10]:
clean_headers(df, case="upper")
replace
The replace parameter takes in a dictionary of values in the column names to be replaced by new values.
[11]:
clean_headers(df, replace={"éditeur": "publisher", "★": "star"})
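As a rough illustration of what such a dictionary does, the sketch below applies each key as a plain substring replacement. Whether dataprep matches substrings, whole names, or patterns is not stated here, so the matching rule and the helper name `apply_replacements` are assumptions.

```python
def apply_replacements(name, replace):
    """Replace every occurrence of each dictionary key in `name`."""
    # Assumption: keys are matched as literal substrings, in dict order.
    for old, new in replace.items():
        name = name.replace(old, new)
    return name


mapping = {"éditeur": "publisher", "★": "star"}
for header in ("éditeur", "★ Rating", "bookTitle"):
    print(repr(header), "->", repr(apply_replacements(header, mapping)))
```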
By default, the remove_accents parameter is set to True, which strips accents and symbols from the column names. If it is set to False, accents and symbols are kept.
[12]:
clean_headers(df, remove_accents=False)
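One common way to strip accents and symbols uses only the standard library: decompose each character with Unicode normalization, then drop everything that is not ASCII. The sketch below shows that technique. It is not necessarily how dataprep implements remove_accents, and `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(name):
    """Drop accents and non-ASCII symbols from a header."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding to ASCII with errors="ignore" discards the combining marks
    # and any symbol without an ASCII equivalent, such as "★".
    return decomposed.encode("ascii", "ignore").decode("ascii")


for header in ("éditeur", "★ Rating"):
    print(repr(header), "->", repr(strip_accents(header)))
```

The leading space left behind by a removed symbol would still need to be trimmed by a later step such as case conversion.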
Null column headers in the DataFrame are replaced with the default value “header”. As with other column names, duplicated values are renamed with appended numbers. Null header values include np.nan, None and the empty string.
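The null-header rule can be sketched without pandas. The helpers `is_null_header` and `fill_null_headers` are hypothetical names. The sketch covers only the three null values named above and leaves the numbering of the resulting duplicates to a separate step.

```python
import math


def is_null_header(value):
    """Return True for None, a float NaN, or the empty string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == ""


def fill_null_headers(headers, default="header"):
    """Replace null headers with `default`, leaving others untouched."""
    return [default if is_null_header(h) else h for h in headers]


# Duplicated "header" values would then be numbered like any other duplicate.
print(fill_null_headers(["", float("nan"), None, "N/A"]))
```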
[13]:
df = pd.DataFrame(
    {
        "": [9781455582341],
        np.nan: ["How Google Works"],
        None: ["Eric Schmidt, Jonathan Rosenberg"],
        "N/A": [2014],
    }
)
df
[14]:
clean_headers(df)
Column Headers Cleaning Report: 4 values cleaned (100.0%)