Column Headers¶

Introduction¶

The function clean_headers() cleans column headers in a DataFrame, and standardizes them in a desired format.

Column names can be converted to the following case styles via the case parameter:

snake: “column_name”
kebab: “column-name”
camel: “columnName”
pascal: “ColumnName”
const: “COLUMN_NAME”
sentence: “Column name”
title: “Column Name”
lower: “column name”
upper: “COLUMN NAME”

After cleaning, a report is printed that provides the number and percentage of values that were cleaned (the value must be transformed).

The following sections demonstrate the functionality of clean_headers().

An example dirty dataset¶

[1]:

import pandas as pd
import numpy as np
df = pd.DataFrame(
        {
            "ISBN": [9781455582341],
            "isbn": [1455582328],
            "bookTitle": ["How Google Works"],
            "__Author": ["Eric Schmidt, Jonathan Rosenberg"],
            "Publication (year)": [2014],
            "éditeur": ["Grand Central Publishing"],
            "Number_Of_Pages": [305],
            "★ Rating": [4.06],
        }
    )
df

[1]:

	ISBN	isbn	bookTitle	__Author	Publication (year)	éditeur	Number_Of_Pages	★ Rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

1. Default `clean_headers()`¶

By default, the case parameter is set to “snake” and the remove_accents parameter is set to True (strip accents and symbols from the column name).

[2]:

from dataprep.clean import clean_headers
clean_headers(df)

Column Headers Cleaning Report:
        8 values cleaned (100.0%)

[2]:

	isbn	isbn_1	book_title	author	publication_year	editeur	number_of_pages	rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

Note that “_1” is appended to the second instance of the column name “isbn” to distinguish it from the first instance after the transformation. Consequently, all column names are considered to have been cleaned in this example.

Column names that are duplicated as a result of calling clean_headers() are automatically renamed to append a number to the end. The suffix used to append the number is inferred from the case parameter.

2. `case` parameter¶

This section demonstrates the supported case styles.

kebab¶

[3]:

clean_headers(df, case="kebab")

Column Headers Cleaning Report:
        8 values cleaned (100.0%)

[3]:

	isbn	isbn-1	book-title	author	publication-year	editeur	number-of-pages	rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

camel¶

[4]:

clean_headers(df, case="camel")

Column Headers Cleaning Report:
        7 values cleaned (87.5%)

[4]:

	isbn	isbn1	bookTitle	author	publicationYear	editeur	numberOfPages	rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

pascal¶

[5]:

clean_headers(df, case="pascal")

Column Headers Cleaning Report:
        8 values cleaned (100.0%)

[5]:

	Isbn	Isbn1	BookTitle	Author	PublicationYear	Editeur	NumberOfPages	Rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

const¶

[6]:

clean_headers(df, case="const")

Column Headers Cleaning Report:
        7 values cleaned (87.5%)

[6]:

	ISBN	ISBN_1	BOOK_TITLE	AUTHOR	PUBLICATION_YEAR	EDITEUR	NUMBER_OF_PAGES	RATING
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

sentence¶

[7]:

clean_headers(df, case="sentence")

Column Headers Cleaning Report:
        7 values cleaned (87.5%)

[7]:

	Isbn	Isbn 1	Book title	Author	Publication (year)	Editeur	Number of pages	Rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

title¶

[8]:

clean_headers(df, case="title")

Column Headers Cleaning Report:
        7 values cleaned (87.5%)

[8]:

	Isbn	Isbn 1	Book Title	Author	Publication (year)	Editeur	Number Of Pages	Rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

lower¶

[9]:

clean_headers(df, case="lower")

Column Headers Cleaning Report:
        8 values cleaned (100.0%)

[9]:

	isbn	isbn 1	book title	author	publication (year)	editeur	number of pages	rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

upper¶

[10]:

clean_headers(df, case="upper")

Column Headers Cleaning Report:
        7 values cleaned (87.5%)

[10]:

	ISBN	ISBN 1	BOOK TITLE	AUTHOR	PUBLICATION (YEAR)	EDITEUR	NUMBER OF PAGES	RATING
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

3. `replace` parameter¶

The replace parameter takes in a dictionary of values in the column names to be replaced by new values.

[11]:

clean_headers(df, replace={"éditeur": "publisher", "★": "star"})

Column Headers Cleaning Report:
        8 values cleaned (100.0%)

[11]:

	isbn	isbn_1	book_title	author	publication_year	publisher	number_of_pages	star_rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

4. `remove_accents` parameter¶

By default, the remove_accents parameter is set to True (strip accents and symbols from the column names). If set to False, any accents or symbols are kept in.

[12]:

clean_headers(df, remove_accents=False)

Column Headers Cleaning Report:
        7 values cleaned (87.5%)

[12]:

	isbn	isbn_1	book_title	author	publication_year	éditeur	number_of_pages	★_rating
0	9781455582341	1455582328	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014	Grand Central Publishing	305	4.06

5. Null headers¶

Null column headers in the DataFrame are replaced with the default value “header”. As with other column names, duplicated values are renamed with appended numbers. Null header values include np.nan, None and the empty string.

[13]:

df = pd.DataFrame({"": [9781455582341],
                   np.nan: ["How Google Works"],
                   None: ["Eric Schmidt, Jonathan Rosenberg"],
                   "N/A": [2014],
                  })
df

[13]:

		NaN	NaN	N/A
0	9781455582341	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014

[14]:

clean_headers(df)

Column Headers Cleaning Report:
        4 values cleaned (100.0%)

[14]:

	header	header_1	header_2	n_a
0	9781455582341	How Google Works	Eric Schmidt, Jonathan Rosenberg	2014