Column Headers

Introduction

The function clean_headers() cleans column headers in a DataFrame, and standardizes them in a desired format.

Column names can be converted to the following case styles via the case parameter:

  • snake: “column_name”

  • kebab: “column-name”

  • camel: “columnName”

  • pascal: “ColumnName”

  • const: “COLUMN_NAME”

  • sentence: “Column name”

  • title: “Column Name”

  • lower: “column name”

  • upper: “COLUMN NAME”

After cleaning, a report is printed that provides the number and percentage of values that were cleaned (the value must be transformed).

The following sections demonstrate the functionality of clean_headers().

An example dirty dataset

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
        {
            "ISBN": [9781455582341],
            "isbn": [1455582328],
            "bookTitle": ["How Google Works"],
            "__Author": ["Eric Schmidt, Jonathan Rosenberg"],
            "Publication (year)": [2014],
            "éditeur": ["Grand Central Publishing"],
            "Number_Of_Pages": [305],
            "★ Rating": [4.06],
        }
    )
df
[1]:
ISBN isbn bookTitle __Author Publication (year) éditeur Number_Of_Pages ★ Rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

1. Default clean_headers()

By default, the case parameter is set to “snake” and the remove_accents parameter is set to True (strip accents and symbols from the column name).

[2]:
from dataprep.clean import clean_headers
clean_headers(df)
Column Headers Cleaning Report:
        8 values cleaned (100.0%)
[2]:
isbn isbn_1 book_title author publication_year editeur number_of_pages rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

Note that “_1” is appended to the second instance of the column name “isbn” to distinguish it from the first instance after the transformation. Consequently, all column names are considered to have been cleaned in this example.

Column names that are duplicated as a result of calling clean_headers() are automatically renamed to append a number to the end. The suffix used to append the number is inferred from the case parameter.

2. case parameter

This section demonstrates the supported case styles.

kebab

[3]:
clean_headers(df, case="kebab")
Column Headers Cleaning Report:
        8 values cleaned (100.0%)
[3]:
isbn isbn-1 book-title author publication-year editeur number-of-pages rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

camel

[4]:
clean_headers(df, case="camel")
Column Headers Cleaning Report:
        7 values cleaned (87.5%)
[4]:
isbn isbn1 bookTitle author publicationYear editeur numberOfPages rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

pascal

[5]:
clean_headers(df, case="pascal")
Column Headers Cleaning Report:
        8 values cleaned (100.0%)
[5]:
Isbn Isbn1 BookTitle Author PublicationYear Editeur NumberOfPages Rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

const

[6]:
clean_headers(df, case="const")
Column Headers Cleaning Report:
        7 values cleaned (87.5%)
[6]:
ISBN ISBN_1 BOOK_TITLE AUTHOR PUBLICATION_YEAR EDITEUR NUMBER_OF_PAGES RATING
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

sentence

[7]:
clean_headers(df, case="sentence")
Column Headers Cleaning Report:
        7 values cleaned (87.5%)
[7]:
Isbn Isbn 1 Book title Author Publication (year) Editeur Number of pages Rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

title

[8]:
clean_headers(df, case="title")
Column Headers Cleaning Report:
        7 values cleaned (87.5%)
[8]:
Isbn Isbn 1 Book Title Author Publication (year) Editeur Number Of Pages Rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

lower

[9]:
clean_headers(df, case="lower")
Column Headers Cleaning Report:
        8 values cleaned (100.0%)
[9]:
isbn isbn 1 book title author publication (year) editeur number of pages rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

upper

[10]:
clean_headers(df, case="upper")
Column Headers Cleaning Report:
        7 values cleaned (87.5%)
[10]:
ISBN ISBN 1 BOOK TITLE AUTHOR PUBLICATION (YEAR) EDITEUR NUMBER OF PAGES RATING
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

3. replace parameter

The replace parameter takes in a dictionary of values in the column names to be replaced by new values.

[11]:
clean_headers(df, replace={"éditeur": "publisher", "★": "star"})
Column Headers Cleaning Report:
        8 values cleaned (100.0%)
[11]:
isbn isbn_1 book_title author publication_year publisher number_of_pages star_rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

4. remove_accents parameter

By default, the remove_accents parameter is set to True (strip accents and symbols from the column names). If set to False, any accents or symbols are kept in.

[12]:
clean_headers(df, remove_accents=False)
Column Headers Cleaning Report:
        7 values cleaned (87.5%)
[12]:
isbn isbn_1 book_title author publication_year éditeur number_of_pages ★_rating
0 9781455582341 1455582328 How Google Works Eric Schmidt, Jonathan Rosenberg 2014 Grand Central Publishing 305 4.06

5. Null headers

Null column headers in the DataFrame are replaced with the default value “header”. As with other column names, duplicated values are renamed with appended numbers. Null header values include np.nan, None and the empty string.

[13]:
df = pd.DataFrame({"": [9781455582341],
                   np.nan: ["How Google Works"],
                   None: ["Eric Schmidt, Jonathan Rosenberg"],
                   "N/A": [2014],
                  })
df
[13]:
NaN NaN N/A
0 9781455582341 How Google Works Eric Schmidt, Jonathan Rosenberg 2014
[14]:
clean_headers(df)
Column Headers Cleaning Report:
        4 values cleaned (100.0%)
[14]:
header header_1 header_2 n_a
0 9781455582341 How Google Works Eric Schmidt, Jonathan Rosenberg 2014