Duplicate Values¶

Introduction¶

The function clean_duplication() creates a user interface that clusters duplicate values and allows the user to merge them into standardized values. The following clustering methods are provided:

fingerprint¶

This is the process for creating a fingerprint key:

remove leading and trailing whitespace
convert to lowercase
remove punctuation and control characters
normalize extended western characters to ASCII
split into whitespace separated tokens
sort tokens and remove duplicates
join tokens back together

ngram-fingerprint¶

This is the process for creating a n-gram fingerprint key:

convert to lowercase
remove punctuation, whitespace and control characters
get string n-grams
sort n-grams and remove duplicates
join sorted n grams back together
normalize extended western characters to ASCII

A textbox is provided for choosing the n-gram size.

phonetic-fingerprint¶

Uses the double metaphone algorithm for generating phonetic-fingerprint keys. The metaphone library is used.

levenshtein distance¶

Blocking is used to speed up the process, blocks are obtained where strings in the same block share a substring of a given blocking size. Only strings within the same block are compared using the levenshtein distance function. If two values have a distance less than or equal to the given radius they are added to the same cluster. Textboxes are provided for choosing the block size and the radius.

The Levenshtein library is used for a fast levenshtein distance implementation.

Clustering methods are taken from the OpenRefine project and the simile-vicino project, you can read more about these clustering methods here.

The df_var_name parameter sets the variable name to be used for the dataframe when creating replace function calls.

The page_size parameter can be used to set the number of clusters that are displayed on each page of the user interface.

Most of the functionality is provided through an interactive user interface which will be introduced shortly.

An example dirty dataset¶

[1]:

import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "city": [
            "Québec",
            "Quebec",
            "Vancouver",
            "Vancouver",
            "vancouver",
            " Vancuver ",
            "Toronto",
            "Toront",
            "Tronto",
            "Ottowa",
            "otowa"
        ]
    }
)
df

[1]:

	city
0	Québec
1	Quebec
2	Vancouver
3	Vancouver
4	vancouver
5	Vancuver
6	Toronto
7	Toront
8	Tronto
9	Ottowa
10	otowa

[2]:

import pandas as pd
import numpy as np
cities = pd.DataFrame(
    {
        "city": [
            "Québec",
            "Quebec",
            "Vancouver",
            "Vancouver",
            "vancouver",
            " Vancuver ",
            "Toronto",
            "Toront",
            "Tronto",
            "Ottowa",
            "otowa"
        ]
    }
)
cities

[2]:

	city
0	Québec
1	Quebec
2	Vancouver
3	Vancouver
4	vancouver
5	Vancuver
6	Toronto
7	Toront
8	Tronto
9	Ottowa
10	otowa

1. Default `clean_duplication()`¶

By default the df_var_name parameter equals to default, which means the prefix of the final result DataFrame is the same with the name of input dataframe. And the page_size variable is set to 5. Clustering methods can be toggled using the dropdown menu at the top of the GUI. Select which clusters you would like to merge using the checkboxes under the “Merge?” heading. Then press the “Merge and Re-Cluster” button to merge the cluster. If the “export code” checkbox is selected, code for merging the clusters will be created and added to the notebook cell. Finally, you can press the “finish” button to close the GUI and see the final DataFrame created.

[3]:

from dataprep.clean import clean_duplication
clean_duplication(df, "city")

[4]:

from dataprep.clean import clean_duplication
clean_duplication(cities, "city")

2. `df_var_name` parameter¶

Pandas Series.replace function calls are created and added to the current notebook cell when merging a cluster with the “export code” checkbox selected. This parameter allows for changing the DataFrame variable name used in the exported code.

[5]:

clean_duplication(df, "city", df_var_name="dataframe")

3. `page_size` parameter¶

This parameter allows for changing the number of clusters that are displayed on each page of the user interface.

[6]:

clean_duplication(df, "city", page_size=1)