The function clean_text() cleans text data in a DataFrame column.
Using a default or customized pipeline, the function performs a series of cleaning operations on the data.
The following sections demonstrate the functionality of clean_text().
[1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_colwidth", None)

df = pd.DataFrame(
    {
        "text": [
            "'ZZZZZ!' If IMDb would allow one-word reviews, that's what mine would be.",
            "The cast played Shakespeare.<br /><br />Shakespeare lost.",
            "Simon of the Desert (Simón del desierto) is a 1965 film directed by Luis Buñuel.",
            "[SPOILERS]\nI don't think I've seen a film this bad before {acting, script, effects (!), etc...}",
            "<a href='/festivals/cannes-1968-a-video-essay'>Cannes 1968:\tA video essay</a>",
            "Recap thread for @RottenTomatoes excellent panel, hosted by @ErikDavis with @FilmFatale_NYC and @AshCrossan.",
            "#GameOfThrones: Season 8 is #Rotten at 54% on the #Tomatometer. But does it deserve to be?",
            "Come join and share your thoughts on this week's episode: https://twitter.com/i/spaces/1fake2URL3",
            123,
            np.nan,
            "NULL",
        ]
    }
)
df
The default pipeline for the clean_text() function is the following:
fillna: Replace all null values with NaN.
lowercase: Convert all characters to lowercase.
remove_digits: Remove numbers.
remove_html: Remove HTML tags.
remove_urls: Remove URLs.
remove_punctuation: Remove punctuation marks.
remove_accents: Remove accent marks.
remove_stopwords: Remove stopwords.
remove_whitespace: Remove extra spaces, along with tabs and newlines.
[2]:
from dataprep.clean import clean_text

clean_text(df, "text")
By default, the stopwords removed are the set of words in NLTK’s English stopwords. To remove a different set of words, pass the set into the stopwords parameter.
[3]:
from dataprep.clean import clean_text

clean_text(df, "text", stopwords={"imdb", "film"})
Users can pass in a custom pipeline to clean_text() using the pipeline parameter.
[4]:
custom_pipeline = [
    {"operator": "lowercase"},
    {"operator": "remove_digits"},
    {"operator": "remove_whitespace"},
]
clean_text(df, "text", pipeline=custom_pipeline)
Users can also define and pass in their own functions using the pipeline parameter.
[5]:
import re
from typing import List

def split(text: str) -> List[str]:
    return str(text).split()

def replace_z(text: str, value: str) -> str:
    return re.sub(r"z", value, str(text), flags=re.I)

custom_pipeline = [
    {"operator": "lowercase"},
    {"operator": "remove_digits"},
    {"operator": split},
    {"operator": replace_z, "parameters": {"value": "*"}},
    {"operator": "remove_whitespace"},
]
clean_text(df, "text", pipeline=custom_pipeline)
In general, custom pipelines can be defined using the form:
[6]:
custom_pipeline = [
    {
        "operator": "<operator_name>",
        "parameters": {"<parameter_name>": "<parameter_value>"},
    }
]
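User-defined operators follow the same form, with the function object in place of the operator name. For example, the replace_z function defined above would be registered as:

custom_pipeline = [
    {
        "operator": replace_z,  # a user-defined function (see cell [5])
        "parameters": {"value": "*"},
    }
]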
To get the default pipeline in the form of a list, call default_text_pipeline().
This can be used as a template to build a list of cleaning operations to be passed into the pipeline parameter.
[7]:
from dataprep.clean import default_text_pipeline

default_text_pipeline()
[{'operator': 'fillna'},
 {'operator': 'lowercase'},
 {'operator': 'remove_digits'},
 {'operator': 'remove_html'},
 {'operator': 'remove_urls'},
 {'operator': 'remove_punctuation'},
 {'operator': 'remove_accents'},
 {'operator': 'remove_stopwords', 'parameters': {'stopwords': None}},
 {'operator': 'remove_whitespace'}]
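For example, a minimal sketch of using this list as a template: drop the lowercase step from the default pipeline and pass the rest to clean_text().

from dataprep.clean import clean_text, default_text_pipeline

# Keep every default operation except lowercase.
custom_pipeline = [
    op for op in default_text_pipeline() if op["operator"] != "lowercase"
]
clean_text(df, "text", pipeline=custom_pipeline)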
This section demonstrates the built-in cleaning operations which can be called using the pipeline parameter.
clean_text() assumes the DataFrame column contains text data. As such, any int values will be cast to str after applying a cleaning function.
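For example, the sample df above holds the int 123 in row 8; after any cleaning operation it comes back as a string. A quick check, with the expected result shown as a comment:

out = clean_text(df, "text", pipeline=[{"operator": "lowercase"}])
type(out.loc[8, "text"])  # expected: str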
By default, fillna replaces all null values with NaN.
[8]:
custom_pipeline = [{"operator": "fillna"}]
clean_text(df, "text", pipeline=custom_pipeline)
To replace null values with a specific value, use the value parameter.
[9]:
custom_pipeline = [{"operator": "fillna", "parameters": {"value": "<NAN>"}}]
clean_text(df, "text", pipeline=custom_pipeline)
The lowercase operation converts all characters to lowercase.
[10]:
custom_pipeline = [{"operator": "lowercase"}]
clean_text(df, "text", pipeline=custom_pipeline)
The sentence_case operation converts the first character of the string to uppercase and all remaining characters to lowercase.
[11]:
custom_pipeline = [{"operator": "sentence_case"}]
clean_text(df, "text", pipeline=custom_pipeline)
The title_case operation converts the first character of each word to uppercase and the remaining characters to lowercase.
[12]:
custom_pipeline = [{"operator": "title_case"}]
clean_text(df, "text", pipeline=custom_pipeline)
The uppercase operation converts all characters to uppercase.
[13]:
custom_pipeline = [{"operator": "uppercase"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_accents operation removes accents (diacritic marks) from the text.
[14]:
custom_pipeline = [{"operator": "remove_accents"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_bracketed operation removes text between brackets. The style of the brackets can be specified using the brackets parameter:

“angle”: <>
“curly”: {}
“round”: ()
“square”: []
By default, the inclusive parameter is set to True and the brackets are removed along with the text in between.
[15]:
custom_pipeline = [
    {"operator": "remove_bracketed", "parameters": {"brackets": "round"}}
]
clean_text(df, "text", pipeline=custom_pipeline)
To remove the text but keep the brackets, set inclusive to False.
[16]:
custom_pipeline = [
    {
        "operator": "remove_bracketed",
        "parameters": {"brackets": "round", "inclusive": False},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The brackets parameter can also take in a set, which allows multiple bracket styles to be specified at a time.
[17]:
custom_pipeline = [
    {
        "operator": "remove_bracketed",
        "parameters": {"brackets": {"angle", "curly", "round", "square"}},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_digits operation removes all digits.
[18]:
custom_pipeline = [{"operator": "remove_digits"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_html operation removes HTML tags, including the non-breaking space entity &nbsp;.
[19]:
custom_pipeline = [{"operator": "remove_html"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_prefixed operation removes substrings that start with the prefix(es) specified in the prefix parameter.
[20]:
custom_pipeline = [{"operator": "remove_prefixed", "parameters": {"prefix": "#"}}]
clean_text(df, "text", pipeline=custom_pipeline)
To specify multiple prefixes, pass in a set of the prefixes to the prefix parameter.
[21]:
custom_pipeline = [
    {"operator": "remove_prefixed", "parameters": {"prefix": {"#", "@"}}}
]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_punctuation operation removes all punctuation marks defined in Python’s string.punctuation.
[22]:
custom_pipeline = [{"operator": "remove_punctuation"}]
clean_text(df, "text", pipeline=custom_pipeline)
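For reference, the characters in string.punctuation from Python's standard library are:

import string

string.punctuation
# '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'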
The remove_stopwords operation removes common words. By default, the set of stopwords to remove is NLTK’s English stopwords.
[23]:
custom_pipeline = [{"operator": "remove_stopwords"}]
clean_text(df, "text", pipeline=custom_pipeline)
To use a custom set of words, pass the set into the stopwords parameter.
[24]:
custom_pipeline = [
    {"operator": "remove_stopwords", "parameters": {"stopwords": {"imdb", "film"}}}
]
clean_text(df, "text", pipeline=custom_pipeline)
Alternatively, expand upon the default set of stopwords by importing dataprep.assets.english_stopwords and adding custom words.
[25]:
from dataprep.assets.english_stopwords import english_stopwords

custom_stopwords = english_stopwords.copy()
custom_stopwords.add("imdb")
custom_stopwords.add("film")

custom_pipeline = [
    {
        "operator": "remove_stopwords",
        "parameters": {"stopwords": custom_stopwords},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_urls operation removes URLs. Substrings that start with “http” or “www” are considered URLs.
[26]:
custom_pipeline = [{"operator": "remove_urls"}]
clean_text(df, "text", pipeline=custom_pipeline)
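The documented matching rule corresponds roughly to the following regular expression; this is an illustrative sketch, not dataprep's actual pattern:

import re

# Match any whitespace-free run that starts with "http" or "www".
URL_SKETCH = re.compile(r"(?:http|www)\S+")
URL_SKETCH.sub("", "Come join: https://twitter.com/i/spaces/1fake2URL3")
# expected: 'Come join: '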
The remove_whitespace operation removes extra spaces (two or more), along with tabs and newlines. Leading and trailing spaces are also removed.
[27]:
custom_pipeline = [{"operator": "remove_whitespace"}]
clean_text(df, "text", pipeline=custom_pipeline)
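A rough plain-Python equivalent of this operation (a sketch of the documented behavior, not dataprep's implementation; the helper name is illustrative):

import re

def collapse_whitespace(text: str) -> str:
    # Collapse runs of spaces, tabs, and newlines to a single space,
    # then strip leading/trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()

collapse_whitespace("Cannes 1968:\tA video essay  ")
# expected: 'Cannes 1968: A video essay'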
The replace_bracketed operation replaces text between brackets with the value parameter. By default, the inclusive parameter is set to True and the brackets are replaced by the value along with the text in between.
[28]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {"brackets": "square", "value": "**SPOILERS**"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace the text but keep the brackets, set inclusive to False.
[29]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": "square",
            "value": "**SPOILERS**",
            "inclusive": False,
        },
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
As with remove_bracketed, the brackets parameter can take in a set, replacing text in multiple bracket styles with the same value.

[30]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": {"angle", "curly", "round", "square"},
            "value": "<REDACTED>",
        },
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To assign different replacement values to different bracket styles, chain together replace_bracketed operations.
[31]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": "square",
            "value": "**SPOILERS**",
        },
    },
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": "curly",
            "value": "in every aspect.",
        },
    },
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_digits operation replaces all digits with the value. By default, the block parameter is set to True and only blocks of digits, i.e., tokens composed solely of digits, are replaced.
[32]:
custom_pipeline = [{"operator": "replace_digits", "parameters": {"value": "X"}}]
clean_text(df, "text", pipeline=custom_pipeline)
To replace all digits appearing in the text, set block to False.
[33]:
custom_pipeline = [
    {"operator": "replace_digits", "parameters": {"value": "X", "block": False}}
]
clean_text(df, "text", pipeline=custom_pipeline)
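The difference between the two settings can be sketched with plain regular expressions; this illustrates the documented semantics, not dataprep's implementation:

import re

text = "Season 8 is 54% Rotten"

# block=True (default): only whitespace-delimited tokens made up entirely of digits
re.sub(r"(?<!\S)\d+(?!\S)", "X", text)
# expected: 'Season X is 54% Rotten'

# block=False: every run of digits, wherever it appears
re.sub(r"\d+", "X", text)
# expected: 'Season X is X% Rotten'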
The replace_prefixed operation replaces all substrings that start with the prefix(es) specified in the prefix parameter with the value.
[34]:
custom_pipeline = [
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": "#", "value": "<HASHTAG>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace substrings of different prefixes with the same value, pass in a set of the prefixes to the prefix parameter.
[35]:
custom_pipeline = [
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": {"#", "@"}, "value": "<TAG>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace different prefixed substrings with different values, chain together replace_prefixed operations.
[36]:
custom_pipeline = [
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": "#", "value": "<HASHTAG>"},
    },
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": "@", "value": "<MENTION>"},
    },
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_punctuation operation replaces all punctuation marks defined in Python’s string.punctuation with the value.
custom_pipeline = [
    {"operator": "replace_punctuation", "parameters": {"value": "<PUNC>"}}
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_stopwords operation replaces common words with the value. By default, the set of stopwords to replace is NLTK’s English stopwords.
[38]:
custom_pipeline = [{"operator": "replace_stopwords", "parameters": {"value": "<S>"}}]
clean_text(df, "text", pipeline=custom_pipeline)
To replace a custom set of words, pass the set into the stopwords parameter.

[39]:
custom_pipeline = [
    {
        "operator": "replace_stopwords",
        "parameters": {"stopwords": {"imdb", "film"}, "value": "<S>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
Alternatively, as with remove_stopwords, expand upon the default set of stopwords by importing dataprep.assets.english_stopwords and adding custom words.

[40]:
from dataprep.assets.english_stopwords import english_stopwords

custom_stopwords = english_stopwords.copy()
custom_stopwords.add("imdb")
custom_stopwords.add("film")

custom_pipeline = [
    {
        "operator": "replace_stopwords",
        "parameters": {"stopwords": custom_stopwords, "value": "<S>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_text operation replaces a sequence of characters with another according to the mapping specified in the value parameter. By default, block is set to True and only blocks of text, i.e., tokens composed solely of the specified sequence of characters, are replaced.
[41]:
custom_pipeline = [
    {
        "operator": "replace_text",
        "parameters": {"value": {"imdb": "Netflix", "film": "movie"}},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace the sequences of characters wherever they appear in the text, set block to False.
[42]:
custom_pipeline = [
    {
        "operator": "replace_text",
        "parameters": {"value": {"imdb": "Netflix", "film": "movie"}, "block": False},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_urls operation replaces URLs with the value. Substrings that start with “http” or “www” are considered URLs.
[43]:
custom_pipeline = [{"operator": "replace_urls", "parameters": {"value": "<URL>"}}]
clean_text(df, "text", pipeline=custom_pipeline)