The function clean_text() cleans text data in a DataFrame column.
Using a default or customized pipeline, the function performs a series of cleaning operations on the data.
The following sections demonstrate the functionality of clean_text().
[1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_colwidth", None)

df = pd.DataFrame(
    {
        "text": [
            "'ZZZZZ!' If IMDb would allow one-word reviews, that's what mine would be.",
            "The cast played Shakespeare.<br /><br />Shakespeare lost.",
            "Simon of the Desert (Simón del desierto) is a 1965 film directed by Luis Buñuel.",
            "[SPOILERS]\nI don't think I've seen a film this bad before {acting, script, effects (!), etc...}",
            "<a href='/festivals/cannes-1968-a-video-essay'>Cannes 1968:\tA video essay</a>",
            "Recap thread for @RottenTomatoes excellent panel, hosted by @ErikDavis with @FilmFatale_NYC and @AshCrossan.",
            "#GameOfThrones: Season 8 is #Rotten at 54% on the #Tomatometer. But does it deserve to be?",
            "Come join and share your thoughts on this week's episode: https://twitter.com/i/spaces/1fake2URL3",
            123,
            np.nan,
            "NULL",
        ]
    }
)
df
The default pipeline for the clean_text() function is the following:
fillna: Replace all null values with NaN.
lowercase: Convert all characters to lowercase.
remove_digits: Remove numbers.
remove_html: Remove HTML tags.
remove_urls: Remove URLs.
remove_punctuation: Remove punctuation marks.
remove_accents: Remove accent marks.
remove_stopwords: Remove stopwords.
remove_whitespace: Remove extra spaces, along with tabs and newlines.
[2]:
from dataprep.clean import clean_text

clean_text(df, "text")
By default, the stopwords removed are the set of words in NLTK’s English stopwords. To remove a different set of words, pass the set into the stopwords parameter.
[3]:
from dataprep.clean import clean_text

clean_text(df, "text", stopwords={"imdb", "film"})
Users can pass in a custom pipeline to clean_text() using the pipeline parameter.
[4]:
custom_pipeline = [
    {"operator": "lowercase"},
    {"operator": "remove_digits"},
    {"operator": "remove_whitespace"},
]
clean_text(df, "text", pipeline=custom_pipeline)
Users can also define and pass in their own functions using the pipeline parameter.
[5]:
import re
from typing import List

def split(text: str) -> List[str]:
    return str(text).split()

def replace_z(text: str, value: str) -> str:
    return re.sub(r"z", value, str(text), flags=re.I)

custom_pipeline = [
    {"operator": "lowercase"},
    {"operator": "remove_digits"},
    {"operator": split},
    {"operator": replace_z, "parameters": {"value": "*"}},
    {"operator": "remove_whitespace"},
]
clean_text(df, "text", pipeline=custom_pipeline)
In general, custom pipelines can be defined using the form:
[6]:
custom_pipeline = [
    {
        "operator": "<operator_name>",
        "parameters": {"<parameter_name>": "<parameter_value>"},
    }
]
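User-defined operators follow the same form, with the function object in place of the operator name. For example, the replace_z function defined above would be registered as:

custom_pipeline = [
    {
        "operator": replace_z,  # a user-defined function (see cell [5])
        "parameters": {"value": "*"},
    }
]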
To get the default pipeline in the form of a list, call default_text_pipeline().
This can be used as a template to build a list of cleaning operations to be passed into the pipeline parameter.
[7]:
from dataprep.clean import default_text_pipeline

default_text_pipeline()
[{'operator': 'fillna'},
 {'operator': 'lowercase'},
 {'operator': 'remove_digits'},
 {'operator': 'remove_html'},
 {'operator': 'remove_urls'},
 {'operator': 'remove_punctuation'},
 {'operator': 'remove_accents'},
 {'operator': 'remove_stopwords', 'parameters': {'stopwords': None}},
 {'operator': 'remove_whitespace'}]
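For example, a minimal sketch of using this list as a template: drop the lowercase step from the default pipeline and pass the rest to clean_text().

from dataprep.clean import clean_text, default_text_pipeline

# Keep every default operation except lowercase.
custom_pipeline = [
    op for op in default_text_pipeline() if op["operator"] != "lowercase"
]
clean_text(df, "text", pipeline=custom_pipeline)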
This section demonstrates the built-in cleaning operations which can be called using the pipeline parameter.
clean_text() assumes the DataFrame column contains text data. As such, any int values will be cast to str after applying a cleaning function.
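For example, the sample df above holds the int 123 in row 8; after any cleaning operation it comes back as a string. A quick check, with the expected result shown as a comment:

out = clean_text(df, "text", pipeline=[{"operator": "lowercase"}])
type(out.loc[8, "text"])  # expected: str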
By default, fillna replaces all null values with NaN.
[8]:
custom_pipeline = [{"operator": "fillna"}]
clean_text(df, "text", pipeline=custom_pipeline)
To replace null values with a specific value, use the value parameter.
[9]:
custom_pipeline = [{"operator": "fillna", "parameters": {"value": "<NAN>"}}]
clean_text(df, "text", pipeline=custom_pipeline)
The lowercase operation converts all characters to lowercase.
[10]:
custom_pipeline = [{"operator": "lowercase"}]
clean_text(df, "text", pipeline=custom_pipeline)
The sentence_case operation converts the first character of the string to uppercase and all remaining characters to lowercase.
[11]:
custom_pipeline = [{"operator": "sentence_case"}]
clean_text(df, "text", pipeline=custom_pipeline)
The title_case operation converts the first character of each word to uppercase and the remaining characters to lowercase.
[12]:
custom_pipeline = [{"operator": "title_case"}]
clean_text(df, "text", pipeline=custom_pipeline)
The uppercase operation converts all characters to uppercase.
[13]:
custom_pipeline = [{"operator": "uppercase"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_accents operation removes accents (diacritic marks) from the text.
[14]:
custom_pipeline = [{"operator": "remove_accents"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_bracketed operation removes text between brackets. The style of the brackets can be specified using the brackets parameter:

“angle”: <>
“curly”: {}
“round”: ()
“square”: []
By default, the inclusive parameter is set to True and the brackets are removed along with the text in between.
[15]:
custom_pipeline = [
    {"operator": "remove_bracketed", "parameters": {"brackets": "round"}}
]
clean_text(df, "text", pipeline=custom_pipeline)
To remove the text but keep the brackets, set inclusive to False.
[16]:
custom_pipeline = [
    {
        "operator": "remove_bracketed",
        "parameters": {"brackets": "round", "inclusive": False},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The brackets parameter can also take in a set, which allows multiple bracket styles to be specified at a time.
[17]:
custom_pipeline = [
    {
        "operator": "remove_bracketed",
        "parameters": {"brackets": {"angle", "curly", "round", "square"}},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_digits operation removes all digits.
[18]:
custom_pipeline = [{"operator": "remove_digits"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_html operation removes HTML tags, including the non-breaking space entity &nbsp;.
[19]:
custom_pipeline = [{"operator": "remove_html"}]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_prefixed operation removes substrings that start with the prefix(es) specified in the prefix parameter.
[20]:
custom_pipeline = [{"operator": "remove_prefixed", "parameters": {"prefix": "#"}}]
clean_text(df, "text", pipeline=custom_pipeline)
To specify multiple prefixes, pass in a set of the prefixes to the prefix parameter.
[21]:
custom_pipeline = [
    {"operator": "remove_prefixed", "parameters": {"prefix": {"#", "@"}}}
]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_punctuation operation removes all punctuation marks defined in Python’s string.punctuation.
[22]:
custom_pipeline = [{"operator": "remove_punctuation"}]
clean_text(df, "text", pipeline=custom_pipeline)
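For reference, the characters in string.punctuation from Python's standard library are:

import string

string.punctuation
# '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'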
The remove_stopwords operation removes common words. By default, the set of stopwords to remove is NLTK’s English stopwords.
[23]:
custom_pipeline = [{"operator": "remove_stopwords"}]
clean_text(df, "text", pipeline=custom_pipeline)
To use a custom set of words, pass the set into the stopwords parameter.
[24]:
custom_pipeline = [
    {"operator": "remove_stopwords", "parameters": {"stopwords": {"imdb", "film"}}}
]
clean_text(df, "text", pipeline=custom_pipeline)
Alternatively, expand upon the default set of stopwords by importing dataprep.assets.english_stopwords and adding custom words.
[25]:
from dataprep.assets.english_stopwords import english_stopwords

custom_stopwords = english_stopwords.copy()
custom_stopwords.add("imdb")
custom_stopwords.add("film")

custom_pipeline = [
    {
        "operator": "remove_stopwords",
        "parameters": {"stopwords": custom_stopwords},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The remove_urls operation removes URLs. Substrings that start with “http” or “www” are considered URLs.
[26]:
custom_pipeline = [{"operator": "remove_urls"}]
clean_text(df, "text", pipeline=custom_pipeline)
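The documented matching rule corresponds roughly to the following regular expression; this is an illustrative sketch, not dataprep's actual pattern:

import re

# Match any whitespace-free run that starts with "http" or "www".
URL_SKETCH = re.compile(r"(?:http|www)\S+")
URL_SKETCH.sub("", "Come join: https://twitter.com/i/spaces/1fake2URL3")
# expected: 'Come join: '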
The remove_whitespace operation removes extra spaces (two or more), along with tabs and newlines. Leading and trailing spaces are also removed.
[27]:
custom_pipeline = [{"operator": "remove_whitespace"}]
clean_text(df, "text", pipeline=custom_pipeline)
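A rough plain-Python equivalent of this operation (a sketch of the documented behavior, not dataprep's implementation; the helper name is illustrative):

import re

def collapse_whitespace(text: str) -> str:
    # Collapse runs of spaces, tabs, and newlines to a single space,
    # then strip leading/trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()

collapse_whitespace("Cannes 1968:\tA video essay  ")
# expected: 'Cannes 1968: A video essay'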
The replace_bracketed operation replaces text between brackets with the value parameter. By default, the inclusive parameter is set to True and the brackets are replaced by the value along with the text in between.
[28]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {"brackets": "square", "value": "**SPOILERS**"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace the text but keep the brackets, set inclusive to False.
[29]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": "square",
            "value": "**SPOILERS**",
            "inclusive": False,
        },
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
As with remove_bracketed, the brackets parameter can take in a set, replacing text in multiple bracket styles with the same value.

[30]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": {"angle", "curly", "round", "square"},
            "value": "<REDACTED>",
        },
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To assign different replacement values to different bracket styles, chain together replace_bracketed operations.
[31]:
custom_pipeline = [
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": "square",
            "value": "**SPOILERS**",
        },
    },
    {
        "operator": "replace_bracketed",
        "parameters": {
            "brackets": "curly",
            "value": "in every aspect.",
        },
    },
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_digits operation replaces all digits with the value. By default, the block parameter is set to True and only blocks of digits, i.e., tokens composed solely of digits, are replaced.
[32]:
custom_pipeline = [{"operator": "replace_digits", "parameters": {"value": "X"}}]
clean_text(df, "text", pipeline=custom_pipeline)
To replace all digits appearing in the text, set block to False.
[33]:
custom_pipeline = [
    {"operator": "replace_digits", "parameters": {"value": "X", "block": False}}
]
clean_text(df, "text", pipeline=custom_pipeline)
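The difference between the two settings can be sketched with plain regular expressions; this illustrates the documented semantics, not dataprep's implementation:

import re

text = "Season 8 is 54% Rotten"

# block=True (default): only whitespace-delimited tokens made up entirely of digits
re.sub(r"(?<!\S)\d+(?!\S)", "X", text)
# expected: 'Season X is 54% Rotten'

# block=False: every run of digits, wherever it appears
re.sub(r"\d+", "X", text)
# expected: 'Season X is X% Rotten'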
The replace_prefixed operation replaces all substrings that start with the prefix(es) specified in the prefix parameter with the value.
[34]:
custom_pipeline = [
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": "#", "value": "<HASHTAG>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace substrings of different prefixes with the same value, pass in a set of the prefixes to the prefix parameter.
[35]:
custom_pipeline = [
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": {"#", "@"}, "value": "<TAG>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace different prefixed substrings with different values, chain together replace_prefixed operations.
[36]:
custom_pipeline = [
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": "#", "value": "<HASHTAG>"},
    },
    {
        "operator": "replace_prefixed",
        "parameters": {"prefix": "@", "value": "<MENTION>"},
    },
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_punctuation operation replaces all punctuation marks defined in Python’s string.punctuation with the value.
custom_pipeline = [
    {"operator": "replace_punctuation", "parameters": {"value": "<PUNC>"}}
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_stopwords operation replaces common words with the value. By default, the set of stopwords to replace is NLTK’s English stopwords.
[38]:
custom_pipeline = [{"operator": "replace_stopwords", "parameters": {"value": "<S>"}}]
clean_text(df, "text", pipeline=custom_pipeline)
To replace a custom set of words, pass the set into the stopwords parameter.

[39]:
custom_pipeline = [
    {
        "operator": "replace_stopwords",
        "parameters": {"stopwords": {"imdb", "film"}, "value": "<S>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
Alternatively, as with remove_stopwords, expand upon the default set of stopwords by importing dataprep.assets.english_stopwords and adding custom words.

[40]:
from dataprep.assets.english_stopwords import english_stopwords

custom_stopwords = english_stopwords.copy()
custom_stopwords.add("imdb")
custom_stopwords.add("film")

custom_pipeline = [
    {
        "operator": "replace_stopwords",
        "parameters": {"stopwords": custom_stopwords, "value": "<S>"},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_text operation replaces a sequence of characters with another according to the mapping specified in the value parameter. By default, block is set to True and only blocks of text, i.e., tokens composed solely of the specified sequence of characters, are replaced.
[41]:
custom_pipeline = [
    {
        "operator": "replace_text",
        "parameters": {"value": {"imdb": "Netflix", "film": "movie"}},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
To replace the sequences of characters wherever they appear in the text, set block to False.
[42]:
custom_pipeline = [
    {
        "operator": "replace_text",
        "parameters": {"value": {"imdb": "Netflix", "film": "movie"}, "block": False},
    }
]
clean_text(df, "text", pipeline=custom_pipeline)
The replace_urls operation replaces URLs with the value. Substrings that start with “http” or “www” are considered URLs.
[43]:
custom_pipeline = [{"operator": "replace_urls", "parameters": {"value": "<URL>"}}]
clean_text(df, "text", pipeline=custom_pipeline)