The function clean_url() cleans a DataFrame column containing urls and extracts the important parameters, including the cleaned path, queries, scheme, etc. The function validate_url() validates either a single url or a column of urls, returning True if the value is valid and False otherwise.
clean_url() extracts the important features of the url and creates an additional column containing key value pairs of the parameters. It extracts the following features:
scheme (string)
host (string)
cleaned path (string)
queries (key-value pairs)
Remove authentication tokens: sometimes we would like to remove sensitive information that is usually contained in a url, e.g. access_tokens, user information, etc. clean_url() provides the remove_auth parameter for this purpose. The usage of all parameters is explained in depth in the sections below.
Invalid parsing is handled with the errors parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
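To make the three modes concrete, here is a minimal sketch of the same semantics using only the standard library. This is an illustration of the behaviour described above, not dataprep's implementation; the helper name parse_with_errors is hypothetical.

```python
from urllib.parse import urlparse


def parse_with_errors(value, errors="coerce"):
    """Toy illustration of the three `errors` modes (not dataprep's code).

    A value counts as parseable when it is a string with both a scheme
    and a host, e.g. "https://example.com/path".
    """
    if isinstance(value, str):
        parts = urlparse(value)
        if parts.scheme and parts.netloc:
            return parts
    if errors == "coerce":   # invalid parsing is set to NaN
        return float("nan")
    if errors == "ignore":   # invalid parsing returns the input
        return value
    # errors == "raise": invalid parsing raises an exception
    raise ValueError(f"unable to parse value: {value!r}")


# "coerce" turns the bad value into NaN, "ignore" passes it through:
print(parse_with_errors("notaurl"))                   # nan
print(parse_with_errors("notaurl", errors="ignore"))  # notaurl
```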
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
The following sections demonstrate the functionality of clean_url() and validate_url().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "url": [
        "random text which is not a url",
        "http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van",
        "notaurl",
        np.nan,
        None,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur",
        "",
        {"not_a_url": True},
        "2345678",
        345345345,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van",
    ]
})
df
By default, the parameters are set as inplace=False, split=False, remove_auth=False, report=True and errors="coerce".
[2]:
from dataprep.clean import clean_url

df_default = clean_url(df, column="url")
df_default
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), set to NaN
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
We can see that a new column, url_details, is created in the new dataframe df_default; it follows the naming convention original_column_name_details (url_details in our case).
Now let us see what one of the cells in url_details looks like.
[3]:
df_default["url_details"][1]
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'auth': 'facebookeeauth',
  'token': 'iwusdkc',
  'not_token': 'hiThere',
  'another_token': '12323423'}}
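The fields above can be approximated with the standard library's urllib.parse. The following is a rough sketch of how such a breakdown could be derived, not dataprep's implementation; the helper name url_details is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def url_details(url):
    # Split the url into its components and flatten single-valued queries.
    parts = urlparse(url)
    queries = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        # "cleaned" url: the url without its query string
        "url_clean": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "queries": queries,
    }


url_details(
    "http://www.facebookee.com/otherpath"
    "?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423"
)
```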
Sometimes we need to remove sensitive information when parsing a url. We can do this with clean_url() by setting the remove_auth parameter to True, or by specifying a list of query parameters to be removed. Hence remove_auth can be a boolean value or a list of strings.
When remove_auth is set to the boolean value True, clean_url() looks for auth tokens based on the default list of token names (provided below) and removes them. When remove_auth is set to a list of strings, the union of the user-provided list and the default list forms the set of token names to be removed.
[4]:
default_list = {
    "access_token",
    "auth_key",
    "auth",
    "password",
    "username",
    "login",
    "token",
    "passcode",
    "access-token",
    "auth-key",
    "authentication",
    "authentication-key",
}
Let's have a look at the same dataframe under the two scenarios described above (by examining the second row).
remove_auth = True
[5]:
df_remove_auth_boolean = clean_url(df, column="url", remove_auth=True)
df_remove_auth_boolean["url_details"][1]
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), set to NaN
    Removed 6 auth queries from 5 rows
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'not_token': 'hiThere', 'another_token': '12323423'}}
As we can see, the queries auth and token were removed from the result, but not_token and another_token remain; this is because auth and token appear in default_list. Also notice the additional report line giving statistics on how many queries were removed from how many rows.
[6]:
df_remove_auth_list = clean_url(df, column="url", remove_auth=["another_token"])
df_remove_auth_list["url_details"][1]
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), set to NaN
    Removed 7 auth queries from 5 rows
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'not_token': 'hiThere'}}
As we can see, the queries auth, token and another_token were removed but not_token remains in the result. This is because a new list was created as the union of default_list and the user-defined list, and queries were removed based on this combined list.
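The union behaviour described above can be sketched in plain Python. This is an illustration, not dataprep's internals; the helper name tokens_to_remove is hypothetical.

```python
# The default set of auth token names, as listed earlier in this guide.
default_list = {
    "access_token", "auth_key", "auth", "password", "username", "login",
    "token", "passcode", "access-token", "auth-key", "authentication",
    "authentication-key",
}


def tokens_to_remove(remove_auth):
    # remove_auth=True -> use the default set as-is;
    # remove_auth=[...] -> union the user list with the default set.
    if remove_auth is True:
        return set(default_list)
    return default_list | set(remove_auth)


# Passing ["another_token"] removes the defaults *plus* another_token:
combined = tokens_to_remove(["another_token"])
```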
split
The split parameter adds individual columns containing all the extracted features to the given DataFrame.
[7]:
df_remove_split = clean_url(df, column="url", split=True)
df_remove_split
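Conceptually, split=True gives each extracted feature its own column. A rough sketch of how such per-feature columns could be derived with the standard library (an illustration, not dataprep's implementation; split_columns is a hypothetical helper):

```python
from urllib.parse import urlparse, parse_qs


def split_columns(urls):
    # Build one list (column) per extracted feature.
    columns = {"scheme": [], "host": [], "url_clean": [], "queries": []}
    for url in urls:
        parts = urlparse(url)
        columns["scheme"].append(parts.scheme)
        columns["host"].append(parts.netloc)
        columns["url_clean"].append(f"{parts.scheme}://{parts.netloc}{parts.path}")
        columns["queries"].append({k: v[0] for k, v in parse_qs(parts.query).items()})
    return columns


cols = split_columns(["https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van"])
```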
inplace
Replaces the original column with original_column_name_details.
[8]:
df_remove_inplace = clean_url(df, column="url", inplace=True)
df_remove_inplace
Replaces the original column with the individual feature columns generated by the split parameter.
[9]:
df_remove_inplace_split = clean_url(df, column="url", inplace=True, split=True)
df_remove_inplace_split
errors
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
errors = "coerce"
This is the default value of the errors parameter; it sets invalid parsing to NaN.
[10]:
df_remove_errors_default = clean_url(df, column="url")
df_remove_errors_default
errors = "ignore"
This sets the value of invalid parsing to the input.
[11]:
df_remove_errors_ignore = clean_url(df, column="url", errors="ignore")
df_remove_errors_ignore
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), left unchanged
Result contains 5 (38.46%) parsed key-value pairs and 3 null values (23.08%)
errors = "raise"
This will raise a ValueError when an invalid parsing value is encountered.
report
By default report is set to True; when set to False, the statistics pertaining to the cleaning operations performed will not be displayed.
[12]:
df_remove_auth_boolean = clean_url(df, column="url", remove_auth=True, report=False)
df_remove_auth_boolean
validate_url() returns True when the input is a valid url. Otherwise it returns False.
[13]:
from dataprep.clean import validate_url

print(validate_url({"not_a_url": True}))
print(validate_url(2346789))
print(validate_url("https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur"))
print(validate_url("http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423"))
False
False
True
True
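The results above are consistent with a simplified validity rule: a value is valid when it is a string that parses with both a scheme and a host. A sketch of that rule (not dataprep's actual validation logic; is_valid_url is a hypothetical helper):

```python
from urllib.parse import urlparse


def is_valid_url(value):
    # Non-strings (dicts, numbers, NaN, None) can never be valid urls.
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    # Require both a scheme ("https") and a host ("www.sfu.ca").
    return bool(parts.scheme) and bool(parts.netloc)


print(is_valid_url({"not_a_url": True}))  # False
print(is_valid_url("notaurl"))            # False
print(is_valid_url("https://www.sfu.ca/ficticiouspath?loc=sur"))  # True
```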
[14]:
df = pd.DataFrame({
    "url": [
        "random text which is not a url",
        "http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van",
        "notaurl",
        np.nan,
        None,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur",
        "",
        {"not_a_url": True},
        "2345678",
        345345345,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van",
    ]
})
df["validate_url"] = validate_url(df["url"])
df