The function clean_url() cleans a DataFrame column containing urls and extracts the important parameters, including the cleaned path, queries, scheme, etc. The function validate_url() validates either a single url or a column of urls, returning True if the value is valid and False otherwise.
clean_url() extracts the important features of the url and creates an additional column containing key value pairs of the parameters. It extracts the following features:
scheme (string)
host (string)
cleaned path (string)
queries (key-value pairs)
Remove authentication tokens: sometimes we would like to remove sensitive information that is usually contained in a url, e.g. access_tokens, user information, etc. clean_url() provides the remove_auth parameter for this purpose. The usage of all parameters is explained in depth in the sections below.
Invalid parsing is handled with the errors parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
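To make the three modes concrete, here is a minimal sketch of the same semantics using only the standard library. This is an illustration of the behaviour described above, not dataprep's implementation; the helper name parse_with_errors is hypothetical.

```python
from urllib.parse import urlparse


def parse_with_errors(value, errors="coerce"):
    """Toy illustration of the three `errors` modes (not dataprep's code).

    A value counts as parseable when it is a string with both a scheme
    and a host, e.g. "https://example.com/path".
    """
    if isinstance(value, str):
        parts = urlparse(value)
        if parts.scheme and parts.netloc:
            return parts
    if errors == "coerce":   # invalid parsing is set to NaN
        return float("nan")
    if errors == "ignore":   # invalid parsing returns the input
        return value
    # errors == "raise": invalid parsing raises an exception
    raise ValueError(f"unable to parse value: {value!r}")


# "coerce" turns the bad value into NaN, "ignore" passes it through:
print(parse_with_errors("notaurl"))                   # nan
print(parse_with_errors("notaurl", errors="ignore"))  # notaurl
```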
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
The following sections demonstrate the functionality of clean_url() and validate_url().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "url": [
        "random text which is not a url",
        "http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van",
        "notaurl",
        np.nan,
        None,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur",
        "",
        {"not_a_url": True},
        "2345678",
        345345345,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van",
    ]
})
df
By default, the parameters are set as inplace=False, split=False, remove_auth=False, report=True and errors="coerce".
[2]:
from dataprep.clean import clean_url

df_default = clean_url(df, column="url")
df_default
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), set to NaN
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
We can see that a new column, url_details, is created in the new dataframe df_default; it follows the naming convention original_column_name_details (url_details in our case).
Now let us see what one of the cells in url_details looks like.
[3]:
df_default["url_details"][1]
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'auth': 'facebookeeauth',
  'token': 'iwusdkc',
  'not_token': 'hiThere',
  'another_token': '12323423'}}
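The fields above can be approximated with the standard library's urllib.parse. The following is a rough sketch of how such a breakdown could be derived, not dataprep's implementation; the helper name url_details is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def url_details(url):
    # Split the url into its components and flatten single-valued queries.
    parts = urlparse(url)
    queries = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        # "cleaned" url: the url without its query string
        "url_clean": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "queries": queries,
    }


url_details(
    "http://www.facebookee.com/otherpath"
    "?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423"
)
```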
Sometimes we need to remove sensitive information when parsing a url. We can do this with clean_url() by setting the remove_auth parameter to True, or by specifying a list of query parameters to be removed. Hence remove_auth can be a boolean value or a list of strings.
When remove_auth is set to the boolean value True, clean_url() looks for auth tokens based on the default list of token names (provided below) and removes them. When remove_auth is set to a list of strings, the union of the user-provided list and the default list forms the set of token names to be removed.
[4]:
default_list = {
    "access_token",
    "auth_key",
    "auth",
    "password",
    "username",
    "login",
    "token",
    "passcode",
    "access-token",
    "auth-key",
    "authentication",
    "authentication-key",
}
Let's have a look at the same dataframe under the two scenarios described above (by examining the second row).
remove_auth = True
[5]:
df_remove_auth_boolean = clean_url(df, column="url", remove_auth=True)
df_remove_auth_boolean["url_details"][1]
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), set to NaN
    Removed 6 auth queries from 5 rows
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'not_token': 'hiThere', 'another_token': '12323423'}}
As we can see, the queries auth and token were removed from the result, but not_token and another_token remain; this is because auth and token appear in default_list. Also notice the additional report line giving statistics on how many queries were removed from how many rows.
[6]:
df_remove_auth_list = clean_url(df, column="url", remove_auth=["another_token"])
df_remove_auth_list["url_details"][1]
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), set to NaN
    Removed 7 auth queries from 5 rows
Result contains 5 (38.46%) parsed key-value pairs and 8 null values (61.54%)
{'scheme': 'http',
 'host': 'www.facebookee.com',
 'url_clean': 'http://www.facebookee.com/otherpath',
 'queries': {'not_token': 'hiThere'}}
As we can see, the queries auth, token and another_token were removed but not_token remains in the result. This is because a new list was created as the union of default_list and the user-defined list, and queries were removed based on this combined list.
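The union behaviour described above can be sketched in plain Python. This is an illustration, not dataprep's internals; the helper name tokens_to_remove is hypothetical.

```python
# The default set of auth token names, as listed earlier in this guide.
default_list = {
    "access_token", "auth_key", "auth", "password", "username", "login",
    "token", "passcode", "access-token", "auth-key", "authentication",
    "authentication-key",
}


def tokens_to_remove(remove_auth):
    # remove_auth=True -> use the default set as-is;
    # remove_auth=[...] -> union the user list with the default set.
    if remove_auth is True:
        return set(default_list)
    return default_list | set(remove_auth)


# Passing ["another_token"] removes the defaults *plus* another_token:
combined = tokens_to_remove(["another_token"])
```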
split
The split parameter adds individual columns containing all the extracted features to the given DataFrame.
[7]:
df_remove_split = clean_url(df, column="url", split=True)
df_remove_split
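Conceptually, split=True gives each extracted feature its own column. A rough sketch of how such per-feature columns could be derived with the standard library (an illustration, not dataprep's implementation; split_columns is a hypothetical helper):

```python
from urllib.parse import urlparse, parse_qs


def split_columns(urls):
    # Build one list (column) per extracted feature.
    columns = {"scheme": [], "host": [], "url_clean": [], "queries": []}
    for url in urls:
        parts = urlparse(url)
        columns["scheme"].append(parts.scheme)
        columns["host"].append(parts.netloc)
        columns["url_clean"].append(f"{parts.scheme}://{parts.netloc}{parts.path}")
        columns["queries"].append({k: v[0] for k, v in parse_qs(parts.query).items()})
    return columns


cols = split_columns(["https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van"])
```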
inplace
Replaces the original column with original_column_name_details.
[8]:
df_remove_inplace = clean_url(df, column="url", inplace=True)
df_remove_inplace
Replaces the original column with the individual feature columns generated by the split parameter.
[9]:
df_remove_inplace_split = clean_url(df, column="url", inplace=True, split=True)
df_remove_inplace_split
errors
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
errors = "coerce"
This is the default value of the errors parameter; it sets invalid parsing to NaN.
[10]:
df_remove_errors_default = clean_url(df, column="url")
df_remove_errors_default
errors = "ignore"
This sets the value of invalid parsing to the input.
[11]:
df_remove_errors_ignore = clean_url(df, column="url", errors="ignore")
df_remove_errors_ignore
URL Cleaning Report:
    5 values parsed (38.46%)
    5 values unable to be parsed (38.46%), left unchanged
Result contains 5 (38.46%) parsed key-value pairs and 3 null values (23.08%)
errors = "raise"
This will raise a ValueError when an invalid parsing value is encountered.
report
By default report is set to True; when set to False, the statistics pertaining to the cleaning operations performed will not be displayed.
[12]:
df_remove_auth_boolean = clean_url(df, column="url", remove_auth=True, report=False)
df_remove_auth_boolean
validate_url() returns True when the input is a valid url. Otherwise it returns False.
[13]:
from dataprep.clean import validate_url

print(validate_url({"not_a_url": True}))
print(validate_url(2346789))
print(validate_url("https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur"))
print(validate_url("http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423"))
False
False
True
True
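The results above are consistent with a simplified validity rule: a value is valid when it is a string that parses with both a scheme and a host. A sketch of that rule (not dataprep's actual validation logic; is_valid_url is a hypothetical helper):

```python
from urllib.parse import urlparse


def is_valid_url(value):
    # Non-strings (dicts, numbers, NaN, None) can never be valid urls.
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    # Require both a scheme ("https") and a host ("www.sfu.ca").
    return bool(parts.scheme) and bool(parts.netloc)


print(is_valid_url({"not_a_url": True}))  # False
print(is_valid_url("notaurl"))            # False
print(is_valid_url("https://www.sfu.ca/ficticiouspath?loc=sur"))  # True
```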
[14]:
df = pd.DataFrame({
    "url": [
        "random text which is not a url",
        "http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van",
        "notaurl",
        np.nan,
        None,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur",
        "",
        {"not_a_url": True},
        "2345678",
        345345345,
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur",
        "https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van",
    ]
})
df["validate_url"] = validate_url(df["url"])
df