IP Addresses¶

Introduction¶

The function clean_ip() cleans a column containing IP addresses, and standardizes them in a given format. The function validate_ip() validates either a single IP address or a column of IP addresses, returning True if the value is valid, and False otherwise.

Currently IPv4 and IPv6 are supported as valid input.

The IP addresses can be converted into any of the following formats:

compressed: provides a compressed version of the IP address
full: provides the full version of the IP address
binary: provides the binary representation of the IP address
hexa: provides the hexadecimal representation of the IP address
integer: provides the integer representation of the IP address
packed: provides the packed binary representation of the IP address

The default output format is compressed.
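To make the output formats listed above concrete, the following sketch derives each one with Python's standard `ipaddress` module. It is not DataPrep's implementation: `ip_formats` is a hypothetical helper, and the four-digit zero padding used for the IPv4 `full` form is an assumption chosen to mirror the outputs shown later in this document.

```python
import ipaddress


def ip_formats(value):
    """Return the six representations of one address (illustrative helper)."""
    addr = ipaddress.ip_address(value)
    bits = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
    if addr.version == 4:
        # Assumption: pad every octet to four digits, as in this document's outputs.
        full = ".".join(f"{int(octet):04d}" for octet in str(addr).split("."))
    else:
        full = addr.exploded  # all eight groups, each padded to four hex digits
    return {
        "compressed": addr.compressed,
        "full": full,
        "binary": format(int(addr), f"0{bits}b"),
        "hexa": hex(int(addr)),
        "integer": int(addr),
        "packed": addr.packed,
    }


for name, text in ip_formats("201.219.16.0").items():
    print(f"{name:>10}: {text!r}")
```

The same helper accepts an IPv6 string such as `"684D:1111:222:3333:4444:5555:6:77"`, in which case the binary form is 128 characters long.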

Invalid parsing is handled with the errors parameter:

“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
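
The three error-handling modes above can be sketched for a single value as follows. This is a minimal illustration built on the standard `ipaddress` module, not DataPrep's own code: `parse_ip` is a hypothetical helper, and it assumes that a parsing failure surfaces as a `ValueError`, which is what `ipaddress.ip_address` raises.

```python
import ipaddress


def parse_ip(value, errors="coerce"):
    """Hypothetical helper: parse one value, handling failures per `errors`."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors option: {errors!r}")
    try:
        return ipaddress.ip_address(value).compressed
    except ValueError:
        if errors == "raise":
            raise  # propagate the parsing error to the caller
        if errors == "ignore":
            return value  # hand the input back unchanged
        return float("nan")  # "coerce": replace the value with NaN


print(parse_ip("455.0.0.0", errors="coerce"))
print(parse_ip("455.0.0.0", errors="ignore"))
```

Calling `parse_ip("455.0.0.0", errors="raise")` raises the underlying `ValueError` instead of returning a value.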

After cleaning, a report is printed that provides the following information:

How many values were cleaned (a value counts as cleaned only if it was transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
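
As a rough illustration of how these counts relate to each other, the sketch below recomputes them from an input list and a cleaned list. `summarize` is a hypothetical helper, not part of DataPrep. It assumes the default `errors="coerce"` behaviour and uses `None` to stand in for NaN.

```python
def summarize(original, cleaned):
    """Hypothetical helper: recompute the report counts from two parallel lists."""
    pairs = list(zip(original, cleaned))
    total = len(pairs)
    # Cleaned: a non-null result that differs from its input.
    n_cleaned = sum(1 for o, c in pairs if c is not None and c != o)
    # Unparsed: a non-null input that ended up null.
    n_unparsed = sum(1 for o, c in pairs if o is not None and c is None)
    n_null = sum(1 for _, c in pairs if c is None)
    return {
        "total": total,
        "cleaned": n_cleaned,
        "unparsed": n_unparsed,
        "null": n_null,
        "correct": total - n_null,
    }


original = ["00.000.0.0", "455.0.0.0", None, 876234, {}, "00.12.021.255",
            "684D:1111:222:3333:4444:5555:6:77", b"\xc9\xdb\x10\x00"]
cleaned = [None, None, None, "0.13.94.202", None, None,
           "684d:1111:222:3333:4444:5555:6:77", "201.219.16.0"]

stats = summarize(original, cleaned)
for key in ("cleaned", "unparsed", "correct", "null"):
    share = 100 * stats[key] / stats["total"]
    print(f"{key:>8}: {stats[key]} ({share:.1f}%)")
```

Under this definition, a value that is valid but comes out identical to its input (for example an integer input with integer output) counts as correct but not as cleaned.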

An example dataset containing IP addresses¶

[1]:

import pandas as pd
df = pd.DataFrame({
    "ips": [
        "00.000.0.0", "455.0.0.0", None, 876234, {}, "00.12.021.255",
        "684D:1111:222:3333:4444:5555:6:77", b'\xc9\xdb\x10\x00'
    ]
})
df

[1]:

	ips
0	00.000.0.0
1	455.0.0.0
2	None
3	876234
4	{}
5	00.12.021.255
6	684D:1111:222:3333:4444:5555:6:77
7	b'\xc9\xdb\x10\x00'

1. Default `clean_ip`¶

By default, clean_ip parses both IPv4 and IPv6 addresses and outputs them in the compressed format.

[2]:

from dataprep.clean import clean_ip
clean_ip(df, "ips")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[2]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	0.13.94.202
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	684d:1111:222:3333:4444:5555:6:77
7	b'\xc9\xdb\x10\x00'	201.219.16.0

2. Input formats¶

This section demonstrates the input_format parameter, which controls which IP versions are accepted as input.

`ipv4`¶

Will parse only IPv4 addresses.

[3]:

clean_ip(df, "ips", input_format="ipv4")

IP Cleaning Report:
        2 values cleaned (25.0%)
        5 values unable to be parsed (62.5%), set to NaN
Result contains 2 (25.0%) values in the correct format and 6 null values (75.0%)

[3]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	0.13.94.202
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	NaN
7	b'\xc9\xdb\x10\x00'	201.219.16.0

`ipv6`¶

Will parse only IPv6 addresses.

[4]:

clean_ip(df, "ips", input_format="ipv6")

IP Cleaning Report:
        1 values cleaned (12.5%)
        6 values unable to be parsed (75.0%), set to NaN
Result contains 1 (12.5%) values in the correct format and 7 null values (87.5%)

[4]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	NaN
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	684d:1111:222:3333:4444:5555:6:77
7	b'\xc9\xdb\x10\x00'	NaN

`auto` (default parameter)¶

Will parse both IPv4 and IPv6 addresses.

[5]:

clean_ip(df, "ips", input_format="auto")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[5]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	0.13.94.202
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	684d:1111:222:3333:4444:5555:6:77
7	b'\xc9\xdb\x10\x00'	201.219.16.0
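
The effect of the three `input_format` settings shown in this section can be approximated for string inputs with the standard `ipaddress` module. This is a sketch under assumptions, not DataPrep's implementation: `PARSERS` and `parse_with_format` are hypothetical names. Integer inputs are deliberately left out, because the standard library accepts an integer such as 876234 as either version, whereas the outputs above show it parsed only as IPv4.

```python
import ipaddress

# Map each input_format value to a standard-library parser (illustrative stand-in).
PARSERS = {
    "ipv4": ipaddress.IPv4Address,
    "ipv6": ipaddress.IPv6Address,
    "auto": ipaddress.ip_address,
}


def parse_with_format(text, input_format="auto"):
    """Hypothetical helper: return the compressed address, or None if parsing fails."""
    try:
        return PARSERS[input_format](text).compressed
    except ValueError:  # AddressValueError is a subclass of ValueError
        return None


samples = ["201.219.16.0", "684D:1111:222:3333:4444:5555:6:77", "455.0.0.0"]
for fmt in ("ipv4", "ipv6", "auto"):
    print(fmt, [parse_with_format(s, fmt) for s in samples])
```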

3. Output formats¶

`compressed` (default)¶

[6]:

clean_ip(df, "ips", output_format="compressed")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[6]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	0.13.94.202
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	684d:1111:222:3333:4444:5555:6:77
7	b'\xc9\xdb\x10\x00'	201.219.16.0

`full`¶

[7]:

clean_ip(df, "ips", output_format="full")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[7]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	0000.0013.0094.0202
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	684d:1111:0222:3333:4444:5555:0006:0077
7	b'\xc9\xdb\x10\x00'	0201.0219.0016.0000

`binary`¶

[8]:

clean_ip(df, "ips", output_format="binary")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[8]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	00000000000011010101111011001010
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	0110100001001101000100010001000100000010001000...
7	b'\xc9\xdb\x10\x00'	11001001110110110001000000000000

`hexa`¶

[9]:

clean_ip(df, "ips", output_format="hexa")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[9]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	0xd5eca
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	0x684d1111022233334444555500060077
7	b'\xc9\xdb\x10\x00'	0xc9db1000

`integer`¶

[10]:

clean_ip(df, "ips", output_format="integer")

IP Cleaning Report:
        2 values cleaned (25.0%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[10]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	876234
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	138639864568240772614187040837063802999
7	b'\xc9\xdb\x10\x00'	3386576896

`packed`¶

[11]:

clean_ip(df, "ips", output_format="packed")

IP Cleaning Report:
        2 values cleaned (25.0%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[11]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	b'\x00\r^\xca'
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	b'hM\x11\x11\x02"33DDUU\x00\x06\x00w'
7	b'\xc9\xdb\x10\x00'	b'\xc9\xdb\x10\x00'

4. `errors` parameter¶

`coerce` (default)¶

[12]:

clean_ip(df, "ips", errors="coerce")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)

[12]:

	ips	ips_clean
0	00.000.0.0	NaN
1	455.0.0.0	NaN
2	None	NaN
3	876234	0.13.94.202
4	{}	NaN
5	00.12.021.255	NaN
6	684D:1111:222:3333:4444:5555:6:77	684d:1111:222:3333:4444:5555:6:77
7	b'\xc9\xdb\x10\x00'	201.219.16.0

`ignore`¶

[13]:

clean_ip(df, "ips", errors="ignore")

IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), left unchanged
Result contains 3 (37.5%) values in the correct format and 1 null values (12.5%)

[13]:

	ips	ips_clean
0	00.000.0.0	00.000.0.0
1	455.0.0.0	455.0.0.0
2	None	NaN
3	876234	0.13.94.202
4	{}	{}
5	00.12.021.255	00.12.021.255
6	684D:1111:222:3333:4444:5555:6:77	684d:1111:222:3333:4444:5555:6:77
7	b'\xc9\xdb\x10\x00'	201.219.16.0

5. `validate_ip()`¶

validate_ip() returns True if the input is a valid IP, otherwise False.

[14]:

from dataprep.clean import validate_ip

print(validate_ip("455.0.0.0"))
print(validate_ip({}))
print(validate_ip(" "))
print(validate_ip("0.0.0.0"))
print(validate_ip("684D:1111:222:3333:4444:5555:6:77"))

False
False
False
True
True

[15]:

df_2 = validate_ip(df["ips"])
df_2

[15]:

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7     True
Name: ips, dtype: bool
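
For comparison, a similar check can be written with only the standard library. The sketch below is an assumption-laden stand-in rather than DataPrep's validator: `is_valid_ip` is a hypothetical helper that treats a value as valid whenever `ipaddress.ip_address` accepts it, so edge cases (for example octets with leading zeros, which newer Python versions reject) may not match `validate_ip()` exactly.

```python
import ipaddress


def is_valid_ip(value):
    """Hypothetical helper: True if the standard library can parse `value` as an IP."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


values = ["455.0.0.0", {}, " ", "0.0.0.0", "684D:1111:222:3333:4444:5555:6:77"]
print([is_valid_ip(v) for v in values])
```

Applying the helper element by element, as in the list comprehension above, plays the same role as passing a whole column to `validate_ip()`.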
