IP Addresses

Introduction

The function clean_ip() cleans a column containing IP addresses, and standardizes them in a given format. The function validate_ip() validates either a single IP address or a column of IP addresses, returning True if the value is valid, and False otherwise.

Currently IPv4 and IPv6 are supported as valid input.

The IP addresses can be converted into any of the following formats:

  • compressed: provides a compressed version of the IP address

  • full: provides the full version of the IP address

  • binary: provides the binary representation of the IP address

  • hexa: provides the hexadecimal representation of the IP address

  • integer: provides the integer representation of the IP address

  • packed: provides the packed binary representation of the IP address

The default output format is compressed.
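The formats listed above can be approximated with Python's standard `ipaddress` module. This is a minimal sketch and not dataprep's implementation: the helper name `to_format` is hypothetical, and the standard library's "full" form of an IPv4 address is not zero-padded the way the clean_ip() output later in this guide is.

```python
import ipaddress

def to_format(value, output_format="compressed"):
    """Render one IP address in one of the listed formats (hypothetical helper)."""
    addr = ipaddress.ip_address(value)  # accepts a string, an integer, or packed bytes
    number = int(addr)
    if output_format == "compressed":
        return addr.compressed
    if output_format == "full":
        return addr.exploded
    if output_format == "binary":
        # Zero-pad to 32 bits for IPv4 and 128 bits for IPv6.
        return format(number, f"0{addr.max_prefixlen}b")
    if output_format == "hexa":
        return hex(number)
    if output_format == "integer":
        return number
    if output_format == "packed":
        return addr.packed
    raise ValueError(f"unknown output format: {output_format!r}")

ipv6 = "684D:1111:222:3333:4444:5555:6:77"
for name in ("compressed", "full", "hexa", "integer"):
    print(name, to_format(ipv6, name))
print("binary", to_format("201.219.16.0", "binary"))
print("packed", to_format("201.219.16.0", "packed"))
```

The integer value is the common intermediate here: the binary, hexa, and integer forms are all derived from it, while compressed, full, and packed come directly from the address object.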

Invalid parsing is handled with the errors parameter:

  • “coerce” (default): invalid parsing will be set to NaN

  • “ignore”: invalid parsing will return the input

  • “raise”: invalid parsing will raise an exception
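The three options can be illustrated with a small stand-alone function. This is a sketch built on the standard `ipaddress` module; `parse_ip` is a hypothetical name, and the code only mirrors the behavior described in the list above rather than reproducing how clean_ip() is written.

```python
import ipaddress
import math

def parse_ip(value, errors="coerce"):
    """Return the compressed address, or handle a parse failure according to `errors`."""
    try:
        return ipaddress.ip_address(value).compressed
    except ValueError:
        if errors == "coerce":
            return float("nan")  # the invalid value becomes NaN
        if errors == "ignore":
            return value         # the invalid value is returned unchanged
        if errors == "raise":
            raise                # the exception propagates to the caller
        raise ValueError(f"unknown errors option: {errors!r}")

print(parse_ip("455.0.0.0"))                   # invalid: first octet exceeds 255
print(parse_ip("455.0.0.0", errors="ignore"))
print(parse_ip(876234))                        # integers are accepted as addresses
```

With "coerce" the result column stays uniform (valid addresses or NaN), whereas "ignore" keeps the original data visible at the cost of mixing cleaned and uncleaned values.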

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (the value must have been transformed).

  • How many values could not be parsed.

  • A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
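The counts in the report can be reproduced by comparing each input value with its cleaned counterpart. The sketch below is an interpretation inferred from the report numbers shown in this guide, not taken from dataprep's source; `summarize` is a hypothetical helper, and treating a missing input (None) as null but not as a parse failure is an assumption.

```python
import math

def is_nan(value):
    return isinstance(value, float) and math.isnan(value)

def summarize(original, cleaned):
    """Count cleaned, unparsable, and null values from paired input/output lists."""
    n_cleaned = n_unparsed = n_null = 0
    for before, after in zip(original, cleaned):
        if is_nan(after):
            n_null += 1
            if before is not None:  # assumption: a missing input is not a parse failure
                n_unparsed += 1
        elif after != before:       # only values that were transformed count as cleaned
            n_cleaned += 1
    return n_cleaned, n_unparsed, n_null

nan = float("nan")
# Input values and default (compressed) results from the example dataset in this guide.
original = ["00.000.0.0", "455.0.0.0", None, 876234, {}, "00.12.021.255",
            "684D:1111:222:3333:4444:5555:6:77", b"\xc9\xdb\x10\x00"]
cleaned = [nan, nan, nan, "0.13.94.202", nan, nan,
           "684d:1111:222:3333:4444:5555:6:77", "201.219.16.0"]
print(summarize(original, cleaned))
```

Under this reading, a valid value that already matches the requested format (for example the integer 876234 with the integer output format) is in the correct format but does not count as cleaned.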

An example dataset containing IP addresses

[1]:
import pandas as pd
df = pd.DataFrame({
    "ips": [
        "00.000.0.0", "455.0.0.0", None, 876234, {}, "00.12.021.255",
        "684D:1111:222:3333:4444:5555:6:77", b'\xc9\xdb\x10\x00'
    ]
})
df
[1]:
ips
0 00.000.0.0
1 455.0.0.0
2 None
3 876234
4 {}
5 00.12.021.255
6 684D:1111:222:3333:4444:5555:6:77
7 b'\xc9\xdb\x10\x00'

1. Default clean_ip

By default, clean_ip() parses both IPv4 and IPv6 addresses and outputs them in the compressed format.

[2]:
from dataprep.clean import clean_ip
clean_ip(df, "ips")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[2]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 0.13.94.202
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 684d:1111:222:3333:4444:5555:6:77
7 b'\xc9\xdb\x10\x00' 201.219.16.0

2. Input formats

This section demonstrates the input_format parameter.

ipv4

Will parse only IPv4 addresses.

[3]:
clean_ip(df, "ips", input_format="ipv4")
IP Cleaning Report:
        2 values cleaned (25.0%)
        5 values unable to be parsed (62.5%), set to NaN
Result contains 2 (25.0%) values in the correct format and 6 null values (75.0%)
[3]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 0.13.94.202
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 NaN
7 b'\xc9\xdb\x10\x00' 201.219.16.0

ipv6

Will parse only IPv6 addresses.

[4]:
clean_ip(df, "ips", input_format="ipv6")
IP Cleaning Report:
        1 values cleaned (12.5%)
        6 values unable to be parsed (75.0%), set to NaN
Result contains 1 (12.5%) values in the correct format and 7 null values (87.5%)
[4]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 NaN
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 684d:1111:222:3333:4444:5555:6:77
7 b'\xc9\xdb\x10\x00' NaN

auto (default parameter)

Will parse both IPv4 and IPv6 addresses.

[5]:
clean_ip(df, "ips", input_format="auto")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[5]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 0.13.94.202
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 684d:1111:222:3333:4444:5555:6:77
7 b'\xc9\xdb\x10\x00' 201.219.16.0
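The effect of restricting the address family can be sketched with the standard `ipaddress` classes. The example is limited to string input and is only an illustration: `PARSERS` and `matches_format` are hypothetical names, and the standard library may treat integers and bytes differently from the results shown above.

```python
import ipaddress

# Hypothetical mapping from an input_format name to the stdlib parsers to try.
PARSERS = {
    "ipv4": (ipaddress.IPv4Address,),
    "ipv6": (ipaddress.IPv6Address,),
    "auto": (ipaddress.IPv4Address, ipaddress.IPv6Address),
}

def matches_format(text, input_format="auto"):
    """Return True if the string parses under the selected address family."""
    for parser in PARSERS[input_format]:
        try:
            parser(text)
            return True
        except ValueError:
            continue  # try the next family, if any
    return False

ipv6 = "684D:1111:222:3333:4444:5555:6:77"
for fmt in ("ipv4", "ipv6", "auto"):
    print(fmt, matches_format(ipv6, fmt), matches_format("201.219.16.0", fmt))
```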

3. Output formats

compressed (default)

[6]:
clean_ip(df, "ips", output_format="compressed")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[6]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 0.13.94.202
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 684d:1111:222:3333:4444:5555:6:77
7 b'\xc9\xdb\x10\x00' 201.219.16.0

full

[7]:
clean_ip(df, "ips", output_format="full")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[7]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 0000.0013.0094.0202
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 684d:1111:0222:3333:4444:5555:0006:0077
7 b'\xc9\xdb\x10\x00' 0201.0219.0016.0000

binary

[8]:
clean_ip(df, "ips", output_format="binary")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[8]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 00000000000011010101111011001010
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 0110100001001101000100010001000100000010001000...
7 b'\xc9\xdb\x10\x00' 11001001110110110001000000000000

hexa

[9]:
clean_ip(df, "ips", output_format="hexa")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[9]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 0xd5eca
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 0x684d1111022233334444555500060077
7 b'\xc9\xdb\x10\x00' 0xc9db1000

integer

[10]:
clean_ip(df, "ips", output_format="integer")
IP Cleaning Report:
        2 values cleaned (25.0%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[10]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 876234
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 138639864568240772614187040837063802999
7 b'\xc9\xdb\x10\x00' 3386576896

packed

[11]:
clean_ip(df, "ips", output_format="packed")
IP Cleaning Report:
        2 values cleaned (25.0%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[11]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 b'\x00\r^\xca'
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 b'hM\x11\x11\x02"33DDUU\x00\x06\x00w'
7 b'\xc9\xdb\x10\x00' b'\xc9\xdb\x10\x00'

4. errors parameter

coerce (default)

[12]:
clean_ip(df, "ips", errors="coerce")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[12]:
ips ips_clean
0 00.000.0.0 NaN
1 455.0.0.0 NaN
2 None NaN
3 876234 0.13.94.202
4 {} NaN
5 00.12.021.255 NaN
6 684D:1111:222:3333:4444:5555:6:77 684d:1111:222:3333:4444:5555:6:77
7 b'\xc9\xdb\x10\x00' 201.219.16.0

ignore

[13]:
clean_ip(df, "ips", errors="ignore")
IP Cleaning Report:
        3 values cleaned (37.5%)
        4 values unable to be parsed (50.0%), left unchanged
Result contains 3 (37.5%) values in the correct format and 1 null values (12.5%)
[13]:
ips ips_clean
0 00.000.0.0 00.000.0.0
1 455.0.0.0 455.0.0.0
2 None NaN
3 876234 0.13.94.202
4 {} {}
5 00.12.021.255 00.12.021.255
6 684D:1111:222:3333:4444:5555:6:77 684d:1111:222:3333:4444:5555:6:77
7 b'\xc9\xdb\x10\x00' 201.219.16.0

5. validate_ip()

validate_ip() returns True if the input is a valid IP, otherwise False.

[14]:
from dataprep.clean import validate_ip

print(validate_ip("455.0.0.0"))
print(validate_ip({}))
print(validate_ip(" "))
print(validate_ip("0.0.0.0"))
print(validate_ip("684D:1111:222:3333:4444:5555:6:77"))
False
False
False
True
True
[15]:
df_2 = validate_ip(df["ips"])
df_2
[15]:
0    False
1    False
2    False
3     True
4    False
5    False
6     True
7     True
Name: ips, dtype: bool
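The two calling styles shown above (a single value, or a whole column) can be mimicked with a small dispatcher. This is a sketch under stated assumptions: it uses the standard `ipaddress` module and plain lists in place of a pandas Series, `is_valid_ip` and `validate` are hypothetical names, and the standard library's idea of validity may differ from validate_ip() for some inputs.

```python
import ipaddress

def is_valid_ip(value):
    """Return True if the standard library can parse the value as an IP address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def validate(values):
    """Validate one value, or each element of a list or tuple."""
    if isinstance(values, (list, tuple)):
        return [is_valid_ip(v) for v in values]
    return is_valid_ip(values)

print(validate("455.0.0.0"))
print(validate(["0.0.0.0", " ", {}, "684D:1111:222:3333:4444:5555:6:77"]))
```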