The function clean_ip() cleans a column containing IP addresses, and standardizes them in a given format. The function validate_ip() validates either a single IP address or a column of IP addresses, returning True if the value is valid, and False otherwise.
Currently IPv4 and IPv6 are supported as valid input.
The IP addresses can be converted into any of the following formats:
* compressed: provides a compressed version of the IP address
* full: provides the full version of the IP address
* binary: provides the binary representation of the IP address
* hexa: provides the hexadecimal representation of the IP address
* integer: provides the integer representation of the IP address
* packed: provides the packed binary representation of the IP address
The default output format is compressed.
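As a rough illustration of what the six formats mean, the sketch below derives each representation with Python's standard ipaddress module. The helper name ip_formats is hypothetical, and this is not how clean_ip is implemented: the exact strings DataPrep produces (padding, prefixes, the "full" form of an IPv4 address) may differ from what the standard library returns.

```python
import ipaddress


def ip_formats(value):
    """Return several representations of one IP address.

    Hypothetical helper built on the standard library; it only
    approximates the output formats listed above.
    """
    ip = ipaddress.ip_address(value)  # accepts str, int, or packed bytes
    bits = ip.max_prefixlen           # 32 for IPv4, 128 for IPv6
    number = int(ip)
    return {
        "compressed": ip.compressed,
        "full": ip.exploded,
        "binary": format(number, f"0{bits}b"),  # zero-padded bit string
        "hexa": hex(number),
        "integer": number,
        "packed": ip.packed,
    }


formats = ip_formats("2001:db8::1")
print(formats["compressed"])  # 2001:db8::1
print(formats["full"])        # 2001:0db8:0000:0000:0000:0000:0000:0001
```

For an IPv4 address such as "192.168.0.1", the same helper gives the integer 3232235521 and the packed bytes b'\xc0\xa8\x00\x01'.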
Invalid parsing is handled with the errors parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
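The three error modes can be sketched for a single value as follows. This is a minimal illustration under stated assumptions: parse_ip is a hypothetical helper, parsing is delegated to the standard ipaddress module, and the validity rules of clean_ip itself may be stricter or looser.

```python
import ipaddress
import math


def parse_ip(value, errors="coerce"):
    """Parse one value, handling failures like the errors parameter.

    Hypothetical helper; not part of DataPrep.
    """
    try:
        return ipaddress.ip_address(value).compressed
    except (ValueError, TypeError):
        if errors == "raise":
            # "raise": surface the failure to the caller
            raise ValueError(f"unable to parse value {value!r}") from None
        if errors == "ignore":
            # "ignore": hand back the input unchanged
            return value
        # "coerce": replace the invalid value with NaN
        return math.nan


print(parse_ip("0.0.0.0"))                     # 0.0.0.0
print(parse_ip("455.0.0.0"))                   # nan
print(parse_ip("455.0.0.0", errors="ignore"))  # 455.0.0.0
```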
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
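To make the three report figures concrete, the sketch below computes them from a list of original values and a list of cleaned values. It assumes the "coerce" behaviour (unparsable values become NaN); cleaning_report and is_null are hypothetical names, and the counting rules are an interpretation of the description above rather than DataPrep's own code.

```python
import math


def is_null(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def cleaning_report(original, cleaned):
    """Count cleaned, unparsable, correct, and null values (illustrative)."""
    pairs = list(zip(original, cleaned))
    return {
        # transformed into a different, non-null value
        "cleaned": sum(not is_null(c) and c != o for o, c in pairs),
        # had a value before cleaning but is null afterwards
        "unparsed": sum(is_null(c) and not is_null(o) for o, c in pairs),
        # non-null after cleaning, so in the target format
        "correct": sum(not is_null(c) for _, c in pairs),
        "null": sum(is_null(c) for _, c in pairs),
    }


original = ["0.0.0.0", "455.0.0.0", None, 876234]
cleaned = ["0.0.0.0", math.nan, math.nan, "0.13.94.202"]
print(cleaning_report(original, cleaned))
# {'cleaned': 1, 'unparsed': 1, 'correct': 2, 'null': 2}
```

Note that a value already in the target format counts as correct but not as cleaned, which is why the two numbers can differ.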
[1]:
import pandas as pd

df = pd.DataFrame({
    "ips": [
        "00.000.0.0", "455.0.0.0", None, 876234, {}, "00.12.021.255",
        "684D:1111:222:3333:4444:5555:6:77", b'\xc9\xdb\x10\x00'
    ]
})
df
clean_ip
By default, clean_ip parses both IPv4 and IPv6 addresses and outputs them in the compressed format.
[2]:
from dataprep.clean import clean_ip

clean_ip(df, "ips")
IP Cleaning Report:
    3 values cleaned (37.5%)
    4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
This section demonstrates the input_format parameter.
ipv4
Will parse only IPv4 addresses.
[3]:
clean_ip(df, "ips", input_format="ipv4")
IP Cleaning Report:
    2 values cleaned (25.0%)
    5 values unable to be parsed (62.5%), set to NaN
Result contains 2 (25.0%) values in the correct format and 6 null values (75.0%)
ipv6
Will parse only IPv6 addresses.
[4]:
clean_ip(df, "ips", input_format="ipv6")
IP Cleaning Report:
    1 values cleaned (12.5%)
    6 values unable to be parsed (75.0%), set to NaN
Result contains 1 (12.5%) values in the correct format and 7 null values (87.5%)
auto
Will parse both IPv4 and IPv6 addresses.
[5]:
clean_ip(df, "ips", input_format="auto")
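The effect of restricting the accepted address family can be sketched with the standard ipaddress module, which offers one parser per family and one that accepts both. The names PARSERS and parse_with_format are hypothetical, and the sketch only mirrors the idea behind input_format; it does not reproduce clean_ip's handling of integers, bytes, or leading zeros.

```python
import ipaddress

# Map each input format to a standard-library parser (illustrative only).
PARSERS = {
    "ipv4": ipaddress.IPv4Address,
    "ipv6": ipaddress.IPv6Address,
    "auto": ipaddress.ip_address,
}


def parse_with_format(value, input_format="auto"):
    """Return the compressed address, or None if it does not parse.

    An unknown input_format raises KeyError.
    """
    try:
        return PARSERS[input_format](value).compressed
    except ValueError:
        return None


ipv6 = "684D:1111:222:3333:4444:5555:6:77"
print(parse_with_format("0.0.0.0", "ipv4"))  # 0.0.0.0
print(parse_with_format("0.0.0.0", "ipv6"))  # None
print(parse_with_format(ipv6, "ipv4"))       # None
print(parse_with_format(ipv6, "auto"))       # 684d:1111:222:3333:4444:5555:6:77
```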
[6]:
clean_ip(df, "ips", output_format="compressed")
[7]:
clean_ip(df, "ips", output_format="full")
[8]:
clean_ip(df, "ips", output_format="binary")
[9]:
clean_ip(df, "ips", output_format="hexa")
[10]:
clean_ip(df, "ips", output_format="integer")
IP Cleaning Report:
    2 values cleaned (25.0%)
    4 values unable to be parsed (50.0%), set to NaN
Result contains 3 (37.5%) values in the correct format and 5 null values (62.5%)
[11]:
clean_ip(df, "ips", output_format="packed")
coerce
[12]:
clean_ip(df, "ips", errors="coerce")
ignore
[13]:
clean_ip(df, "ips", errors="ignore")
IP Cleaning Report:
    3 values cleaned (37.5%)
    4 values unable to be parsed (50.0%), left unchanged
Result contains 3 (37.5%) values in the correct format and 1 null values (12.5%)
validate_ip() returns True if the input is a valid IP, otherwise False.
[14]:
from dataprep.clean import validate_ip

print(validate_ip("455.0.0.0"))
print(validate_ip({}))
print(validate_ip(" "))
print(validate_ip("0.0.0.0"))
print(validate_ip("684D:1111:222:3333:4444:5555:6:77"))
False
False
False
True
True
[15]:
df_2 = validate_ip(df["ips"])
df_2
0    False
1    False
2    False
3     True
4    False
5    False
6     True
7     True
Name: ips, dtype: bool
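The idea of accepting either a single value or a whole column can be sketched in plain Python as below. The functions is_valid_ip and validate are hypothetical, a list stands in for the pandas column, and validity is decided by the standard ipaddress module, whose rules are not guaranteed to match validate_ip (for example, its treatment of leading zeros in IPv4 octets depends on the Python version).

```python
import ipaddress


def is_valid_ip(value):
    """True if the standard library can parse the value as an IP address."""
    try:
        ipaddress.ip_address(value)
        return True
    except (ValueError, TypeError):
        return False


def validate(values):
    """Validate one value, or each element of a list or tuple."""
    if isinstance(values, (list, tuple)):
        return [is_valid_ip(v) for v in values]
    return is_valid_ip(values)


print(validate("455.0.0.0"))  # False
print(validate("0.0.0.0"))    # True
print(validate(["455.0.0.0", None, 876234, {}, b"\xc9\xdb\x10\x00"]))
# [False, False, True, False, True]
```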