The function clean_lat_long() cleans and standardizes a DataFrame column containing latitude and/or longitude coordinates. The function validate_lat_long() validates either a single coordinate or a column of coordinates, returning True if the value is valid, and False otherwise.
The following latitude and longitude formats are supported by the output_format parameter:
Decimal degrees (dd): 41.5
Decimal degrees hemisphere (ddh): “41.5° N”
Degrees minutes (dm): “41° 30′ N”
Degrees minutes seconds (dms): “41° 30′ 0″ N”
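The four formats are related by simple arithmetic: the fractional part of a degree times 60 gives minutes, and the fractional part of a minute times 60 gives seconds. The sketch below illustrates that conversion. Note that format_coordinate is a hypothetical helper written for this guide, not a DataPrep function, and the exact strings produced by clean_lat_long() may differ in spacing or precision.

```python
def format_coordinate(value, output_format="dd", is_latitude=True):
    """Illustrative only: render a signed decimal degree in one of four formats."""
    if output_format == "dd":
        return value
    # The sign of the decimal value decides the hemisphere letter.
    if is_latitude:
        hemisphere = "N" if value >= 0 else "S"
    else:
        hemisphere = "E" if value >= 0 else "W"
    magnitude = abs(value)
    if output_format == "ddh":
        return f"{magnitude:g}° {hemisphere}"
    degrees = int(magnitude)
    minutes = (magnitude - degrees) * 60
    if output_format == "dm":
        return f"{degrees}° {minutes:g}′ {hemisphere}"
    if output_format == "dms":
        whole_minutes = int(minutes)
        seconds = (minutes - whole_minutes) * 60
        return f"{degrees}° {whole_minutes}′ {seconds:g}″ {hemisphere}"
    raise ValueError(f"unknown output_format: {output_format!r}")


for fmt in ("dd", "ddh", "dm", "dms"):
    print(fmt, format_coordinate(41.5, fmt))
```

For the value 41.5 this reproduces the four examples listed above.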
You can split a column of geographic coordinates into one column for latitude and another for longitude by setting the parameter split to True.
Invalid parsing is handled with the errors parameter:
“coerce” (default): invalid parsing will be set to NaN
“ignore”: invalid parsing will return the input
“raise”: invalid parsing will raise an exception
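The three policies can be pictured with a small stand-in parser. This is only a sketch of the semantics described above: parse_with_errors is a hypothetical function that accepts plain numeric strings, and it does not reflect how DataPrep parses coordinates internally.

```python
import math


def parse_with_errors(value, errors="coerce"):
    """Hypothetical stand-in parser that only understands plain numbers."""
    try:
        return float(value)
    except (TypeError, ValueError):
        if errors == "coerce":
            return math.nan   # invalid parsing becomes NaN
        if errors == "ignore":
            return value      # invalid parsing returns the input unchanged
        if errors == "raise":
            raise ValueError(f"unable to parse value {value!r}")
        raise ValueError(f"unknown errors option: {errors!r}")


print(parse_with_errors("41.5"))
print(parse_with_errors("hello", errors="coerce"))
print(parse_with_errors("hello", errors="ignore"))
```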
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must have been transformed).
How many values could not be parsed.
A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
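The report figures follow from four tallies. The sketch below shows the arithmetic under two assumptions that are not stated by the library: that every percentage is taken over the total number of rows, and that with errors="coerce" unparsed values count as nulls in the result. The function summarize and the tallies passed to it are hypothetical.

```python
def summarize(cleaned, unchanged, unparsed, null_inputs):
    """Illustrative report arithmetic; all percentages are relative to the row count."""
    total = cleaned + unchanged + unparsed + null_inputs

    def pct(count):
        return round(100 * count / total, 2)

    correct = cleaned + unchanged    # values now in the requested format
    nulls = unparsed + null_inputs   # assumes unparsed values were coerced to NaN
    return {
        "cleaned": (cleaned, pct(cleaned)),
        "unparsed": (unparsed, pct(unparsed)),
        "correct_format": (correct, pct(correct)),
        "null": (nulls, pct(nulls)),
    }


# Hypothetical tallies for a 13-row column.
print(summarize(cleaned=8, unchanged=1, unparsed=2, null_inputs=2))
```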
The following sections demonstrate the functionality of clean_lat_long() and validate_lat_long().
[1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "lat_long": [
        (41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
        "41.5° N, 81.0° W", "41.5 S;81.0 E", "-41.5 S;81.0 E",
        "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
        "UT: N 39°20' 0'' / W 74°35' 0''", "hello", np.nan, "NULL",
    ]
})
df
By default, the output_format parameter is set to “dd” (decimal degrees) and the errors parameter is set to “coerce” (set to NaN when parsing is invalid).
[2]:
from dataprep.clean import clean_lat_long
clean_lat_long(df, "lat_long")

Latitude and Longitude Cleaning Report:
    8 values cleaned (61.54%)
    2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
Note that (41.5, -81.0) is not counted as cleaned in the report, since its resulting format is the same as the input. Also, “-41.5 S;81.0 E” is invalid because a coordinate that has a hemisphere cannot also contain a negative decimal value.
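The hemisphere rule can be expressed as a short check on a single coordinate. The following is a minimal sketch, assuming a simplified input grammar (an optional minus sign, a decimal number, an optional degree sign, and an optional hemisphere letter). The helper to_signed_decimal is hypothetical and far less permissive than the parser in clean_lat_long().

```python
import re

# Simplified grammar: optional "-", a number, optional "°", optional N/S/E/W.
_COORD = re.compile(r"^\s*(-?)(\d+(?:\.\d+)?)\s*°?\s*([NSEW]?)\s*$")


def to_signed_decimal(text):
    """Return a signed decimal degree, or None when the text is invalid."""
    match = _COORD.match(text)
    if match is None:
        return None
    sign, number, hemisphere = match.groups()
    if sign and hemisphere:
        # A hemisphere letter already fixes the sign, so "-41.5 S" is rejected.
        return None
    value = float(number)
    if sign or hemisphere in ("S", "W"):
        value = -value
    return value


for text in ("41.5 S", "-41.5 S", "81.0 E", "-81.0"):
    print(repr(text), to_signed_decimal(text))
```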
This section demonstrates the supported latitudinal and longitudinal formats.
[3]:
clean_lat_long(df, "lat_long", output_format="ddh")
[4]:
clean_lat_long(df, "lat_long", output_format="dm")
Latitude and Longitude Cleaning Report:
    9 values cleaned (69.23%)
    2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[5]:
clean_lat_long(df, "lat_long", output_format="dms")
The split parameter adds individual columns containing the cleaned latitude and longitude values to the given DataFrame.
[6]:
clean_lat_long(df, "lat_long", split=True)
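Conceptually, splitting turns one column of (latitude, longitude) pairs into two parallel columns. The sketch below shows the idea on plain Python lists; split_pairs is a hypothetical helper, and the column names "latitude" and "longitude" are assumptions chosen for illustration rather than names confirmed by this guide.

```python
def split_pairs(pairs):
    """Illustrative only: turn a list of (lat, long) pairs into two columns."""
    latitudes, longitudes = [], []
    for pair in pairs:
        if pair is None:
            # A missing or unparsed pair stays missing in both columns.
            latitudes.append(None)
            longitudes.append(None)
        else:
            lat, long = pair
            latitudes.append(lat)
            longitudes.append(long)
    # Assumed column names, used here only for illustration.
    return {"latitude": latitudes, "longitude": longitudes}


print(split_pairs([(41.5, -81.0), None, (-41.5, 81.0)]))
```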
The split parameter can be combined with any of the output formats.
[7]:
clean_lat_long(df, "lat_long", split=True, output_format="dm")
Setting the inplace parameter to True deletes the given column from the returned DataFrame. A new column containing cleaned coordinates is added with a title in the format "{original title}_clean".
[8]:
clean_lat_long(df, "lat_long", inplace=True)
[9]:
clean_lat_long(df, "lat_long", split=True, inplace=True)
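The column handling described in this section can be sketched with a dict of lists standing in for a DataFrame. The helper add_clean_column is hypothetical, and the behavior shown for inplace=False (keeping the original column next to the cleaned one) is an assumption.

```python
def add_clean_column(table, column, cleaned_values, inplace=False):
    """Illustrative only: attach cleaned values as "<column>_clean"."""
    result = dict(table)      # shallow copy, so the caller's table is untouched
    if inplace:
        del result[column]    # inplace=True drops the original column
    result[f"{column}_clean"] = cleaned_values
    return result


table = {"lat_long": ["41.5;-81.0", "hello"]}
cleaned = [(41.5, -81.0), None]
print(add_clean_column(table, "lat_long", cleaned))
print(add_clean_column(table, "lat_long", cleaned, inplace=True))
```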
[10]:
df = pd.DataFrame({"lat": [" 30′ 0″ E", "41° 30′ N", "41 S", "80", "hello", "NA"]})
clean_lat_long(df, lat_col="lat")

Latitude and Longitude Cleaning Report:
    3 values cleaned (50.0%)
    2 values unable to be parsed (33.33%), set to NaN
Result contains 3 (50.0%) values in the correct format and 3 null values (50.0%)
Latitude and longitude values are counted separately in the report.
[11]:
df = pd.DataFrame({
    "lat": ["30° E", "41° 30′ N", "41 S", "80", "hello", "NA"],
    "long": ["30° E", "41° 30′ N", "41 W", "80", "hello", "NA"],
})
clean_lat_long(df, lat_col="lat", long_col="long")

Latitude and Longitude Cleaning Report:
    6 values cleaned (100.0%)
Result contains 6 (100.0%) values in the correct format and 0 null values (0.0%)
[12]:
clean_lat_long(df, lat_col="lat", long_col="long", split=True)
validate_lat_long() returns True when the input is a valid latitude or longitude value; otherwise it returns False. It accepts the same input types as clean_lat_long().
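One check that any coordinate validator needs is the numeric range: latitude lies between -90 and 90 degrees, and longitude between -180 and 180. The sketch below covers only that range test for values that are already numeric. The helper is_valid_pair is hypothetical and is not how validate_lat_long() is implemented.

```python
def is_valid_pair(lat, long):
    """Illustrative only: range check for a numeric (latitude, longitude) pair."""
    numeric = isinstance(lat, (int, float)) and isinstance(long, (int, float))
    if not numeric:
        return False
    return -90 <= lat <= 90 and -180 <= long <= 180


print(is_valid_pair(41.5, 81))
print(is_valid_pair(91, 0))
print(is_valid_pair(0, 181))
```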
[13]:
from dataprep.clean import validate_lat_long

print(validate_lat_long("41° 30′ 0″ N"))
print(validate_lat_long("41.5 S;81.0 E"))
print(validate_lat_long("-41.5 S;81.0 E"))
print(validate_lat_long((41.5, 81)))
print(validate_lat_long(41.5, lat_long=False, lat=True))

False
True
False
True
True
[14]:
df = pd.DataFrame({
    "lat_long": [
        (41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
        "41.5° N, 81.0° W", "-41.5 S;81.0 E",
        "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
        "UT: N 39°20' 0'' / W 74°35' 0''", "hello", np.nan, "NULL",
    ]
})
validate_lat_long(df["lat_long"])

0      True
1      True
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9     False
10    False
11    False
Name: lat_long, dtype: bool
[15]:
df = pd.DataFrame({
    "lat": [
        41.5, "41.5", "41.5 ", "41.5° N", "-41.5 S",
        "23 26m 22s N", "23 26' 22\" N", "UT: N 39°20' 0''",
        "hello", np.nan, "NULL",
    ]
})
validate_lat_long(df["lat"], lat_long=False, lat=True)

0      True
1      True
2      True
3      True
4     False
5      True
6      True
7      True
8     False
9     False
10    False
Name: lat, dtype: bool