Geographic Coordinates

Introduction

The function clean_lat_long() cleans and standardizes a DataFrame column containing latitude and/or longitude coordinates. The function validate_lat_long() validates either a single coordinate or a column of coordinates, returning True if the value is valid, and False otherwise.

The following latitude and longitude formats are supported by the output_format parameter:

  • Decimal degrees (dd): 41.5

  • Decimal degrees hemisphere (ddh): “41.5° N”

  • Degrees minutes (dm): “41° 30′ N”

  • Degrees minutes seconds (dms): “41° 30′ 0″ N”
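The four formats above describe the same angle in different notations. As a rough illustration of the arithmetic that relates them, the sketch below converts a degrees/minutes/seconds reading into signed decimal degrees. The helper name dms_to_dd is hypothetical and is not part of the library; rounding to four decimal places is an assumption made only to mirror the values displayed in this guide.

```python
def dms_to_dd(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    # One degree is 60 minutes, one minute is 60 seconds.
    magnitude = degrees + minutes / 60 + seconds / 3600
    # By convention, South and West are negative in decimal degrees.
    sign = -1 if hemisphere in ("S", "W") else 1
    # Assumption: round to 4 decimal places to match the displayed outputs.
    return round(sign * magnitude, 4)


print(dms_to_dd(41, 30, 0, "N"))   # 41.5
print(dms_to_dd(23, 26, 22, "N"))  # 23.4394
print(dms_to_dd(74, 35, 0, "W"))   # -74.5833
```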

You can split a column of geographic coordinates into one column for latitude and another for longitude by setting the parameter split to True.

Values that cannot be parsed are handled according to the errors parameter:

  • “coerce” (default): values that cannot be parsed are set to NaN

  • “ignore”: values that cannot be parsed are returned unchanged

  • “raise”: a value that cannot be parsed raises an exception
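The three error-handling modes can be pictured with a small, self-contained sketch. The functions parse_decimal and clean_value are hypothetical stand-ins (the toy parser only accepts plain decimal numbers), so this shows the general pattern rather than how the library is implemented.

```python
import math


def parse_decimal(value):
    """Hypothetical stand-in parser: accepts only plain decimal numbers."""
    return float(value)


def clean_value(value, errors="coerce"):
    """Apply the toy parser and handle failures according to `errors`."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors mode: {errors!r}")
    try:
        return parse_decimal(value)
    except (TypeError, ValueError):
        if errors == "raise":
            raise ValueError(f"unable to parse value {value!r}") from None
        if errors == "ignore":
            return value  # hand the input back untouched
        return math.nan   # "coerce": replace the value with NaN


print(clean_value("41.5"))                    # 41.5
print(clean_value("hello"))                   # nan
print(clean_value("hello", errors="ignore"))  # hello
```

With errors="raise", the same call on "hello" raises a ValueError in this sketch.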

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (a value counts as cleaned only if it was transformed).

  • How many values could not be parsed.

  • A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
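To make the relationship between these figures concrete, the sketch below tallies them for the example column and default output used later in this guide. The names NULL_TOKENS, is_null and summarize are hypothetical, and treating the strings "NULL" and "NA" as null markers is an assumption inferred from the example outputs; this is not the library's reporting code.

```python
import math

NULL_TOKENS = {"NULL", "NA"}  # assumption: strings treated as missing values


def is_null(value):
    """True for float NaN and for the assumed null marker strings."""
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in NULL_TOKENS


def summarize(original, cleaned):
    """Count cleaned, unparsed, valid and null values (illustrative only)."""
    total = len(original)
    n_cleaned = n_unparsed = n_null = 0
    for before, after in zip(original, cleaned):
        if is_null(after):
            n_null += 1
            if not is_null(before):
                n_unparsed += 1  # real input that could not be parsed
        elif before != after:
            n_cleaned += 1       # the value was transformed
    counts = {"cleaned": n_cleaned, "unparsed": n_unparsed,
              "valid": total - n_null, "null": n_null}
    percents = {k: round(100 * v / total, 2) for k, v in counts.items()}
    return counts, percents


nan = float("nan")
original = [(41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
            "41.5° N, 81.0° W", "41.5 S;81.0 E", "-41.5 S;81.0 E",
            "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
            "UT: N 39°20' 0'' / W 74°35' 0''", "hello", nan, "NULL"]
cleaned = [(41.5, -81.0)] * 5 + [(-41.5, 81.0), nan,
           (23.4394, 23.4583), (23.4394, 23.4583), (39.3333, -74.5833),
           nan, nan, nan]

counts, percents = summarize(original, cleaned)
print(counts)    # cleaned=8, unparsed=2, valid=9, null=4
print(percents)  # 61.54, 15.38, 69.23, 30.77
```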

The following sections demonstrate the functionality of clean_lat_long() and validate_lat_long().

An example dataset with geographic coordinates

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "lat_long":
    [(41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
     "41.5° N, 81.0° W", "41.5 S;81.0 E", "-41.5 S;81.0 E",
     "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
     "UT: N 39°20' 0'' / W 74°35' 0''", "hello", np.nan, "NULL"]
})
df
[1]:
lat_long
0 (41.5, -81.0)
1 41.5;-81.0
2 41.5,-81.0
3 41.5 -81.0
4 41.5° N, 81.0° W
5 41.5 S;81.0 E
6 -41.5 S;81.0 E
7 23 26m 22s N 23 27m 30s E
8 23 26' 22" N 23 27' 30" E
9 UT: N 39°20' 0'' / W 74°35' 0''
10 hello
11 NaN
12 NULL

1. Default clean_lat_long()

By default, the output_format parameter is set to “dd” (decimal degrees) and the errors parameter is set to “coerce” (values that cannot be parsed are set to NaN).

[2]:
from dataprep.clean import clean_lat_long
clean_lat_long(df, "lat_long")
Latitude and Longitude Cleaning Report:
        8 values cleaned (61.54%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[2]:
lat_long lat_long_clean
0 (41.5, -81.0) (41.5, -81.0)
1 41.5;-81.0 (41.5, -81.0)
2 41.5,-81.0 (41.5, -81.0)
3 41.5 -81.0 (41.5, -81.0)
4 41.5° N, 81.0° W (41.5, -81.0)
5 41.5 S;81.0 E (-41.5, 81.0)
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E (23.4394, 23.4583)
8 23 26' 22" N 23 27' 30" E (23.4394, 23.4583)
9 UT: N 39°20' 0'' / W 74°35' 0'' (39.3333, -74.5833)
10 hello NaN
11 NaN NaN
12 NULL NaN

Note that (41.5, -81.0) is not counted as cleaned in the report because its resulting format is identical to the input. Also, “-41.5 S;81.0 E” is invalid: a coordinate that specifies a hemisphere cannot also have a negative decimal value.
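The hemisphere rule described above can be approximated with a simple check on a single coordinate string. The function has_sign_conflict and both regular expressions are hypothetical and deliberately simplified (uppercase hemisphere letters only, one coordinate at a time); the library's own parsing may differ.

```python
import re

# Assumption: a negative value is a minus sign followed by a digit.
NEGATIVE_NUMBER = re.compile(r"-\s*\d")
# Assumption: a hemisphere is a standalone uppercase N, S, E or W.
HEMISPHERE_LETTER = re.compile(r"\b[NSEW]\b")


def has_sign_conflict(coordinate):
    """True if the text has both a negative number and a hemisphere letter."""
    has_negative = bool(NEGATIVE_NUMBER.search(coordinate))
    has_hemisphere = bool(HEMISPHERE_LETTER.search(coordinate))
    return has_negative and has_hemisphere


print(has_sign_conflict("-41.5 S"))  # True: sign and hemisphere both present
print(has_sign_conflict("41.5 S"))   # False
print(has_sign_conflict("-41.5"))    # False
```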

2. Output formats

This section demonstrates the supported latitudinal and longitudinal formats.

Decimal degrees hemisphere (ddh)

[3]:
clean_lat_long(df, "lat_long", output_format="ddh")
Latitude and Longitude Cleaning Report:
        8 values cleaned (61.54%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[3]:
lat_long lat_long_clean
0 (41.5, -81.0) 41.5° N, 81.0° W
1 41.5;-81.0 41.5° N, 81.0° W
2 41.5,-81.0 41.5° N, 81.0° W
3 41.5 -81.0 41.5° N, 81.0° W
4 41.5° N, 81.0° W 41.5° N, 81.0° W
5 41.5 S;81.0 E 41.5° S, 81.0° E
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E 23.4394° N, 23.4583° E
8 23 26' 22" N 23 27' 30" E 23.4394° N, 23.4583° E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39.3333° N, 74.5833° W
10 hello NaN
11 NaN NaN
12 NULL NaN

Degrees minutes (dm)

[4]:
clean_lat_long(df, "lat_long", output_format="dm")
Latitude and Longitude Cleaning Report:
        9 values cleaned (69.23%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[4]:
lat_long lat_long_clean
0 (41.5, -81.0) 41° 30′ N, 81° 0′ W
1 41.5;-81.0 41° 30′ N, 81° 0′ W
2 41.5,-81.0 41° 30′ N, 81° 0′ W
3 41.5 -81.0 41° 30′ N, 81° 0′ W
4 41.5° N, 81.0° W 41° 30′ N, 81° 0′ W
5 41.5 S;81.0 E 41° 30′ S, 81° 0′ E
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E 23° 26.3667′ N, 23° 27.5′ E
8 23 26' 22" N 23 27' 30" E 23° 26.3667′ N, 23° 27.5′ E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39° 20′ N, 74° 35′ W
10 hello NaN
11 NaN NaN
12 NULL NaN

Degrees minutes seconds (dms)

[5]:
clean_lat_long(df, "lat_long", output_format="dms")
Latitude and Longitude Cleaning Report:
        9 values cleaned (69.23%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[5]:
lat_long lat_long_clean
0 (41.5, -81.0) 41° 30′ 0″ N, 81° 0′ 0″ W
1 41.5;-81.0 41° 30′ 0″ N, 81° 0′ 0″ W
2 41.5,-81.0 41° 30′ 0″ N, 81° 0′ 0″ W
3 41.5 -81.0 41° 30′ 0″ N, 81° 0′ 0″ W
4 41.5° N, 81.0° W 41° 30′ 0″ N, 81° 0′ 0″ W
5 41.5 S;81.0 E 41° 30′ 0″ S, 81° 0′ 0″ E
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E 23° 26′ 22″ N, 23° 27′ 30″ E
8 23 26' 22" N 23 27' 30" E 23° 26′ 22″ N, 23° 27′ 30″ E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39° 20′ 0″ N, 74° 34′ 60″ W
10 hello NaN
11 NaN NaN
12 NULL NaN
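In the output above, row 9 shows the longitude as 74° 34′ 60″ W rather than 74° 35′ 0″ W. This is plausibly a floating-point artifact of converting decimal degrees back to minutes and seconds, although the source does not confirm the cause. One common way to avoid such a carry problem is to round to whole seconds first and then split with integer arithmetic. The sketch below shows that approach; dd_to_dms is a hypothetical helper, rounding to whole seconds is an assumption, and this is not the library's implementation.

```python
def dd_to_dms(value, positive="N", negative="S"):
    """Format signed decimal degrees as degrees, minutes and whole seconds.

    Hypothetical helper; rounds to whole seconds before splitting so that
    the seconds field can never be displayed as 60.
    """
    hemisphere = negative if value < 0 else positive
    total_seconds = round(abs(value) * 3600)          # integer number of seconds
    degrees, remainder = divmod(total_seconds, 3600)  # 3600 seconds per degree
    minutes, seconds = divmod(remainder, 60)          # 60 seconds per minute
    return f"{degrees}° {minutes}′ {seconds}″ {hemisphere}"


print(dd_to_dms(41.5))                  # 41° 30′ 0″ N
print(dd_to_dms(-(74 + 35 / 60), "E", "W"))  # 74° 35′ 0″ W
```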

3. split parameter

The split parameter adds individual columns containing the cleaned latitude and longitude values to the given DataFrame.

[6]:
clean_lat_long(df, "lat_long", split=True)
Latitude and Longitude Cleaning Report:
        9 values cleaned (69.23%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[6]:
lat_long latitude longitude
0 (41.5, -81.0) 41.5000 -81.0000
1 41.5;-81.0 41.5000 -81.0000
2 41.5,-81.0 41.5000 -81.0000
3 41.5 -81.0 41.5000 -81.0000
4 41.5° N, 81.0° W 41.5000 -81.0000
5 41.5 S;81.0 E -41.5000 81.0000
6 -41.5 S;81.0 E NaN NaN
7 23 26m 22s N 23 27m 30s E 23.4394 23.4583
8 23 26' 22" N 23 27' 30" E 23.4394 23.4583
9 UT: N 39°20' 0'' / W 74°35' 0'' 39.3333 -74.5833
10 hello NaN NaN
11 NaN NaN NaN
12 NULL NaN NaN

The split parameter can be combined with any of the output formats.

[7]:
clean_lat_long(df, "lat_long", split=True, output_format="dm")
Latitude and Longitude Cleaning Report:
        9 values cleaned (69.23%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[7]:
lat_long latitude longitude
0 (41.5, -81.0) 41° 30′ N 81° 0′ W
1 41.5;-81.0 41° 30′ N 81° 0′ W
2 41.5,-81.0 41° 30′ N 81° 0′ W
3 41.5 -81.0 41° 30′ N 81° 0′ W
4 41.5° N, 81.0° W 41° 30′ N 81° 0′ W
5 41.5 S;81.0 E 41° 30′ S 81° 0′ E
6 -41.5 S;81.0 E NaN NaN
7 23 26m 22s N 23 27m 30s E 23° 26.3667′ N 23° 27.5′ E
8 23 26' 22" N 23 27' 30" E 23° 26.3667′ N 23° 27.5′ E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39° 20′ N 74° 35′ W
10 hello NaN NaN
11 NaN NaN NaN
12 NULL NaN NaN

4. inplace parameter

Setting inplace to True removes the given column from the returned DataFrame. A new column containing the cleaned coordinates is added, with a title in the format "{original title}_clean".

[8]:
clean_lat_long(df, "lat_long", inplace=True)
Latitude and Longitude Cleaning Report:
        8 values cleaned (61.54%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[8]:
lat_long_clean
0 (41.5, -81.0)
1 (41.5, -81.0)
2 (41.5, -81.0)
3 (41.5, -81.0)
4 (41.5, -81.0)
5 (-41.5, 81.0)
6 NaN
7 (23.4394, 23.4583)
8 (23.4394, 23.4583)
9 (39.3333, -74.5833)
10 NaN
11 NaN
12 NaN

inplace and split

[9]:
clean_lat_long(df, "lat_long", split=True, inplace=True)
Latitude and Longitude Cleaning Report:
        9 values cleaned (69.23%)
        2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)
[9]:
latitude longitude
0 41.5000 -81.0000
1 41.5000 -81.0000
2 41.5000 -81.0000
3 41.5000 -81.0000
4 41.5000 -81.0000
5 -41.5000 81.0000
6 NaN NaN
7 23.4394 23.4583
8 23.4394 23.4583
9 39.3333 -74.5833
10 NaN NaN
11 NaN NaN
12 NaN NaN

5. Latitude and longitude coordinates in separate columns

Clean latitude or longitude coordinates individually

[10]:
df = pd.DataFrame({"lat": [" 30′ 0″ E", "41° 30′ N", "41 S", "80", "hello", "NA"]})
clean_lat_long(df, lat_col="lat")
Latitude and Longitude Cleaning Report:
        3 values cleaned (50.0%)
        2 values unable to be parsed (33.33%), set to NaN
Result contains 3 (50.0%) values in the correct format and 3 null values (50.0%)
[10]:
lat lat_clean
0 30′ 0″ E NaN
1 41° 30′ N 41.5
2 41 S -41.0
3 80 80.0
4 hello NaN
5 NA NaN

Combine and clean separate columns

Latitude and longitude values are counted separately in the report.

[11]:
df = pd.DataFrame({"lat": ["30° E", "41° 30′ N", "41 S", "80", "hello", "NA"],
                   "long": ["30° E", "41° 30′ N", "41 W", "80", "hello", "NA"]})
clean_lat_long(df, lat_col="lat", long_col="long")
Latitude and Longitude Cleaning Report:
        6 values cleaned (100.0%)
Result contains 6 (100.0%) values in the correct format and 0 null values (0.0%)
[11]:
lat long latitude_longitude
0 30° E 30° E (nan, 30.0)
1 41° 30′ N 41° 30′ N (41.5, nan)
2 41 S 41 W (-41.0, -41.0)
3 80 80 (80.0, 80.0)
4 hello hello (nan, nan)
5 NA NA (nan, nan)

Clean separate columns and split the output

[12]:
clean_lat_long(df, lat_col="lat", long_col="long", split=True)
Latitude and Longitude Cleaning Report:
        3 values cleaned (50.0%)
        2 values unable to be parsed (33.33%), set to NaN
Result contains 3 (50.0%) values in the correct format and 3 null values (50.0%)
[12]:
lat long lat_clean long_clean
0 30° E 30° E NaN 30.0
1 41° 30′ N 41° 30′ N 41.5 NaN
2 41 S 41 W -41.0 -41.0
3 80 80 80.0 80.0
4 hello hello NaN NaN
5 NA NA NaN NaN

6. validate_lat_long()

validate_lat_long() returns True when the input is a valid latitude or longitude value; otherwise, it returns False. It accepts the same input types as clean_lat_long().

[13]:
from dataprep.clean import validate_lat_long
print(validate_lat_long("41° 30′ 0″ N"))
print(validate_lat_long("41.5 S;81.0 E"))
print(validate_lat_long("-41.5 S;81.0 E"))
print(validate_lat_long((41.5, 81)))
print(validate_lat_long(41.5, lat_long=False, lat=True))
False
True
False
True
True
[14]:
df = pd.DataFrame({"lat_long":
                   [(41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
                    "41.5° N, 81.0° W", "-41.5 S;81.0 E",
                    "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
                    "UT: N 39°20' 0'' / W 74°35' 0''", "hello", np.nan, "NULL"]
                  })
validate_lat_long(df["lat_long"])
[14]:
0      True
1      True
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9     False
10    False
11    False
Name: lat_long, dtype: bool

Validate only one coordinate

[15]:
df = pd.DataFrame({"lat":
                   [41.5, "41.5", "41.5  ",
                    "41.5° N", "-41.5 S",
                    "23 26m 22s N", "23 26' 22\" N",
                    "UT: N 39°20' 0''", "hello", np.nan, "NULL"]
                  })
validate_lat_long(df["lat"], lat_long=False, lat=True)
[15]:
0      True
1      True
2      True
3      True
4     False
5      True
6      True
7      True
8     False
9     False
10    False
Name: lat, dtype: bool
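For plain decimal inputs, validating a single coordinate largely comes down to a numeric range check: latitudes lie between -90 and 90 degrees and longitudes between -180 and 180 degrees. The sketch below illustrates that check only. The function is_valid_decimal is hypothetical, it does not handle hemisphere or degrees/minutes/seconds notation, and whether the library applies exactly these limits is an assumption.

```python
import math


def is_valid_decimal(value, lat=True):
    """Range-check a plain decimal coordinate (illustrative only)."""
    try:
        number = float(value)  # float() tolerates surrounding whitespace
    except (TypeError, ValueError):
        return False
    if math.isnan(number):
        return False
    # Assumed limits: latitude within ±90°, longitude within ±180°.
    limit = 90.0 if lat else 180.0
    return -limit <= number <= limit


print(is_valid_decimal(41.5))              # True
print(is_valid_decimal("41.5  "))          # True
print(is_valid_decimal("hello"))           # False
print(is_valid_decimal(120))               # False: outside the latitude range
print(is_valid_decimal(120, lat=False))    # True: within the longitude range
```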