# Geographic Coordinates

## Introduction

The function clean_lat_long() cleans and standardizes a DataFrame column containing latitude and/or longitude coordinates. The function validate_lat_long() validates either a single coordinate or a column of coordinates, returning True if the value is valid, and False otherwise.

The following latitude and longitude formats are supported by the output_format parameter:

• Decimal degrees (dd): 41.5

• Decimal degrees hemisphere (ddh): “41.5° N”

• Degrees minutes (dm): “41° 30′ N”

• Degrees minutes seconds (dms): “41° 30′ 0″ N”
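To make the relationship between the four formats concrete, here is a minimal sketch that renders one decimal-degree latitude in each of them. The function name format_latitude is hypothetical and this is not how clean_lat_long() is implemented; it only reproduces the example strings listed above for the value 41.5.

```python
def format_latitude(dd: float, output_format: str = "dd"):
    """Render a decimal-degree latitude in one of the four listed formats.

    Illustrative helper only; not part of dataprep.
    """
    hemisphere = "N" if dd >= 0 else "S"
    magnitude = abs(dd)
    if output_format == "dd":
        return dd
    if output_format == "ddh":
        return f"{magnitude}° {hemisphere}"
    # Work in whole seconds so the degree/minute/second split is integer arithmetic.
    total_seconds = round(magnitude * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    if output_format == "dm":
        return f"{degrees}° {minutes + seconds / 60:g}′ {hemisphere}"
    if output_format == "dms":
        return f"{degrees}° {minutes}′ {seconds}″ {hemisphere}"
    raise ValueError(f"unknown output_format {output_format!r}")


for fmt in ("dd", "ddh", "dm", "dms"):
    print(fmt, format_latitude(41.5, fmt))
```

A longitude would follow the same arithmetic with E and W in place of N and S.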

You can split a column of geographic coordinates into one column for latitude and another for longitude by setting the parameter split to True.

Values that cannot be parsed are handled according to the errors parameter:

• “coerce” (default): the value is set to NaN

• “ignore”: the original input value is returned unchanged

• “raise”: an exception is raised
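The three modes can be pictured with a small stand-alone sketch. The helper parse_decimal below is hypothetical and far simpler than the real parser (it only understands plain decimal strings); it is meant only to show what each errors setting does with a value that fails to parse.

```python
import math


def parse_decimal(value, errors="coerce"):
    """Parse a plain decimal-degree value, handling failures per the errors mode.

    Illustrative helper only; not part of dataprep.
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        if errors == "coerce":
            return math.nan   # replace the bad value with NaN
        if errors == "ignore":
            return value      # hand the input back untouched
        if errors == "raise":
            raise ValueError(f"unable to parse value {value!r}")
        raise ValueError(f"unknown errors mode {errors!r}")


print(parse_decimal("41.5"))                     # parsed normally
print(parse_decimal("hello"))                    # coerced to NaN
print(parse_decimal("hello", errors="ignore"))   # input returned as-is
```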

After cleaning, a report is printed that provides the following information:

• How many values were cleaned (the value must have been transformed).

• How many values could not be parsed.

• A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
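As a worked example of how the report figures relate to each other, the sketch below recomputes the percentages for the default example on this page (13 rows, 8 cleaned, 2 unparsable, 4 null in the result). The function cleaning_report is hypothetical, and the assumption that every percentage is taken over the total row count is inferred from the printed numbers rather than from the library source.

```python
def cleaning_report(total, cleaned, unparsed, null):
    """Rebuild the report lines from raw counts (illustrative only)."""
    def pct(n):
        return round(100 * n / total, 2)

    correct = total - null  # every non-null result is in the requested format
    return [
        f"{cleaned} values cleaned ({pct(cleaned)}%)",
        f"{unparsed} values unable to be parsed ({pct(unparsed)}%), set to NaN",
        f"Result contains {correct} ({pct(correct)}%) values in the correct format "
        f"and {null} null values ({pct(null)}%)",
    ]


for line in cleaning_report(total=13, cleaned=8, unparsed=2, null=4):
    print(line)
```

In that example the null count (4) exceeds the unparsable count (2) because it also includes the two inputs that were already null, NaN and “NULL”.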

The following sections demonstrate the functionality of clean_lat_long() and validate_lat_long().

### An example dataset with geographic coordinates

[1]:

import pandas as pd
import numpy as np
# One column mixing tuples, delimited strings, hemisphere notation, and invalid entries
df = pd.DataFrame({
    "lat_long": [
        (41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
        "41.5° N, 81.0° W", "41.5 S;81.0 E", "-41.5 S;81.0 E",
        "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
        "UT: N 39°20' 0'' / W 74°35' 0''", "hello", np.nan, "NULL",
    ]
})
df

[1]:

lat_long
0 (41.5, -81.0)
1 41.5;-81.0
2 41.5,-81.0
3 41.5 -81.0
4 41.5° N, 81.0° W
5 41.5 S;81.0 E
6 -41.5 S;81.0 E
7 23 26m 22s N 23 27m 30s E
8 23 26' 22" N 23 27' 30" E
9 UT: N 39°20' 0'' / W 74°35' 0''
10 hello
11 NaN
12 NULL

## 1. Default clean_lat_long()

By default, the output_format parameter is set to “dd” (decimal degrees) and the errors parameter is set to “coerce” (set to NaN when parsing is invalid).

[2]:

from dataprep.clean import clean_lat_long
clean_lat_long(df, "lat_long")

Latitude and Longitude Cleaning Report:
8 values cleaned (61.54%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[2]:

lat_long lat_long_clean
0 (41.5, -81.0) (41.5, -81.0)
1 41.5;-81.0 (41.5, -81.0)
2 41.5,-81.0 (41.5, -81.0)
3 41.5 -81.0 (41.5, -81.0)
4 41.5° N, 81.0° W (41.5, -81.0)
5 41.5 S;81.0 E (-41.5, 81.0)
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E (23.4394, 23.4583)
8 23 26' 22" N 23 27' 30" E (23.4394, 23.4583)
9 UT: N 39°20' 0'' / W 74°35' 0'' (39.3333, -74.5833)
10 hello NaN
11 NaN NaN
12 NULL NaN

Note that (41.5, -81.0) is not counted as cleaned in the report, since its resulting format is the same as the input. Also, “-41.5 S;81.0 E” is invalid because a coordinate that specifies a hemisphere cannot also contain a negative decimal value.
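The decimal values in the table follow from the usual conversion dd = degrees + minutes/60 + seconds/3600, with the sign taken from the hemisphere. The sketch below applies that formula and the hemisphere rule just described. The function dms_to_dd is hypothetical, and rounding to four decimals is an assumption made here only to match the displayed output.

```python
def dms_to_dd(degrees, minutes=0.0, seconds=0.0, hemisphere=None):
    """Convert degrees/minutes/seconds to signed decimal degrees (illustrative only)."""
    # A hemisphere letter already carries the sign, so a negative value is contradictory.
    if hemisphere is not None and degrees < 0:
        raise ValueError("a coordinate with a hemisphere cannot be negative")
    dd = abs(degrees) + minutes / 60 + seconds / 3600
    if degrees < 0 or hemisphere in ("S", "W"):
        dd = -dd
    return round(dd, 4)


print(dms_to_dd(23, 26, 22, "N"))        # 23.4394
print(dms_to_dd(23, 27, 30, "E"))        # 23.4583
print(dms_to_dd(41.5, hemisphere="S"))   # -41.5
```

Calling dms_to_dd(-41.5, hemisphere="S") raises a ValueError, mirroring the NaN in row 6.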

## 2. Output formats

This section demonstrates the supported latitudinal and longitudinal formats.

### decimal degrees hemisphere (ddh)

[3]:

clean_lat_long(df, "lat_long", output_format="ddh")

Latitude and Longitude Cleaning Report:
8 values cleaned (61.54%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[3]:

lat_long lat_long_clean
0 (41.5, -81.0) 41.5° N, 81.0° W
1 41.5;-81.0 41.5° N, 81.0° W
2 41.5,-81.0 41.5° N, 81.0° W
3 41.5 -81.0 41.5° N, 81.0° W
4 41.5° N, 81.0° W 41.5° N, 81.0° W
5 41.5 S;81.0 E 41.5° S, 81.0° E
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E 23.4394° N, 23.4583° E
8 23 26' 22" N 23 27' 30" E 23.4394° N, 23.4583° E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39.3333° N, 74.5833° W
10 hello NaN
11 NaN NaN
12 NULL NaN

### degrees minutes (dm)

[4]:

clean_lat_long(df, "lat_long", output_format="dm")

Latitude and Longitude Cleaning Report:
9 values cleaned (69.23%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[4]:

lat_long lat_long_clean
0 (41.5, -81.0) 41° 30′ N, 81° 0′ W
1 41.5;-81.0 41° 30′ N, 81° 0′ W
2 41.5,-81.0 41° 30′ N, 81° 0′ W
3 41.5 -81.0 41° 30′ N, 81° 0′ W
4 41.5° N, 81.0° W 41° 30′ N, 81° 0′ W
5 41.5 S;81.0 E 41° 30′ S, 81° 0′ E
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E 23° 26.3667′ N, 23° 27.5′ E
8 23 26' 22" N 23 27' 30" E 23° 26.3667′ N, 23° 27.5′ E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39° 20′ N, 74° 35′ W
10 hello NaN
11 NaN NaN
12 NULL NaN

### degrees minutes seconds (dms)

[5]:

clean_lat_long(df, "lat_long", output_format="dms")

Latitude and Longitude Cleaning Report:
9 values cleaned (69.23%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[5]:

lat_long lat_long_clean
0 (41.5, -81.0) 41° 30′ 0″ N, 81° 0′ 0″ W
1 41.5;-81.0 41° 30′ 0″ N, 81° 0′ 0″ W
2 41.5,-81.0 41° 30′ 0″ N, 81° 0′ 0″ W
3 41.5 -81.0 41° 30′ 0″ N, 81° 0′ 0″ W
4 41.5° N, 81.0° W 41° 30′ 0″ N, 81° 0′ 0″ W
5 41.5 S;81.0 E 41° 30′ 0″ S, 81° 0′ 0″ E
6 -41.5 S;81.0 E NaN
7 23 26m 22s N 23 27m 30s E 23° 26′ 22″ N, 23° 27′ 30″ E
8 23 26' 22" N 23 27' 30" E 23° 26′ 22″ N, 23° 27′ 30″ E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39° 20′ 0″ N, 74° 34′ 60″ W
10 hello NaN
11 NaN NaN
12 NULL NaN
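Row 9 shows 74° 34′ 60″ W where 74° 35′ 0″ W would be expected. One plausible cause, offered as an assumption rather than a statement about the library's internals, is floating-point error: 74 + 35/60 is stored as slightly less than 74.58333…, so truncating to whole minutes gives 34 and the remaining seconds round up to 60. The sketch below reproduces that effect with a naive conversion and shows one way to avoid it by rounding to whole seconds first. Both function names are hypothetical.

```python
def naive_dms(dd):
    """Split degrees, minutes, seconds by successive truncation."""
    degrees = int(dd)
    minutes_full = (dd - degrees) * 60
    minutes = int(minutes_full)                      # 34.9999... truncates to 34
    seconds = round((minutes_full - minutes) * 60)   # 59.9999... rounds to 60
    return degrees, minutes, seconds


def carried_dms(dd):
    """Round to whole seconds first, then split with integer arithmetic."""
    total_seconds = round(dd * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return degrees, minutes, seconds


print(naive_dms(74 + 35 / 60))    # (74, 34, 60)
print(carried_dms(74 + 35 / 60))  # (74, 35, 0)
print(naive_dms(39 + 20 / 60))    # (39, 20, 0)
```

The latitude in the same row, 39° 20′, is stored slightly above its true value, which is why it comes out cleanly even with the naive approach.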

## 3. split parameter

The split parameter adds individual columns containing the cleaned latitude and longitude values to the given DataFrame.

[6]:

clean_lat_long(df, "lat_long", split=True)

Latitude and Longitude Cleaning Report:
9 values cleaned (69.23%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[6]:

lat_long latitude longitude
0 (41.5, -81.0) 41.5000 -81.0000
1 41.5;-81.0 41.5000 -81.0000
2 41.5,-81.0 41.5000 -81.0000
3 41.5 -81.0 41.5000 -81.0000
4 41.5° N, 81.0° W 41.5000 -81.0000
5 41.5 S;81.0 E -41.5000 81.0000
6 -41.5 S;81.0 E NaN NaN
7 23 26m 22s N 23 27m 30s E 23.4394 23.4583
8 23 26' 22" N 23 27' 30" E 23.4394 23.4583
9 UT: N 39°20' 0'' / W 74°35' 0'' 39.3333 -74.5833
10 hello NaN NaN
11 NaN NaN NaN
12 NULL NaN NaN

The split parameter can be combined with other output formats.

[7]:

clean_lat_long(df, "lat_long", split=True, output_format="dm")

Latitude and Longitude Cleaning Report:
9 values cleaned (69.23%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[7]:

lat_long latitude longitude
0 (41.5, -81.0) 41° 30′ N 81° 0′ W
1 41.5;-81.0 41° 30′ N 81° 0′ W
2 41.5,-81.0 41° 30′ N 81° 0′ W
3 41.5 -81.0 41° 30′ N 81° 0′ W
4 41.5° N, 81.0° W 41° 30′ N 81° 0′ W
5 41.5 S;81.0 E 41° 30′ S 81° 0′ E
6 -41.5 S;81.0 E NaN NaN
7 23 26m 22s N 23 27m 30s E 23° 26.3667′ N 23° 27.5′ E
8 23 26' 22" N 23 27' 30" E 23° 26.3667′ N 23° 27.5′ E
9 UT: N 39°20' 0'' / W 74°35' 0'' 39° 20′ N 74° 35′ W
10 hello NaN NaN
11 NaN NaN NaN
12 NULL NaN NaN

## 4. inplace parameter

Setting inplace to True removes the original column from the returned DataFrame. The cleaned coordinates are placed in a new column named "{original title}_clean".

[8]:

clean_lat_long(df, "lat_long", inplace=True)

Latitude and Longitude Cleaning Report:
8 values cleaned (61.54%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[8]:

lat_long_clean
0 (41.5, -81.0)
1 (41.5, -81.0)
2 (41.5, -81.0)
3 (41.5, -81.0)
4 (41.5, -81.0)
5 (-41.5, 81.0)
6 NaN
7 (23.4394, 23.4583)
8 (23.4394, 23.4583)
9 (39.3333, -74.5833)
10 NaN
11 NaN
12 NaN

### inplace and split

[9]:

clean_lat_long(df, "lat_long", split=True, inplace=True)

Latitude and Longitude Cleaning Report:
9 values cleaned (69.23%)
2 values unable to be parsed (15.38%), set to NaN
Result contains 9 (69.23%) values in the correct format and 4 null values (30.77%)

[9]:

latitude longitude
0 41.5000 -81.0000
1 41.5000 -81.0000
2 41.5000 -81.0000
3 41.5000 -81.0000
4 41.5000 -81.0000
5 -41.5000 81.0000
6 NaN NaN
7 23.4394 23.4583
8 23.4394 23.4583
9 39.3333 -74.5833
10 NaN NaN
11 NaN NaN
12 NaN NaN

## 5. Latitude and longitude coordinates in separate columns

### Clean latitude or longitude coordinates individually

[10]:

df = pd.DataFrame({"lat": [" 30′ 0″ E", "41° 30′ N", "41 S", "80", "hello", "NA"]})
clean_lat_long(df, lat_col="lat")

Latitude and Longitude Cleaning Report:
3 values cleaned (50.0%)
2 values unable to be parsed (33.33%), set to NaN
Result contains 3 (50.0%) values in the correct format and 3 null values (50.0%)

[10]:

lat lat_clean
0 30′ 0″ E NaN
1 41° 30′ N 41.5
2 41 S -41.0
3 80 80.0
4 hello NaN
5 NA NaN

### Combine and clean separate columns

Latitude and longitude values are counted separately in the report.

[11]:

df = pd.DataFrame({
    "lat": ["30° E", "41° 30′ N", "41 S", "80", "hello", "NA"],
    "long": ["30° E", "41° 30′ N", "41 W", "80", "hello", "NA"],
})
clean_lat_long(df, lat_col="lat", long_col="long")

Latitude and Longitude Cleaning Report:
6 values cleaned (100.0%)
Result contains 6 (100.0%) values in the correct format and 0 null values (0.0%)

[11]:

lat long latitude_longitude
0 30° E 30° E (nan, 30.0)
1 41° 30′ N 41° 30′ N (41.5, nan)
2 41 S 41 W (-41.0, -41.0)
3 80 80 (80.0, 80.0)
4 hello hello (nan, nan)
5 NA NA (nan, nan)
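The table shows that the same string can be valid in one column and invalid in the other: “30° E” yields NaN as a latitude but 30.0 as a longitude, and “41° 30′ N” does the opposite. A minimal sketch of such a per-axis check follows. The helper signed_value and its lookup tables are hypothetical, and the range limits (±90 for latitude, ±180 for longitude) are the standard geographic bounds, assumed here rather than taken from the library source.

```python
import math
from typing import Optional

LIMITS = {"lat": 90.0, "long": 180.0}
HEMISPHERES = {"lat": {"N": 1, "S": -1}, "long": {"E": 1, "W": -1}}


def signed_value(magnitude: float, hemisphere: Optional[str], axis: str) -> float:
    """Return a signed decimal degree, or NaN if invalid for the axis (illustrative only)."""
    sign = 1
    if hemisphere is not None:
        # The hemisphere letter must belong to this axis, and it replaces the sign.
        if hemisphere not in HEMISPHERES[axis] or magnitude < 0:
            return math.nan
        sign = HEMISPHERES[axis][hemisphere]
    if abs(magnitude) > LIMITS[axis]:
        return math.nan
    return sign * magnitude


print(signed_value(30.0, "E", "lat"), signed_value(30.0, "E", "long"))
print(signed_value(41.0, "S", "lat"), signed_value(41.0, "W", "long"))
print(signed_value(80.0, None, "lat"), signed_value(80.0, None, "long"))
```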

### Clean separate columns and split the output

[12]:

clean_lat_long(df, lat_col="lat", long_col="long", split=True)

Latitude and Longitude Cleaning Report:
3 values cleaned (50.0%)
2 values unable to be parsed (33.33%), set to NaN
Result contains 3 (50.0%) values in the correct format and 3 null values (50.0%)

[12]:

lat long lat_clean long_clean
0 30° E 30° E NaN 30.0
1 41° 30′ N 41° 30′ N 41.5 NaN
2 41 S 41 W -41.0 -41.0
3 80 80 80.0 80.0
4 hello hello NaN NaN
5 NA NA NaN NaN

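## 6. validate_lat_long()

validate_lat_long() returns True when the input is a valid latitude or longitude value; otherwise it returns False. It accepts the same input types as clean_lat_long(). In the example below, the lone latitude “41° 30′ 0″ N” is reported as invalid under the default settings, while 41.5 passes once lat_long=False and lat=True are set to validate a single latitude.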

[13]:

from dataprep.clean import validate_lat_long
print(validate_lat_long("41° 30′ 0″ N"))
print(validate_lat_long("41.5 S;81.0 E"))
print(validate_lat_long("-41.5 S;81.0 E"))
print(validate_lat_long((41.5, 81)))
print(validate_lat_long(41.5, lat_long=False, lat=True))

False
True
False
True
True

[14]:

df = pd.DataFrame({
    "lat_long": [
        (41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
        "41.5° N, 81.0° W", "-41.5 S;81.0 E",
        "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
        "UT: N 39°20' 0'' / W 74°35' 0''", "hello", np.nan, "NULL",
    ]
})
validate_lat_long(df["lat_long"])

[14]:

0      True
1      True
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9     False
10    False
11    False
Name: lat_long, dtype: bool


### Validate only one coordinate

[15]:

df = pd.DataFrame({
    "lat": [
        41.5, "41.5", "41.5  ",
        "41.5° N", "-41.5 S",
        "23 26m 22s N", "23 26' 22\" N",
        "UT: N 39°20' 0''", "hello", np.nan, "NULL",
    ]
})
validate_lat_long(df["lat"], lat_long=False, lat=True)

[15]:

0      True
1      True
2      True
3      True
4     False
5      True
6      True
7      True
8     False
9     False
10    False
Name: lat, dtype: bool