clean_date()
The function clean_date() cleans a column containing date strings, and standardizes them in a desired format. The function validate_date() validates either a single date string or a column of date strings, returning “cleaned” standing for at the first stage the value is valid, and “unknown” otherwise. Note that the first stage means the initial format is correct. However, if the scenario like minite equals to 70 occurs, the function cannot immediately recognize this kind of error at first stage. They will be recognized during running of clean_date(). Also, this kind of error will not be cleaned by our function.
validate_date()
Currently, many flexible date format like the following format are supported as valid input:
1996.07.10 AD at 15:08:56 PDT
Tuesday, April 12, 1952 AD 3:30:42pm PST
2003 Sep 25
12:00am
Thu Sep 25 10:36:28 2003
Various delimiters between the digits are also allowed: [" ", ".", ",", ";", "-", "/", "'", "st", "nd", "rd", "th", "at", "on", "and", "ad", "AD", "of"]
[" ", ".", ",", ";", "-", "/", "'", "st", "nd", "rd", "th", "at", "on", "and", "ad", "AD", "of"]
Phone numbers can be converted to the following formats via the target_format parameter. Also, users can specify many flexible target format like these:
target_format
YYYY-MM-DD
yyyy.MM.dd AD at HH:mm:ss Z
EEE, d MMM yyyy HH:mm:ss Z
Users also can specify origin_timezone and target_timezone like PDT,GMT etc. When formatting the date, timezone will be transferred from origin timezone to target timezone.
origin_timezone
target_timezone
PDT
GMT
Invalid parsing is handled with the fix_empty parameter:
fix_empty
auto_minimum (default):
auto_minimum
For hours, minutes and seconds, just fill them with zeros
For years, months and days, fill it with the minimum value
empty: just left the missing component as it is
empty
auto_nearest:
auto_nearest
For hours, minutes and seconds, fill it with the nearest value
For years, months and days, fill it with the nearest value
After cleaning, a report is printed that provides the following information:
How many values were cleaned (the value must be transformed)
How many values could not be cleaned
And the data summary: how many values are in the correct format, and how many values are null
The following sections demonstrate the functionality of clean_date() and validate_date().
[1]:
import pandas as pd import numpy as np df = pd.DataFrame({"date": ['1996.07.10 AD at 15:08:56 PDT', 'Thu Sep 25 10:36:28 2003', 'Thu Sep 25 10:36:28 BRST 2003', '2003 10:36:28 BRST 25 Sep Thu', 'Thu Sep 25 10:36:28 2003', 'Thu 10:36:28', 'Thu 10:36', '10:36', 'Thu Sep 25 2003', 'Sep 25 2003', 'Sep 2003', 'Sep', '2003', '2003-09-25', '2003-Sep-25', '25-Sep-2003', 'Sep-25-2003', '09-25-2003', '10-09-2003', '10-09-03', '2003.Sep.25', '2003/09/25', '2003 Sep 25', '2003 09 25', '10pm', '12:00am', 'Sep 03', 'Sep of 03', 'Wed, July 10, 96', '1996.07.10 AD at 15:08:56 PDT', 'Tuesday, April 12, 1952 AD 3:30:42pm PST', 'November 5, 1994, 8:15:30 am EST', '3rd of May 2001', '5:50 AM on June 13, 1990', 'NULL', 'nan', 'I\'m a little cat', 'This is Sep.']}) df
By default, the target_format parameter is set to “YYYY-MM-DD hh:mm:ss”, the origin_timezone parameter is set to “UTC”, the fix_empty parameter is set to “auto_minimum” and the show_report parameter is set to “True”. And we don’t specify the target_timezone parameter.
show_report
[2]:
from dataprep.clean import clean_date clean_date(df, 'date')
Dates Cleaning Report: 34 values cleaned (89.47%) 2 values unable to be parsed (5.26%), set to NaN Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
output_format
This section demonstrate some valid target format. In fact, our function can support very flexible target formats, such as YYYY-MM-DD and yyyy.MM.dd AD at HH:mm:ss z. Users just need to specify tokens standing for year, month, day, hour, minute and second with valid separators.
yyyy.MM.dd AD at HH:mm:ss z
The tokens we support are listed in the following table.
Component
Token
Year
"yyyy"(2015), "yy"(15), "YYYY"(2015), "YY"(15), "Y"(15), "y"(15)
Month
"MM"(01), "M"(1), "MMM"(Jan.), "MMMMM"(January)
Day
"dd"(05), "d"(5), "DD"(05), "D"(5)
Hour
"hh"(06), "h"(6), "HH"(06), "H"(6)
Minute
"mm"(08), "m"(8)
Second
"ss"(09), "s"(9), "SS"(09), "S"(9)
Weekday
"eee"(Mon.), "EEE"(Mon.), "eeeee"(Monday), "EEEEE"(Monday)
Timezone
"Z"(UTC+00:00),'z'(GMT)
The separators we support are listed here: [" ", ".", ",", ";", "-", "/", "'", "st", "nd", "rd", "th", "at", "on", "and", "ad", "AD", "of"]
[4]:
clean_date(df, 'date', output_format='YYYY-MM-DD')
Dates Cleaning Report: 33 values cleaned (86.84%) 2 values unable to be parsed (5.26%), set to NaN Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[5]:
clean_date(df, 'date', output_format='yyyy.MM.dd AD at HH:mm:ss Z')
[6]:
clean_date(df, 'date', output_format='yyyy.MM.dd AD at HH:mm:ss z')
[7]:
clean_date(df, 'date', output_format='EEE, d MMM yyyy HH:mm:ss Z')
input_timezone
output_timezone
This section demostrates valide origin timezones and target timezones. input_timezone in our function means user-specified timezone for input data. output_timezone in our function means user-specified timezone for output data.
In our function, the range of input_timezone and output_timezone includes two parts: * All timezones in pytz.all_timezones * Abbreviation for common-used timezones
pytz.all_timezones
Timezone Name
UTC offset
UTC
0
ACT
-5
ADT
-3
AEDT
11
AEST
10
AKDT
-8
AKST
-9
AMST
AMT
-4
ART
ArabiaST
3
AtlanticST
AWST
8
AZOST
AZOT
BOT
BRST
-2
BRT
BST
1
BTT
6
CAT
2
CDT
CEST
CET
CHOST
9
CHOT
CHUT
CKT
-10
CLST
CLT
CentralST
-6
ChinaST
CubaST
ChST
EASST
EAST
EAT
ECT
EDT
EEST
EET
EST
FKST
GFT
GILT
12
GST
4
HKT
HST
ICT
7
IDT
IrishST
IsraelST
JST
KOST
LINT
MDT
MHT
MSK
MST
-7
MYT
NUT
-11
NZDT
13
NZST
PET
PGT
PHT
PONT
PST
SAST
SBT
SGT
SRT
SST
TAHT
TLT
TVT
ULAST
ULAT
UYST
UYT
VET
WAST
WAT
WEST
WET
WIB
WIT
WITA
input_timezone: PDT
output_timezone: ChinaST
output_format: yyyy.MM.dd AD at HH:mm:ss Z
[8]:
clean_date(df, 'date', input_timezone='PDT', output_timezone='ChinaST',output_format='yyyy.MM.dd AD at HH:mm:ss Z')
input_timezone: EST
output_timezone: PDT
[9]:
clean_date(df, 'date', input_timezone='EST', output_timezone='PDT',output_format='yyyy.MM.dd AD at HH:mm:ss Z')
input_timezone: PST
output_timezone: GMT
[10]:
clean_date(df, 'date', input_timezone='PST', output_timezone='GMT',output_format='yyyy.MM.dd AD at HH:mm:ss Z')
fix_missing
This section demostrates valid options of fix_missing parameter. The user can specify the way of fixing empty value from value set: {‘empty’, ‘current’, ‘minimum’}. The default fixed_empty is 'minimum'
'minimum'
Min value of year: 2000
Min value of month: 1
Min value of day: 1
[12]:
clean_date(df, 'date', fix_missing='minimum')
Just left the missing component as it is
[13]:
clean_date(df, 'date', fix_missing='empty')
For hours, minutes and seconds, just fill them with nearest time value
For years, months and days, fill it with the nearest date
[14]:
clean_date(df, 'date', fix_missing='current')
infer_day_first
If infer_day_first = True, the clean_date funtion infers the day number in an ambiguous string column automatically.
infer_day_first = True
clean_date
If infer_day_first = False, do nothing.
infer_day_first = False
By default, infer_day_first = True
import pandas as pd import numpy as np df = pd.DataFrame( {'date': ['12-01-06', '12-04-06', '12-05-06', '20-12-06', '21-12-06', '29-12-06']}) from dataprep.clean import clean_date clean_date(df, 'date')
Dates Cleaning Report: 6 values cleaned (100.0%) Result contains 6 (100.0%) values in the correct format and 0 null values (0.0%)
df = pd.DataFrame( {'date': ['12-01-06', '12-04-06', '12-05-06', '12-20-06', '12-21-06', '12-29-06']}) from dataprep.clean import clean_date clean_date(df, 'date')
[3]:
from dataprep.clean import clean_date clean_date(df, 'date', infer_day_first = False)
report
If report = True, a report contains:
report = True
How many values are cleaned
How many values are unable to cleaned (due to their invalid format)
How many values are with correct format
How many null values are there
will be generated.
If report = False, the report won’t be generated.
report = False
[16]:
clean_date(df, 'date', report=True)
[17]:
clean_date(df, 'date', report=False)
validate_date() returns True when the input has a valid date format. Otherwise the function returns False.
True
False
[20]:
from dataprep.clean import validate_date print(validate_date("Novvvvvvvvember 5, 1994, 8:15:30 am EST hahaha")) print(validate_date("1994, 8:15:30")) print(validate_date("Hello."))
False True False
[21]:
df = pd.DataFrame({"messy_date": ["T, Ap 12, 1952 AD 3:30:42p", "5:50 AM on June 13, 1990", "3rd of May 2001", "55/23/2014", "10pm", "10p", "2003-Sep-25", "Sepppppp", "23 4 1962", "2003 10:36:28 BRST 25 Sep Thu", "hello", np.nan, "NULL"] }) df["valid"] = validate_date(df["messy_date"]) df
[ ]: