clean_date(): Clean and validate date strings

Introduction

The function clean_date() cleans a column containing date strings, and standardizes them in a desired format. The function validate_date() validates either a single date string or a column of date strings, returning “cleaned” standing for at the first stage the value is valid, and “unknown” otherwise. Note that the first stage means the initial format is correct. However, if the scenario like minite equals to 70 occurs, the function cannot immediately recognize this kind of error at first stage. They will be recognized during running of clean_date(). Also, this kind of error will not be cleaned by our function.

Currently, many flexible date format like the following format are supported as valid input:

  • 1996.07.10 AD at 15:08:56 PDT

  • Tuesday, April 12, 1952 AD 3:30:42pm PST

  • 2003 Sep 25

  • 12:00am

  • Thu Sep 25 10:36:28 2003

Various delimiters between the digits are also allowed: [" ", ".", ",", ";", "-", "/", "'", "st", "nd", "rd", "th", "at", "on", "and", "ad", "AD", "of"]

Phone numbers can be converted to the following formats via the target_format parameter. Also, users can specify many flexible target format like these:

  • YYYY-MM-DD

  • yyyy.MM.dd AD at HH:mm:ss Z

  • EEE, d MMM yyyy HH:mm:ss Z

Users also can specify origin_timezone and target_timezone like PDT,GMT etc. When formatting the date, timezone will be transferred from origin timezone to target timezone.

Invalid parsing is handled with the fix_empty parameter:

  • auto_minimum (default):

    • For hours, minutes and seconds, just fill them with zeros

    • For years, months and days, fill it with the minimum value

  • empty: just left the missing component as it is

  • auto_nearest:

    • For hours, minutes and seconds, fill it with the nearest value

    • For years, months and days, fill it with the nearest value

After cleaning, a report is printed that provides the following information:

  • How many values were cleaned (the value must be transformed)

  • How many values could not be cleaned

  • And the data summary: how many values are in the correct format, and how many values are null

The following sections demonstrate the functionality of clean_date() and validate_date().

An example dirty dataset

[1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({"date":
                   ['1996.07.10 AD at 15:08:56 PDT',
                    'Thu Sep 25 10:36:28 2003',
                    'Thu Sep 25 10:36:28 BRST 2003',
                    '2003 10:36:28 BRST 25 Sep Thu',
                    'Thu Sep 25 10:36:28 2003',
                    'Thu 10:36:28',
                    'Thu 10:36',
                    '10:36',
                    'Thu Sep 25 2003',
                    'Sep 25 2003',
                    'Sep 2003',
                    'Sep',
                    '2003',
                    '2003-09-25',
                    '2003-Sep-25',
                    '25-Sep-2003',
                    'Sep-25-2003',
                    '09-25-2003',
                    '10-09-2003',
                    '10-09-03',
                    '2003.Sep.25',
                    '2003/09/25',
                    '2003 Sep 25',
                    '2003 09 25',
                    '10pm',
                    '12:00am',
                    'Sep 03',
                    'Sep of 03',
                    'Wed, July 10, 96',
                    '1996.07.10 AD at 15:08:56 PDT',
                    'Tuesday, April 12, 1952 AD 3:30:42pm PST',
                    'November 5, 1994, 8:15:30 am EST',
                    '3rd of May 2001',
                    '5:50 AM on June 13, 1990',
                    'NULL',
                    'nan',
                    'I\'m a little cat',
                    'This is Sep.']})
df
[1]:
date
0 1996.07.10 AD at 15:08:56 PDT
1 Thu Sep 25 10:36:28 2003
2 Thu Sep 25 10:36:28 BRST 2003
3 2003 10:36:28 BRST 25 Sep Thu
4 Thu Sep 25 10:36:28 2003
5 Thu 10:36:28
6 Thu 10:36
7 10:36
8 Thu Sep 25 2003
9 Sep 25 2003
10 Sep 2003
11 Sep
12 2003
13 2003-09-25
14 2003-Sep-25
15 25-Sep-2003
16 Sep-25-2003
17 09-25-2003
18 10-09-2003
19 10-09-03
20 2003.Sep.25
21 2003/09/25
22 2003 Sep 25
23 2003 09 25
24 10pm
25 12:00am
26 Sep 03
27 Sep of 03
28 Wed, July 10, 96
29 1996.07.10 AD at 15:08:56 PDT
30 Tuesday, April 12, 1952 AD 3:30:42pm PST
31 November 5, 1994, 8:15:30 am EST
32 3rd of May 2001
33 5:50 AM on June 13, 1990
34 NULL
35 nan
36 I'm a little cat
37 This is Sep.

1. Default clean_date()

By default, the target_format parameter is set to “YYYY-MM-DD hh:mm:ss”, the origin_timezone parameter is set to “UTC”, the fix_empty parameter is set to “auto_minimum” and the show_report parameter is set to “True”. And we don’t specify the target_timezone parameter.

[2]:
from dataprep.clean import clean_date
clean_date(df, 'date')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[2]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
1 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
2 Thu Sep 25 10:36:28 BRST 2003 2003-09-25 10:36:28
3 2003 10:36:28 BRST 25 Sep Thu 2003-09-25 10:36:28
4 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
5 Thu 10:36:28 2000-01-01 10:36:28
6 Thu 10:36 2000-01-01 10:36:00
7 10:36 2000-01-01 10:36:00
8 Thu Sep 25 2003 2003-09-25 00:00:00
9 Sep 25 2003 2003-09-25 00:00:00
10 Sep 2003 2003-09-01 00:00:00
11 Sep 2000-09-01 00:00:00
12 2003 2003-01-01 00:00:00
13 2003-09-25 2003-09-25 00:00:00
14 2003-Sep-25 2003-09-25 00:00:00
15 25-Sep-2003 2003-09-25 00:00:00
16 Sep-25-2003 2003-09-25 00:00:00
17 09-25-2003 2003-09-25 00:00:00
18 10-09-2003 2003-10-09 00:00:00
19 10-09-03 2003-10-09 00:00:00
20 2003.Sep.25 2003-09-25 00:00:00
21 2003/09/25 2003-09-25 00:00:00
22 2003 Sep 25 2003-09-25 00:00:00
23 2003 09 25 2003-09-25 00:00:00
24 10pm 2000-01-01 22:00:00
25 12:00am 2000-01-01 12:00:00
26 Sep 03 2003-09-01 00:00:00
27 Sep of 03 2003-09-01 00:00:00
28 Wed, July 10, 96 2096-07-10 00:00:00
29 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952-04-12 15:30:42
31 November 5, 1994, 8:15:30 am EST 1994-11-05 08:15:30
32 3rd of May 2001 2001-05-03 00:00:00
33 5:50 AM on June 13, 1990 1990-06-13 05:50:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

2. output_format parameter

This section demonstrate some valid target format. In fact, our function can support very flexible target formats, such as YYYY-MM-DD and yyyy.MM.dd AD at HH:mm:ss z. Users just need to specify tokens standing for year, month, day, hour, minute and second with valid separators.

The tokens we support are listed in the following table.

Component

Token

Year

"yyyy"(2015), "yy"(15), "YYYY"(2015), "YY"(15), "Y"(15), "y"(15)

Month

"MM"(01), "M"(1), "MMM"(Jan.), "MMMMM"(January)

Day

"dd"(05), "d"(5), "DD"(05), "D"(5)

Hour

"hh"(06), "h"(6), "HH"(06), "H"(6)

Minute

"mm"(08), "m"(8)

Second

"ss"(09), "s"(9), "SS"(09), "S"(9)

Weekday

"eee"(Mon.), "EEE"(Mon.), "eeeee"(Monday), "EEEEE"(Monday)

Timezone

"Z"(UTC+00:00),'z'(GMT)

The separators we support are listed here: [" ", ".", ",", ";", "-", "/", "'", "st", "nd", "rd", "th", "at", "on", "and", "ad", "AD", "of"]

Example format: YYYY-MM-DD

[4]:
clean_date(df, 'date', output_format='YYYY-MM-DD')
Dates Cleaning Report:
        33 values cleaned (86.84%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[4]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996-07-10
1 Thu Sep 25 10:36:28 2003 2003-09-25
2 Thu Sep 25 10:36:28 BRST 2003 2003-09-25
3 2003 10:36:28 BRST 25 Sep Thu 2003-09-25
4 Thu Sep 25 10:36:28 2003 2003-09-25
5 Thu 10:36:28 2000-01-01
6 Thu 10:36 2000-01-01
7 10:36 2000-01-01
8 Thu Sep 25 2003 2003-09-25
9 Sep 25 2003 2003-09-25
10 Sep 2003 2003-09-01
11 Sep 2000-09-01
12 2003 2003-01-01
13 2003-09-25 2003-09-25
14 2003-Sep-25 2003-09-25
15 25-Sep-2003 2003-09-25
16 Sep-25-2003 2003-09-25
17 09-25-2003 2003-09-25
18 10-09-2003 2003-10-09
19 10-09-03 2003-10-09
20 2003.Sep.25 2003-09-25
21 2003/09/25 2003-09-25
22 2003 Sep 25 2003-09-25
23 2003 09 25 2003-09-25
24 10pm 2000-01-01
25 12:00am 2000-01-01
26 Sep 03 2003-09-01
27 Sep of 03 2003-09-01
28 Wed, July 10, 96 2096-07-10
29 1996.07.10 AD at 15:08:56 PDT 1996-07-10
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952-04-12
31 November 5, 1994, 8:15:30 am EST 1994-11-05
32 3rd of May 2001 2001-05-03
33 5:50 AM on June 13, 1990 1990-06-13
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

Example format: yyyy.MM.dd AD at HH:mm:ss Z

[5]:
clean_date(df, 'date', output_format='yyyy.MM.dd AD at HH:mm:ss Z')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[5]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996.07.10 AD at 15:08:56 UTC+00:00
1 Thu Sep 25 10:36:28 2003 2003.09.25 AD at 10:36:28 UTC+00:00
2 Thu Sep 25 10:36:28 BRST 2003 2003.09.25 AD at 10:36:28 UTC+00:00
3 2003 10:36:28 BRST 25 Sep Thu 2003.09.25 AD at 10:36:28 UTC+00:00
4 Thu Sep 25 10:36:28 2003 2003.09.25 AD at 10:36:28 UTC+00:00
5 Thu 10:36:28 2000.01.01 AD at 10:36:28 UTC+00:00
6 Thu 10:36 2000.01.01 AD at 10:36:00 UTC+00:00
7 10:36 2000.01.01 AD at 10:36:00 UTC+00:00
8 Thu Sep 25 2003 2003.09.25 AD at 00:00:00 UTC+00:00
9 Sep 25 2003 2003.09.25 AD at 00:00:00 UTC+00:00
10 Sep 2003 2003.09.01 AD at 00:00:00 UTC+00:00
11 Sep 2000.09.01 AD at 00:00:00 UTC+00:00
12 2003 2003.01.01 AD at 00:00:00 UTC+00:00
13 2003-09-25 2003.09.25 AD at 00:00:00 UTC+00:00
14 2003-Sep-25 2003.09.25 AD at 00:00:00 UTC+00:00
15 25-Sep-2003 2003.09.25 AD at 00:00:00 UTC+00:00
16 Sep-25-2003 2003.09.25 AD at 00:00:00 UTC+00:00
17 09-25-2003 2003.09.25 AD at 00:00:00 UTC+00:00
18 10-09-2003 2003.10.09 AD at 00:00:00 UTC+00:00
19 10-09-03 2003.10.09 AD at 00:00:00 UTC+00:00
20 2003.Sep.25 2003.09.25 AD at 00:00:00 UTC+00:00
21 2003/09/25 2003.09.25 AD at 00:00:00 UTC+00:00
22 2003 Sep 25 2003.09.25 AD at 00:00:00 UTC+00:00
23 2003 09 25 2003.09.25 AD at 00:00:00 UTC+00:00
24 10pm 2000.01.01 AD at 22:00:00 UTC+00:00
25 12:00am 2000.01.01 AD at 12:00:00 UTC+00:00
26 Sep 03 2003.09.01 AD at 00:00:00 UTC+00:00
27 Sep of 03 2003.09.01 AD at 00:00:00 UTC+00:00
28 Wed, July 10, 96 2096.07.10 AD at 00:00:00 UTC+00:00
29 1996.07.10 AD at 15:08:56 PDT 1996.07.10 AD at 15:08:56 UTC+00:00
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952.04.12 AD at 15:30:42 UTC+00:00
31 November 5, 1994, 8:15:30 am EST 1994.11.05 AD at 08:15:30 UTC+00:00
32 3rd of May 2001 2001.05.03 AD at 00:00:00 UTC+00:00
33 5:50 AM on June 13, 1990 1990.06.13 AD at 05:50:00 UTC+00:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

Example format: yyyy.MM.dd AD at HH:mm:ss z

[6]:
clean_date(df, 'date', output_format='yyyy.MM.dd AD at HH:mm:ss z')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[6]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996.07.10 AD at 15:08:56 UTC
1 Thu Sep 25 10:36:28 2003 2003.09.25 AD at 10:36:28 UTC
2 Thu Sep 25 10:36:28 BRST 2003 2003.09.25 AD at 10:36:28 UTC
3 2003 10:36:28 BRST 25 Sep Thu 2003.09.25 AD at 10:36:28 UTC
4 Thu Sep 25 10:36:28 2003 2003.09.25 AD at 10:36:28 UTC
5 Thu 10:36:28 2000.01.01 AD at 10:36:28 UTC
6 Thu 10:36 2000.01.01 AD at 10:36:00 UTC
7 10:36 2000.01.01 AD at 10:36:00 UTC
8 Thu Sep 25 2003 2003.09.25 AD at 00:00:00 UTC
9 Sep 25 2003 2003.09.25 AD at 00:00:00 UTC
10 Sep 2003 2003.09.01 AD at 00:00:00 UTC
11 Sep 2000.09.01 AD at 00:00:00 UTC
12 2003 2003.01.01 AD at 00:00:00 UTC
13 2003-09-25 2003.09.25 AD at 00:00:00 UTC
14 2003-Sep-25 2003.09.25 AD at 00:00:00 UTC
15 25-Sep-2003 2003.09.25 AD at 00:00:00 UTC
16 Sep-25-2003 2003.09.25 AD at 00:00:00 UTC
17 09-25-2003 2003.09.25 AD at 00:00:00 UTC
18 10-09-2003 2003.10.09 AD at 00:00:00 UTC
19 10-09-03 2003.10.09 AD at 00:00:00 UTC
20 2003.Sep.25 2003.09.25 AD at 00:00:00 UTC
21 2003/09/25 2003.09.25 AD at 00:00:00 UTC
22 2003 Sep 25 2003.09.25 AD at 00:00:00 UTC
23 2003 09 25 2003.09.25 AD at 00:00:00 UTC
24 10pm 2000.01.01 AD at 22:00:00 UTC
25 12:00am 2000.01.01 AD at 12:00:00 UTC
26 Sep 03 2003.09.01 AD at 00:00:00 UTC
27 Sep of 03 2003.09.01 AD at 00:00:00 UTC
28 Wed, July 10, 96 2096.07.10 AD at 00:00:00 UTC
29 1996.07.10 AD at 15:08:56 PDT 1996.07.10 AD at 15:08:56 UTC
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952.04.12 AD at 15:30:42 UTC
31 November 5, 1994, 8:15:30 am EST 1994.11.05 AD at 08:15:30 UTC
32 3rd of May 2001 2001.05.03 AD at 00:00:00 UTC
33 5:50 AM on June 13, 1990 1990.06.13 AD at 05:50:00 UTC
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

Example format: EEE, d MMM yyyy HH:mm:ss Z

[7]:
clean_date(df, 'date', output_format='EEE, d MMM yyyy HH:mm:ss Z')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[7]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT Wed, 10 Jul 1996 15:08:56 UTC+00:00
1 Thu Sep 25 10:36:28 2003 Thu, 25 Sep 2003 10:36:28 UTC+00:00
2 Thu Sep 25 10:36:28 BRST 2003 Thu, 25 Sep 2003 10:36:28 UTC+00:00
3 2003 10:36:28 BRST 25 Sep Thu Thu, 25 Sep 2003 10:36:28 UTC+00:00
4 Thu Sep 25 10:36:28 2003 Thu, 25 Sep 2003 10:36:28 UTC+00:00
5 Thu 10:36:28 Thu, 1 Jan 2000 10:36:28 UTC+00:00
6 Thu 10:36 Thu, 1 Jan 2000 10:36:00 UTC+00:00
7 10:36 Sat, 1 Jan 2000 10:36:00 UTC+00:00
8 Thu Sep 25 2003 Thu, 25 Sep 2003 00:00:00 UTC+00:00
9 Sep 25 2003 Thu, 25 Sep 2003 00:00:00 UTC+00:00
10 Sep 2003 Mon, 1 Sep 2003 00:00:00 UTC+00:00
11 Sep Fri, 1 Sep 2000 00:00:00 UTC+00:00
12 2003 Wed, 1 Jan 2003 00:00:00 UTC+00:00
13 2003-09-25 Thu, 25 Sep 2003 00:00:00 UTC+00:00
14 2003-Sep-25 Thu, 25 Sep 2003 00:00:00 UTC+00:00
15 25-Sep-2003 Thu, 25 Sep 2003 00:00:00 UTC+00:00
16 Sep-25-2003 Thu, 25 Sep 2003 00:00:00 UTC+00:00
17 09-25-2003 Thu, 25 Sep 2003 00:00:00 UTC+00:00
18 10-09-2003 Thu, 9 Oct 2003 00:00:00 UTC+00:00
19 10-09-03 Thu, 9 Oct 2003 00:00:00 UTC+00:00
20 2003.Sep.25 Thu, 25 Sep 2003 00:00:00 UTC+00:00
21 2003/09/25 Thu, 25 Sep 2003 00:00:00 UTC+00:00
22 2003 Sep 25 Thu, 25 Sep 2003 00:00:00 UTC+00:00
23 2003 09 25 Thu, 25 Sep 2003 00:00:00 UTC+00:00
24 10pm Sat, 1 Jan 2000 22:00:00 UTC+00:00
25 12:00am Sat, 1 Jan 2000 12:00:00 UTC+00:00
26 Sep 03 Mon, 1 Sep 2003 00:00:00 UTC+00:00
27 Sep of 03 Mon, 1 Sep 2003 00:00:00 UTC+00:00
28 Wed, July 10, 96 Wed, 10 Jul 2096 00:00:00 UTC+00:00
29 1996.07.10 AD at 15:08:56 PDT Wed, 10 Jul 1996 15:08:56 UTC+00:00
30 Tuesday, April 12, 1952 AD 3:30:42pm PST Tue, 12 Apr 1952 15:30:42 UTC+00:00
31 November 5, 1994, 8:15:30 am EST Sat, 5 Nov 1994 08:15:30 UTC+00:00
32 3rd of May 2001 Thu, 3 May 2001 00:00:00 UTC+00:00
33 5:50 AM on June 13, 1990 Wed, 13 Jun 1990 05:50:00 UTC+00:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

3. input_timezone and output_timezone parameter

This section demostrates valide origin timezones and target timezones. input_timezone in our function means user-specified timezone for input data. output_timezone in our function means user-specified timezone for output data.

In our function, the range of input_timezone and output_timezone includes two parts: * All timezones in pytz.all_timezones * Abbreviation for common-used timezones

Timezone Name

UTC offset

UTC

0

ACT

-5

ADT

-3

AEDT

11

AEST

10

AKDT

-8

AKST

-9

AMST

-3

AMT

-4

ART

-3

ArabiaST

3

AtlanticST

-4

AWST

8

AZOST

0

AZOT

0

BOT

-4

BRST

-2

BRT

-3

BST

1

BTT

6

CAT

2

CDT

-5

CEST

2

CET

1

CHOST

9

CHOT

8

CHUT

10

CKT

-10

CLST

-3

CLT

-4

CentralST

-6

ChinaST

8

CubaST

-5

ChST

10

EASST

-5

EAST

-6

EAT

3

ECT

-5

EDT

-4

EEST

3

EET

2

EST

-5

FKST

-3

GFT

-3

GILT

12

GMT

0

GST

4

HKT

8

HST

-10

ICT

7

IDT

3

IrishST

1

IsraelST

2

JST

9

KOST

11

LINT

4

MDT

-6

MHT

12

MSK

3

MST

-7

MYT

8

NUT

-11

NZDT

13

NZST

12

PDT

-7

PET

-5

PGT

10

PHT

8

PONT

11

PST

-8

SAST

2

SBT

11

SGT

8

SRT

-3

SST

-11

TAHT

-10

TLT

9

TVT

12

ULAST

9

ULAT

8

UYST

-2

UYT

-3

VET

-4

WAST

2

WAT

1

WEST

1

WET

0

WIB

7

WIT

9

WITA

8

Example format:

input_timezone: PDT

output_timezone: ChinaST

output_format: yyyy.MM.dd AD at HH:mm:ss Z

[8]:
clean_date(df, 'date', input_timezone='PDT', output_timezone='ChinaST',output_format='yyyy.MM.dd AD at HH:mm:ss Z')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[8]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996.07.11 AD at 06:08:56 UTC+08:00
1 Thu Sep 25 10:36:28 2003 2003.09.26 AD at 01:36:28 UTC+08:00
2 Thu Sep 25 10:36:28 BRST 2003 2003.09.26 AD at 01:36:28 UTC+08:00
3 2003 10:36:28 BRST 25 Sep Thu 2003.09.26 AD at 01:36:28 UTC+08:00
4 Thu Sep 25 10:36:28 2003 2003.09.26 AD at 01:36:28 UTC+08:00
5 Thu 10:36:28 2000.01.02 AD at 01:36:28 UTC+08:00
6 Thu 10:36 2000.01.02 AD at 01:36:00 UTC+08:00
7 10:36 2000.01.02 AD at 01:36:00 UTC+08:00
8 Thu Sep 25 2003 2003.09.25 AD at 15:00:00 UTC+08:00
9 Sep 25 2003 2003.09.25 AD at 15:00:00 UTC+08:00
10 Sep 2003 2003.09.01 AD at 15:00:00 UTC+08:00
11 Sep 2000.09.01 AD at 15:00:00 UTC+08:00
12 2003 2003.01.01 AD at 15:00:00 UTC+08:00
13 2003-09-25 2003.09.25 AD at 15:00:00 UTC+08:00
14 2003-Sep-25 2003.09.25 AD at 15:00:00 UTC+08:00
15 25-Sep-2003 2003.09.25 AD at 15:00:00 UTC+08:00
16 Sep-25-2003 2003.09.25 AD at 15:00:00 UTC+08:00
17 09-25-2003 2003.09.25 AD at 15:00:00 UTC+08:00
18 10-09-2003 2003.10.09 AD at 15:00:00 UTC+08:00
19 10-09-03 2003.10.09 AD at 15:00:00 UTC+08:00
20 2003.Sep.25 2003.09.25 AD at 15:00:00 UTC+08:00
21 2003/09/25 2003.09.25 AD at 15:00:00 UTC+08:00
22 2003 Sep 25 2003.09.25 AD at 15:00:00 UTC+08:00
23 2003 09 25 2003.09.25 AD at 15:00:00 UTC+08:00
24 10pm 2000.01.02 AD at 13:00:00 UTC+08:00
25 12:00am 2000.01.02 AD at 03:00:00 UTC+08:00
26 Sep 03 2003.09.01 AD at 15:00:00 UTC+08:00
27 Sep of 03 2003.09.01 AD at 15:00:00 UTC+08:00
28 Wed, July 10, 96 2096.07.10 AD at 15:00:00 UTC+08:00
29 1996.07.10 AD at 15:08:56 PDT 1996.07.11 AD at 06:08:56 UTC+08:00
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952.04.13 AD at 06:30:42 UTC+08:00
31 November 5, 1994, 8:15:30 am EST 1994.11.05 AD at 23:15:30 UTC+08:00
32 3rd of May 2001 2001.05.03 AD at 15:00:00 UTC+08:00
33 5:50 AM on June 13, 1990 1990.06.13 AD at 20:50:00 UTC+08:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

Example format:

input_timezone: EST

output_timezone: PDT

output_format: yyyy.MM.dd AD at HH:mm:ss Z

[9]:
clean_date(df, 'date', input_timezone='EST', output_timezone='PDT',output_format='yyyy.MM.dd AD at HH:mm:ss Z')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[9]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996.07.12 AD at 03:08:56 UTC-07:00
1 Thu Sep 25 10:36:28 2003 2003.09.26 AD at 22:36:28 UTC-07:00
2 Thu Sep 25 10:36:28 BRST 2003 2003.09.26 AD at 22:36:28 UTC-07:00
3 2003 10:36:28 BRST 25 Sep Thu 2003.09.26 AD at 22:36:28 UTC-07:00
4 Thu Sep 25 10:36:28 2003 2003.09.26 AD at 22:36:28 UTC-07:00
5 Thu 10:36:28 2000.01.02 AD at 22:36:28 UTC-07:00
6 Thu 10:36 2000.01.02 AD at 22:36:00 UTC-07:00
7 10:36 2000.01.02 AD at 22:36:00 UTC-07:00
8 Thu Sep 25 2003 2003.09.26 AD at 12:00:00 UTC-07:00
9 Sep 25 2003 2003.09.26 AD at 12:00:00 UTC-07:00
10 Sep 2003 2003.09.02 AD at 12:00:00 UTC-07:00
11 Sep 2000.09.02 AD at 12:00:00 UTC-07:00
12 2003 2003.01.02 AD at 12:00:00 UTC-07:00
13 2003-09-25 2003.09.26 AD at 12:00:00 UTC-07:00
14 2003-Sep-25 2003.09.26 AD at 12:00:00 UTC-07:00
15 25-Sep-2003 2003.09.26 AD at 12:00:00 UTC-07:00
16 Sep-25-2003 2003.09.26 AD at 12:00:00 UTC-07:00
17 09-25-2003 2003.09.26 AD at 12:00:00 UTC-07:00
18 10-09-2003 2003.10.10 AD at 12:00:00 UTC-07:00
19 10-09-03 2003.10.10 AD at 12:00:00 UTC-07:00
20 2003.Sep.25 2003.09.26 AD at 12:00:00 UTC-07:00
21 2003/09/25 2003.09.26 AD at 12:00:00 UTC-07:00
22 2003 Sep 25 2003.09.26 AD at 12:00:00 UTC-07:00
23 2003 09 25 2003.09.26 AD at 12:00:00 UTC-07:00
24 10pm 2000.01.03 AD at 10:00:00 UTC-07:00
25 12:00am 2000.01.03 AD at 00:00:00 UTC-07:00
26 Sep 03 2003.09.02 AD at 12:00:00 UTC-07:00
27 Sep of 03 2003.09.02 AD at 12:00:00 UTC-07:00
28 Wed, July 10, 96 2096.07.11 AD at 12:00:00 UTC-07:00
29 1996.07.10 AD at 15:08:56 PDT 1996.07.12 AD at 03:08:56 UTC-07:00
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952.04.14 AD at 03:30:42 UTC-07:00
31 November 5, 1994, 8:15:30 am EST 1994.11.06 AD at 20:15:30 UTC-07:00
32 3rd of May 2001 2001.05.04 AD at 12:00:00 UTC-07:00
33 5:50 AM on June 13, 1990 1990.06.14 AD at 17:50:00 UTC-07:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

Example format:

input_timezone: PST

output_timezone: GMT

output_format: yyyy.MM.dd AD at HH:mm:ss Z

[10]:
clean_date(df, 'date', input_timezone='PST', output_timezone='GMT',output_format='yyyy.MM.dd AD at HH:mm:ss Z')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[10]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996.07.10 AD at 23:08:56 UTC+00:00
1 Thu Sep 25 10:36:28 2003 2003.09.25 AD at 18:36:28 UTC+00:00
2 Thu Sep 25 10:36:28 BRST 2003 2003.09.25 AD at 18:36:28 UTC+00:00
3 2003 10:36:28 BRST 25 Sep Thu 2003.09.25 AD at 18:36:28 UTC+00:00
4 Thu Sep 25 10:36:28 2003 2003.09.25 AD at 18:36:28 UTC+00:00
5 Thu 10:36:28 2000.01.01 AD at 18:36:28 UTC+00:00
6 Thu 10:36 2000.01.01 AD at 18:36:00 UTC+00:00
7 10:36 2000.01.01 AD at 18:36:00 UTC+00:00
8 Thu Sep 25 2003 2003.09.25 AD at 08:00:00 UTC+00:00
9 Sep 25 2003 2003.09.25 AD at 08:00:00 UTC+00:00
10 Sep 2003 2003.09.01 AD at 08:00:00 UTC+00:00
11 Sep 2000.09.01 AD at 08:00:00 UTC+00:00
12 2003 2003.01.01 AD at 08:00:00 UTC+00:00
13 2003-09-25 2003.09.25 AD at 08:00:00 UTC+00:00
14 2003-Sep-25 2003.09.25 AD at 08:00:00 UTC+00:00
15 25-Sep-2003 2003.09.25 AD at 08:00:00 UTC+00:00
16 Sep-25-2003 2003.09.25 AD at 08:00:00 UTC+00:00
17 09-25-2003 2003.09.25 AD at 08:00:00 UTC+00:00
18 10-09-2003 2003.10.09 AD at 08:00:00 UTC+00:00
19 10-09-03 2003.10.09 AD at 08:00:00 UTC+00:00
20 2003.Sep.25 2003.09.25 AD at 08:00:00 UTC+00:00
21 2003/09/25 2003.09.25 AD at 08:00:00 UTC+00:00
22 2003 Sep 25 2003.09.25 AD at 08:00:00 UTC+00:00
23 2003 09 25 2003.09.25 AD at 08:00:00 UTC+00:00
24 10pm 2000.01.02 AD at 06:00:00 UTC+00:00
25 12:00am 2000.01.01 AD at 20:00:00 UTC+00:00
26 Sep 03 2003.09.01 AD at 08:00:00 UTC+00:00
27 Sep of 03 2003.09.01 AD at 08:00:00 UTC+00:00
28 Wed, July 10, 96 2096.07.10 AD at 08:00:00 UTC+00:00
29 1996.07.10 AD at 15:08:56 PDT 1996.07.10 AD at 23:08:56 UTC+00:00
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952.04.12 AD at 23:30:42 UTC+00:00
31 November 5, 1994, 8:15:30 am EST 1994.11.05 AD at 16:15:30 UTC+00:00
32 3rd of May 2001 2001.05.03 AD at 08:00:00 UTC+00:00
33 5:50 AM on June 13, 1990 1990.06.13 AD at 13:50:00 UTC+00:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

4. fix_missing parameter

This section demostrates valid options of fix_missing parameter. The user can specify the way of fixing empty value from value set: {‘empty’, ‘current’, ‘minimum’}. The default fixed_empty is 'minimum'

minimum

  • For hours, minutes and seconds, just fill them with zeros

  • For years, months and days, fill it with the minimum value

    • Min value of year: 2000

    • Min value of month: 1

    • Min value of day: 1

[12]:
clean_date(df, 'date', fix_missing='minimum')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[12]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
1 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
2 Thu Sep 25 10:36:28 BRST 2003 2003-09-25 10:36:28
3 2003 10:36:28 BRST 25 Sep Thu 2003-09-25 10:36:28
4 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
5 Thu 10:36:28 2000-01-01 10:36:28
6 Thu 10:36 2000-01-01 10:36:00
7 10:36 2000-01-01 10:36:00
8 Thu Sep 25 2003 2003-09-25 00:00:00
9 Sep 25 2003 2003-09-25 00:00:00
10 Sep 2003 2003-09-01 00:00:00
11 Sep 2000-09-01 00:00:00
12 2003 2003-01-01 00:00:00
13 2003-09-25 2003-09-25 00:00:00
14 2003-Sep-25 2003-09-25 00:00:00
15 25-Sep-2003 2003-09-25 00:00:00
16 Sep-25-2003 2003-09-25 00:00:00
17 09-25-2003 2003-09-25 00:00:00
18 10-09-2003 2003-10-09 00:00:00
19 10-09-03 2003-10-09 00:00:00
20 2003.Sep.25 2003-09-25 00:00:00
21 2003/09/25 2003-09-25 00:00:00
22 2003 Sep 25 2003-09-25 00:00:00
23 2003 09 25 2003-09-25 00:00:00
24 10pm 2000-01-01 22:00:00
25 12:00am 2000-01-01 12:00:00
26 Sep 03 2003-09-01 00:00:00
27 Sep of 03 2003-09-01 00:00:00
28 Wed, July 10, 96 2096-07-10 00:00:00
29 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952-04-12 15:30:42
31 November 5, 1994, 8:15:30 am EST 1994-11-05 08:15:30
32 3rd of May 2001 2001-05-03 00:00:00
33 5:50 AM on June 13, 1990 1990-06-13 05:50:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

empty

Just left the missing component as it is

[13]:
clean_date(df, 'date', fix_missing='empty')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[13]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
1 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
2 Thu Sep 25 10:36:28 BRST 2003 2003-09-25 10:36:28
3 2003 10:36:28 BRST 25 Sep Thu 2003-09-25 10:36:28
4 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
5 Thu 10:36:28 ---------- 10:36:28
6 Thu 10:36 ---------- 10:36:--
7 10:36 ---------- 10:36:--
8 Thu Sep 25 2003 2003-09-25 --:--:--
9 Sep 25 2003 2003-09-25 --:--:--
10 Sep 2003 2003-09--- --:--:--
11 Sep -----09--- --:--:--
12 2003 2003------ --:--:--
13 2003-09-25 2003-09-25 --:--:--
14 2003-Sep-25 2003-09-25 --:--:--
15 25-Sep-2003 2003-09-25 --:--:--
16 Sep-25-2003 2003-09-25 --:--:--
17 09-25-2003 2003-09-25 --:--:--
18 10-09-2003 2003-10-09 --:--:--
19 10-09-03 2003-10-09 --:--:--
20 2003.Sep.25 2003-09-25 --:--:--
21 2003/09/25 2003-09-25 --:--:--
22 2003 Sep 25 2003-09-25 --:--:--
23 2003 09 25 2003-09-25 --:--:--
24 10pm ---------- 22:--:--
25 12:00am ---------- 12:00:--
26 Sep 03 2003-09--- --:--:--
27 Sep of 03 2003-09--- --:--:--
28 Wed, July 10, 96 2096-07-10 --:--:--
29 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952-04-12 15:30:42
31 November 5, 1994, 8:15:30 am EST 1994-11-05 08:15:30
32 3rd of May 2001 2001-05-03 --:--:--
33 5:50 AM on June 13, 1990 1990-06-13 05:50:--
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

current

  • For hours, minutes and seconds, just fill them with nearest time value

  • For years, months and days, fill it with the nearest date

[14]:
clean_date(df, 'date', fix_missing='current')
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[14]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
1 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
2 Thu Sep 25 10:36:28 BRST 2003 2003-09-25 10:36:28
3 2003 10:36:28 BRST 25 Sep Thu 2003-09-25 10:36:28
4 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
5 Thu 10:36:28 2021-05-13 10:36:28
6 Thu 10:36 2021-05-13 10:36:44
7 10:36 2021-05-13 10:36:44
8 Thu Sep 25 2003 2003-09-25 22:05:44
9 Sep 25 2003 2003-09-25 22:05:44
10 Sep 2003 2003-09-13 22:05:44
11 Sep 2021-09-13 22:05:44
12 2003 2003-05-13 22:05:44
13 2003-09-25 2003-09-25 22:05:44
14 2003-Sep-25 2003-09-25 22:05:44
15 25-Sep-2003 2003-09-25 22:05:44
16 Sep-25-2003 2003-09-25 22:05:44
17 09-25-2003 2003-09-25 22:05:44
18 10-09-2003 2003-10-09 22:05:44
19 10-09-03 2003-10-09 22:05:44
20 2003.Sep.25 2003-09-25 22:05:44
21 2003/09/25 2003-09-25 22:05:44
22 2003 Sep 25 2003-09-25 22:05:44
23 2003 09 25 2003-09-25 22:05:44
24 10pm 2021-05-13 22:05:44
25 12:00am 2021-05-13 12:00:44
26 Sep 03 2003-09-13 22:05:44
27 Sep of 03 2003-09-13 22:05:44
28 Wed, July 10, 96 2096-07-10 22:05:44
29 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952-04-12 15:30:42
31 November 5, 1994, 8:15:30 am EST 1994-11-05 08:15:30
32 3rd of May 2001 2001-05-03 22:05:44
33 5:50 AM on June 13, 1990 1990-06-13 05:50:44
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

5. infer_day_first parameter

If infer_day_first = True, the clean_date funtion infers the day number in an ambiguous string column automatically.

If infer_day_first = False, do nothing.

By default, infer_day_first = True

[2]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'date': ['12-01-06', '12-04-06', '12-05-06',
              '20-12-06', '21-12-06', '29-12-06']})

from dataprep.clean import clean_date
clean_date(df, 'date')
Dates Cleaning Report:
        6 values cleaned (100.0%)
Result contains 6 (100.0%) values in the correct format and 0 null values (0.0%)
[2]:
date date_clean
0 12-01-06 2006-01-12 00:00:00
1 12-04-06 2006-04-12 00:00:00
2 12-05-06 2006-05-12 00:00:00
3 20-12-06 2006-12-20 00:00:00
4 21-12-06 2006-12-21 00:00:00
5 29-12-06 2006-12-29 00:00:00
[4]:
df = pd.DataFrame(
    {'date': ['12-01-06', '12-04-06', '12-05-06',
              '12-20-06', '12-21-06', '12-29-06']})

from dataprep.clean import clean_date
clean_date(df, 'date')
Dates Cleaning Report:
        6 values cleaned (100.0%)
Result contains 6 (100.0%) values in the correct format and 0 null values (0.0%)
[4]:
date date_clean
0 12-01-06 2006-12-01 00:00:00
1 12-04-06 2006-12-04 00:00:00
2 12-05-06 2006-12-05 00:00:00
3 12-20-06 2006-12-20 00:00:00
4 12-21-06 2006-12-21 00:00:00
5 12-29-06 2006-12-29 00:00:00
[3]:
from dataprep.clean import clean_date
clean_date(df, 'date', infer_day_first = False)
Dates Cleaning Report:
        6 values cleaned (100.0%)
Result contains 6 (100.0%) values in the correct format and 0 null values (0.0%)
[3]:
date date_clean
0 12-01-06 2006-12-01 00:00:00
1 12-04-06 2006-12-04 00:00:00
2 12-05-06 2006-12-05 00:00:00
3 20-12-06 2006-12-20 00:00:00
4 21-12-06 2006-12-21 00:00:00
5 29-12-06 2006-12-29 00:00:00

6. report parameter

If report = True, a report contains:

  • How many values are cleaned

  • How many values are unable to cleaned (due to their invalid format)

  • How many values are with correct format

  • How many null values are there

will be generated.

If report = False, the report won’t be generated.

[16]:
clean_date(df, 'date', report=True)
Dates Cleaning Report:
        34 values cleaned (89.47%)
        2 values unable to be parsed (5.26%), set to NaN
Result contains 34 (89.47%) values in the correct format and 4 null values (10.53%)
[16]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
1 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
2 Thu Sep 25 10:36:28 BRST 2003 2003-09-25 10:36:28
3 2003 10:36:28 BRST 25 Sep Thu 2003-09-25 10:36:28
4 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
5 Thu 10:36:28 2000-01-01 10:36:28
6 Thu 10:36 2000-01-01 10:36:00
7 10:36 2000-01-01 10:36:00
8 Thu Sep 25 2003 2003-09-25 00:00:00
9 Sep 25 2003 2003-09-25 00:00:00
10 Sep 2003 2003-09-01 00:00:00
11 Sep 2000-09-01 00:00:00
12 2003 2003-01-01 00:00:00
13 2003-09-25 2003-09-25 00:00:00
14 2003-Sep-25 2003-09-25 00:00:00
15 25-Sep-2003 2003-09-25 00:00:00
16 Sep-25-2003 2003-09-25 00:00:00
17 09-25-2003 2003-09-25 00:00:00
18 10-09-2003 2003-10-09 00:00:00
19 10-09-03 2003-10-09 00:00:00
20 2003.Sep.25 2003-09-25 00:00:00
21 2003/09/25 2003-09-25 00:00:00
22 2003 Sep 25 2003-09-25 00:00:00
23 2003 09 25 2003-09-25 00:00:00
24 10pm 2000-01-01 22:00:00
25 12:00am 2000-01-01 12:00:00
26 Sep 03 2003-09-01 00:00:00
27 Sep of 03 2003-09-01 00:00:00
28 Wed, July 10, 96 2096-07-10 00:00:00
29 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952-04-12 15:30:42
31 November 5, 1994, 8:15:30 am EST 1994-11-05 08:15:30
32 3rd of May 2001 2001-05-03 00:00:00
33 5:50 AM on June 13, 1990 1990-06-13 05:50:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN
[17]:
clean_date(df, 'date', report=False)
[17]:
date date_clean
0 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
1 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
2 Thu Sep 25 10:36:28 BRST 2003 2003-09-25 10:36:28
3 2003 10:36:28 BRST 25 Sep Thu 2003-09-25 10:36:28
4 Thu Sep 25 10:36:28 2003 2003-09-25 10:36:28
5 Thu 10:36:28 2000-01-01 10:36:28
6 Thu 10:36 2000-01-01 10:36:00
7 10:36 2000-01-01 10:36:00
8 Thu Sep 25 2003 2003-09-25 00:00:00
9 Sep 25 2003 2003-09-25 00:00:00
10 Sep 2003 2003-09-01 00:00:00
11 Sep 2000-09-01 00:00:00
12 2003 2003-01-01 00:00:00
13 2003-09-25 2003-09-25 00:00:00
14 2003-Sep-25 2003-09-25 00:00:00
15 25-Sep-2003 2003-09-25 00:00:00
16 Sep-25-2003 2003-09-25 00:00:00
17 09-25-2003 2003-09-25 00:00:00
18 10-09-2003 2003-10-09 00:00:00
19 10-09-03 2003-10-09 00:00:00
20 2003.Sep.25 2003-09-25 00:00:00
21 2003/09/25 2003-09-25 00:00:00
22 2003 Sep 25 2003-09-25 00:00:00
23 2003 09 25 2003-09-25 00:00:00
24 10pm 2000-01-01 22:00:00
25 12:00am 2000-01-01 12:00:00
26 Sep 03 2003-09-01 00:00:00
27 Sep of 03 2003-09-01 00:00:00
28 Wed, July 10, 96 2096-07-10 00:00:00
29 1996.07.10 AD at 15:08:56 PDT 1996-07-10 15:08:56
30 Tuesday, April 12, 1952 AD 3:30:42pm PST 1952-04-12 15:30:42
31 November 5, 1994, 8:15:30 am EST 1994-11-05 08:15:30
32 3rd of May 2001 2001-05-03 00:00:00
33 5:50 AM on June 13, 1990 1990-06-13 05:50:00
34 NULL NaN
35 nan NaN
36 I'm a little cat NaN
37 This is Sep. NaN

7. validate_date()

validate_date() returns True when the input has a valid date format. Otherwise the function returns False.

[20]:
from dataprep.clean import validate_date
print(validate_date("Novvvvvvvvember 5, 1994, 8:15:30 am EST hahaha"))
print(validate_date("1994, 8:15:30"))
print(validate_date("Hello."))
False
True
False
[21]:
df = pd.DataFrame({"messy_date":
                   ["T, Ap 12, 1952 AD 3:30:42p", "5:50 AM on June 13, 1990", "3rd of May 2001", "55/23/2014",
                    "10pm", "10p", "2003-Sep-25",
                    "Sepppppp", "23 4 1962", "2003 10:36:28 BRST 25 Sep Thu",
                    "hello", np.nan, "NULL"]
                  })
df["valid"] = validate_date(df["messy_date"])
df
[21]:
messy_date valid
0 T, Ap 12, 1952 AD 3:30:42p False
1 5:50 AM on June 13, 1990 True
2 3rd of May 2001 True
3 55/23/2014 True
4 10pm True
5 10p True
6 2003-Sep-25 True
7 Sepppppp False
8 23 4 1962 True
9 2003 10:36:28 BRST 25 Sep Thu True
10 hello False
11 NaN False
12 NULL False
[ ]: