How to Extract Date Information from a String with Python? (Basic NLP in 3 Minutes)

We will explore how to use re, Python's built-in regular expression (regex) module, to perform simple NLP / text mining tasks such as extracting date information from a string.

Photo by Mark Rasmuson on Unsplash

There are many conventions to write a date, e.g.:

24-11-2020
24/11/2020
24/11/20
11/24/2020
24 Nov 2020
24 November 2020
Nov 24, 2020
November 24, 2020
5-11-2020
5/11/2020
5/11/20
11/5/2020
5 Nov 2020
5 November 2020
Nov 5, 2020
24-9-2020
24/9/2020
24/9/20
9/24/2020
24 Sep 2020
24 September 2020

If any of the above dates appears in a string, how can we extract the date information from that string?

# We have a string, dateString, that looks like this:
print(dateString)

[Image: a string that contains date information]
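The contents of the string are not reproduced in the text, so here is a hypothetical stand-in for following along. It is only a sketch: the four dates and the "; " separator are assumptions, chosen so that the string covers the formats listed above.

```python
# Hypothetical sample string; the exact contents of the article's
# dateString are an assumption, built from the formats listed earlier.
dates = [(24, 11, "November"), (5, 11, "November"),
         (24, 9, "September"), (3, 1, "January")]

parts = []
for day, month, name in dates:
    abbr = name[:3]  # three-letter abbreviation, e.g. "Nov"
    parts += [
        f"{day}-{month}-2020", f"{day}/{month}/2020",
        f"{day}/{month}/20", f"{month}/{day}/2020",
        f"{day} {abbr} 2020", f"{day} {name} 2020",
        f"{abbr} {day}, 2020", f"{name} {day}, 2020",
    ]

dateString = "; ".join(parts)
print(dateString)
```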

First things first: all the code below is saved in this Google Colab notebook.

We first import Python's built-in regex module, re:

import re

Let's first extract dates in the formats 'XX-XX-XXXX' (2 digits-2 digits-4 digits) and 'XX/XX/XXXX' (2 digits/2 digits/4 digits):

re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}',dateString)

Output:

['24-11-2020', '24/11/2020', '11/24/2020']
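To see what each piece of that pattern does, it can help to write it out with re.VERBOSE, which lets you spread a pattern over several lines and annotate it. This is just a sketch on a short made-up string (the name sample is hypothetical), not the article's dateString.

```python
import re

# The same pattern as above, spelled out token by token.
pattern = re.compile(r"""
    \d{2}   # exactly two digits (day or month)
    [/-]    # a separator: slash or hyphen
    \d{2}   # exactly two digits (month or day)
    [/-]    # the second separator
    \d{4}   # exactly four digits (year)
""", re.VERBOSE)

sample = "24-11-2020 24/11/20 11/24/2020"  # hypothetical test string
matches = pattern.findall(sample)
print(matches)  # the 2-digit-year date is skipped
```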

That is a good start. But what if the year is written with only 2 digits? \d{2,4} matches a run of 2 to 4 digits, so it picks up the year in either a 2-digit or a 4-digit format:

re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}',dateString)

Output:

['24-11-2020', '24/11/2020', '24/11/20', '11/24/2020']
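One thing to keep in mind is that \d{2,4} also accepts a 3-digit "year". If that matters for your data, one possible tightening (an assumption on my part, not something the patterns in this post rely on) is to spell out the two allowed lengths and add word boundaries:

```python
import re

text = "24/11/2020, 24/11/20, 24/11/202"  # last one has a 3-digit year

# Loose version: a year of 2, 3 or 4 digits all match.
loose = re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', text)

# Stricter sketch: year must be exactly 4 or exactly 2 digits,
# and \b stops a match from ending in the middle of a digit run.
strict = re.findall(r'\b\d{2}[/-]\d{2}[/-](?:\d{4}|\d{2})\b', text)

print(loose)
print(strict)
```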

Okay, so what about months written with a single digit (i.e. January to September), and likewise single-digit days? \d{1,2} can help:

re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}',dateString)

Output:

['24-11-2020', '24/11/2020', '24/11/20', '11/24/2020', '5-11-2020', '5/11/2020', '5/11/20', '11/5/2020', '24-9-2020', '24/9/2020', '24/9/20', '9/24/2020', '3-1-2020', '3/1/2020', '3/1/20', '1/3/2020']
Photo by Manasvita S on Unsplash
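If you need the individual numbers rather than the whole date string, the numeric pattern above can be given named groups. The sketch below uses hypothetical group names (first, second, year) on purpose: the regex alone cannot tell whether 5/11 means 5 November or May 11, so deciding which number is the day is left to you.

```python
import re

# Same numeric pattern, with each number captured under a name.
pattern = re.compile(
    r'(?P<first>\d{1,2})[/-](?P<second>\d{1,2})[/-](?P<year>\d{2,4})'
)

text = "5/11/20 and 11/24/2020"  # hypothetical test string
pieces = [
    (m.group('first'), m.group('second'), m.group('year'))
    for m in pattern.finditer(text)
]
print(pieces)
```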

Cool! Isn’t it? Now, let’s work on cases where the months are spelled out. Let’s start with months written in a three-letter form:

re.findall(r'\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{2,4}',dateString)

Output:

['24 Nov 2020', '5 Nov 2020', '24 Sep 2020', '3 Jan 2020']

Now, let’s include the months that are written in the complete form:

re.findall(r'\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{2,4}',dateString)

Output:

['24 Nov 2020', '24 November 2020', '5 Nov 2020', '5 November 2020', '24 Sep 2020', '24 September 2020', '3 Jan 2020', '3 January 2020']

[a-z]* above matches any run of lowercase letters that follows the three-letter abbreviation, e.g. the "ember" part of "November". Because * also allows zero letters, the abbreviated forms are still matched.
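The flip side of [a-z]* is that it accepts any lowercase tail, so a word that merely starts like a month will match too. If that is a concern, one alternative (a sketch, not what the rest of this post uses) is to list the full month names as optional suffixes:

```python
import re

loose = r'\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{2,4}'

# Each abbreviation may only be followed by the rest of its own month name.
strict_month = (
    r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
    r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
)
strict = r'\d{1,2} ' + strict_month + r' \d{2,4}'

text = "24 November 2020, 3 Marble 2020"  # "Marble" is not a month
loose_matches = re.findall(loose, text)
strict_matches = re.findall(strict, text)
print(loose_matches)
print(strict_matches)
```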

Finally, let's cover the situation where the month comes first, as in "Nov 24, 2020". We make the leading day optional and allow an optional "day, " after the month:

re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{2,4}',dateString)

Output:

['24 Nov 2020','24 November 2020','Nov 24, 2020','November 24, 2020','5 Nov 2020','5 November 2020','Nov 5, 2020','November 5, 2020','24 Sep 2020','24 September 2020','Sep 24, 2020','September 24, 2020','3 Jan 2020','3 January 2020','Jan 3, 2020','January 3, 2020']
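To get both the numeric and the spelled-out dates in a single pass, the two final patterns can be joined with | (alternation). The helper below is a minimal sketch; the names MONTH, DATE_PATTERN and extract_dates are hypothetical, and the test sentence is made up.

```python
import re

MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'

DATE_PATTERN = re.compile(
    r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}'                       # 24-11-2020, 5/11/20
    r'|(?:\d{1,2} )?' + MONTH + r' (?:\d{1,2}, )?\d{2,4}'  # 24 Nov 2020, Nov 24, 2020
)

def extract_dates(text):
    """Return every date-like substring, in order of appearance."""
    # There are no capturing groups, so findall returns the full matches.
    return DATE_PATTERN.findall(text)

found = extract_dates("Due 24/11/20, moved from Nov 5, 2020 to 3 January 2021.")
print(found)
```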

As we saw, applying regex is an iterative process: it usually takes several tries, each refining the pattern, before we get what we need.

All the code is saved here.
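A natural next iteration is checking that a match is a real calendar date, since these patterns would happily accept something like 99/99/9999. One way to do that (a sketch under assumptions, not part of the original notebook) is to try datetime.strptime with a list of formats. The FORMATS list and parse_date name are hypothetical, numeric dates are assumed to be day-first, and %b / %B assume English month names in the current locale.

```python
from datetime import datetime

# Assumed formats: numeric dates are treated as day-first.
FORMATS = [
    "%d-%m-%Y", "%d/%m/%Y", "%d-%m-%y", "%d/%m/%y",
    "%d %b %Y", "%d %B %Y", "%b %d, %Y", "%B %d, %Y",
]

def parse_date(s):
    """Return a datetime.date for the first format that fits, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue  # this format did not fit; try the next one
    return None

print(parse_date("24/11/20"))
print(parse_date("99/99/9999"))  # not a real date
```

Note that under the day-first assumption a month-first string such as 11/24/2020 is rejected, so you would need to add "%m/%d/%Y" if your data uses that convention.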

So what next?

You have seen how we can apply regex to solve basic NLP problems such as extracting date information. Regex has many other applications, too. Have fun exploring it!

Photo by Haisheng Lin, My Dad

Actuary | ML Practitioner | Apply Tomorrow's Technology to Solve Today's Problems
