There are several ways to compare dates in a pandas DataFrame, and the approach you choose will depend on the problem you are trying to solve. Example
Initial steps before comparing two dates (Case Study)
Suppose you're employed at NOAA and have received a CSV file that contains data about observations made by an Earth observation satellite (You can download the CSV file used here):
Open csv file
The initial step is to save the CSV information into a DataFrame:
import pandas as pd
filename = 'NOAA_Williams_Flats.csv'
df = pd.read_csv(filename)
Output
We are interested in the column labeled "observation date/time"
Check columns types
It is recommended to verify the column types before comparing two dates in a pandas dataframe.
df.dtypes
which will give:
fire incident object
satellite scene id int64
latitude float64
longitude float64
image start date/time object
observation date/time object
satellite object
instrument object
scan domain object
FRP [MW] float64
MWIR brightness temp [K] float64
LWIR brightness temp [K] float64
pixel area [m^2] float64
solar zenith angle [deg] float64
dtype: object
The column "Observation Date/Time" is a data type called "object," which means it contains both numeric and non-numeric values.
Please note that in our example, it is common for the date to be stored in a format other than a datetime object. This applies specifically to our example.
Checking the number of rows for each date
It might helpful to explore the option of analyzing the number of rows for each date:
df['observation date/time'].value_counts()
gives
2019-08-08T05:41:25Z 48
2019-08-08T04:56:25Z 48
2019-08-08T05:01:24Z 48
2019-08-08T05:06:25Z 48
2019-08-08T04:41:24Z 48
..
2019-08-02T16:01:25Z 1
2019-08-02T15:56:24Z 1
2019-08-02T15:51:25Z 1
2019-08-02T15:46:24Z 1
2019-08-09T11:26:24Z 1
Name: observation date/time, Length: 1999, dtype: int64
Use the contains() method to choose the rows corresponding to a specific date.
If we only want to select rows in the dataframe that match the specific date of "2008-08-08," there is no need to convert it into a datetime object. We can simply use the "contains()" method:
df[ df['observation date/time'].str.contains('2019-08-08') ]
gives
df = df[ df['observation date/time'].str.contains('2019-08-08') ]
Converting dataframe column to a datetime type.
In order to compare two dates in Python, it is recommended to generate datetime objects. Therefore, a new datetime column should be created using the 'observation date/time':
To do that let's define a function:
import datetime
import re
def create_datetime_column(x):
obs_time = x['observation date/time']
m = re.search('(.*)-(.*)-(.*)T(.*):(.*):(.*)', obs_time)
year = int( m.group(1) )
month = int( m.group(2) )
day = int( m.group(3) )
hour = int( m.group(4) )
minute = int( m.group(5) )
return datetime.datetime(year, month, day, hour,minute)
and apply it to our dataframe:
df.apply(create_datetime_column, axis=1)
gives
22252 2019-08-08 00:01:00
22253 2019-08-08 00:01:00
22254 2019-08-08 00:01:00
22255 2019-08-08 00:01:00
22256 2019-08-08 00:01:00
...
29071 2019-08-08 23:56:00
29072 2019-08-08 23:56:00
29073 2019-08-08 23:56:00
29074 2019-08-08 23:56:00
29075 2019-08-08 23:56:00
Name: obs_time, Length: 6824, dtype: datetime64[ns]
Create a new column:
df['obs_time'] = df.apply(create_datetime_column, axis=1)
Comparing two dates
Suppose we intend to compare our data with another instrument at a specific time of observation.:
year = 2019
month = 8
day = 8
hour = 20
minute = 6
obs_datetime = datetime.datetime(year, month, day, hour,minute)
Using == operator
One way to determine if there are rows with identical observation times is by utilizing the "== operator."
df['obs_time'] == obs_datetime
returns a column of boolean values in a dataframe:
22252 False
22253 False
22254 False
22255 False
22256 False
...
29071 False
29072 False
29073 False
29074 False
29075 False
Name: obs_time, Length: 6824, dtype: bool
To only select the rows with identical observation times:
df[ df['obs_time'] == obs_datetime ][['latitude','longitude','observation date/time','obs_time','FRP [MW]']]
Using != operator
df['obs_time'] != obs_datetime
Select rows with different observation times
df[ df['obs_time'] != obs_datetime ]
Using > operator
df['obs_time'] > obs_datetime
choose the rows that correspond to observation made after.
df[ df['obs_time'] != obs_datetime ]
Using < operator
df['obs_time'] < obs_datetime
select rows that correspond to the previous observation
df[ df['obs_time'] != obs_datetime ]
Time difference
df['obs_time'] - obs_datetime
gives
22252 -1 days +03:55:00
22253 -1 days +03:55:00
22254 -1 days +03:55:00
22255 -1 days +03:55:00
22256 -1 days +03:55:00
...
29071 0 days 03:50:00
29072 0 days 03:50:00
29073 0 days 03:50:00
29074 0 days 03:50:00
29075 0 days 03:50:00
Name: obs_time, Length: 6824, dtype: timedelta64[ns]
Time difference in minutes
delta.dt.total_seconds() / 3600.0
gives
22252 -20.083333
22253 -20.083333
22254 -20.083333
22255 -20.083333
22256 -20.083333
...
29071 3.833333
29072 3.833333
29073 3.833333
29074 3.833333
29075 3.833333
Name: obs_time, Length: 6824, dtype: float64
References
Links | Site |
---|---|
contains() | pandas.pydata.org |
datetime | docs.python.org |
How to create a datetime column from year, month and day columns in pandas ? | moonbooks.org |