Introduction
Data manipulation is one of the most common tasks in data analysis. In many cases, we need to combine two or more datasets to gain new insights into our data. This process is known as merging or joining data frames.
In this tutorial, we will explore how to merge two Pandas DataFrames based on the nearest DateTime using merge_asof() function. This functionality allows us to perform a "fuzzy" merge between two DataFrames on non-exact match keys, by matching the closest key values.
Case study (the Williams Flats Fire)
Consider a real-life scenario: imagine you are employed at organizations like NASA or NOAA. Let's suppose you have gathered observations of a fire event, such as the Williams Flats Fire, using different instruments. These observations are stored in two separate databases with varying dates of observation.:
import pandas as pddf_master = pd.DataFrame(master_data)print( df_master )
Output
datetime fire_event_id frp_total43 2019-08-03 21:51:00 9 2191.1620044 2019-08-03 22:20:00 9 896.1034046 2019-08-04 00:20:00 9 3383.2078047 2019-08-04 00:34:00 9 993.8557048 2019-08-04 02:40:00 9 1438.9114049 2019-08-06 18:51:00 9 186.7856451 2019-08-06 20:33:00 9 415.3580652 2019-08-06 21:53:00 9 767.1019058 2019-08-07 23:15:00 9 9265.2680059 2019-08-08 00:55:00 9 12978.8290060 2019-08-08 02:37:00 9 6355.1260061 2019-08-09 01:13:00 9 2730.6934062 2019-08-09 01:20:00 9 3374.9740063 2019-08-09 02:04:00 9 1615.5942064 2019-08-09 02:19:00 9 1416.48070df_viirs = pd.DataFrame(viirs_data)print(df_viirs)
Output
datetime fire_event_id FRP_agg134 2019-08-03 08:49:00 9 667.637268746 2019-08-03 09:40:00 9 647.292480138 2019-08-03 10:29:00 9 567.470764750 2019-08-03 11:20:00 9 849.016968753 2019-08-03 19:26:00 9 657.149475... ... ... ...1161 2019-09-01 07:18:00 1 0.000000560 2019-09-01 08:08:00 1 0.0000001171 2019-09-01 08:58:00 1 0.0000001179 2019-09-01 18:37:00 1 0.000000566 2019-09-01 19:27:00 1 0.000000[113 rows x 3 columns]
Now, your goal is to merge these databases, specifically focusing on aligning observations that are closest to each other within a given time frame.
If you would like to replicate this example, I have included the data at the end of this article.
Merging Pandas DataFrames based on the nearest DateTime using merge_asof()
Preparing the data
Before manipulating a dataframe, it's advisable to check the data types of each column using dtypes
df_viirs.dtypes
We observe that the column named "datetime" is not a datetime Python object:
datetime objectfire_event_id int64FRP_agg float64dtype: object
To convert it into a datetime, simply use the following code:
df_master['datetime'] = pd.to_datetime( df_master['datetime'] )df_viirs['datetime'] = pd.to_datetime( df_viirs['datetime'] )
If we check the types again
df_viirs.dtypes
we get
datetime datetime64[ns]fire_event_id int64FRP_agg float64dtype: object
Before merging the two dataframes, it can also be helpful to sort your data by datetime. In our example, the dates are already sorted, but this is not always the case. Failure to do so may result in the following error message:
ValueError: left keys must be sorted
To prevent this, simply sort your data using the following code:
df_master.sort_values(by=['datetime'],inplace=True)df_viirs.sort_values(by=['datetime'],inplace=True)
Merging the two Pandas DataFrames based on the nearest DateTime
Now that our data is prepared, we can merge our two dataframes. For instance, if we aim to identify observations within a 30-minute range, we can use the following code:
pd.merge_asof(df_master,df_viirs, on="datetime", by="fire_event_id", tolerance=pd.Timedelta("30m"))
This code merges the two dataframes based on the nearest datetime, with the VIIRS observation coming before the master observation:
datetime fire_event_id frp_total FRP_agg0 2019-08-03 21:51:00 9 2191.16200 NaN1 2019-08-03 22:20:00 9 896.10340 2185.7075202 2019-08-04 00:20:00 9 3383.20780 NaN3 2019-08-04 00:34:00 9 993.85570 NaN4 2019-08-04 02:40:00 9 1438.91140 NaN5 2019-08-06 18:51:00 9 186.78564 NaN6 2019-08-06 20:33:00 9 415.35806 472.8033457 2019-08-06 21:53:00 9 767.10190 689.3959358 2019-08-07 23:15:00 9 9265.26800 NaN9 2019-08-08 00:55:00 9 12978.82900 NaN10 2019-08-08 02:37:00 9 6355.12600 NaN11 2019-08-09 01:13:00 9 2730.69340 NaN12 2019-08-09 01:20:00 9 3374.97400 NaN13 2019-08-09 02:04:00 9 1615.59420 NaN14 2019-08-09 02:19:00 9 1416.48070 NaN
Note that:
- We can specify the direction of the merge (forward or backward) using
directionargument inmerge_asof(). By default, it performs a forward merge. - If the nearest key value is not found, the corresponding row will have NaN values for all columns from the other DataFrame.
To find an observation within 30 minutes, regardless of whether it's before or after, simply use the "direction" argument with "nearest."
pd.merge_asof(df_master,df_viirs, on="datetime", by="fire_event_id", tolerance=pd.Timedelta("30m"), direction="nearest")
This code merges the two dataframes based on the nearest datetime:
datetime fire_event_id frp_total FRP_agg0 2019-08-03 21:51:00 9 2191.16200 2185.7075201 2019-08-03 22:20:00 9 896.10340 2185.7075202 2019-08-04 00:20:00 9 3383.20780 NaN3 2019-08-04 00:34:00 9 993.85570 NaN4 2019-08-04 02:40:00 9 1438.91140 NaN5 2019-08-06 18:51:00 9 186.78564 111.6256186 2019-08-06 20:33:00 9 415.35806 472.8033457 2019-08-06 21:53:00 9 767.10190 689.3959358 2019-08-07 23:15:00 9 9265.26800 NaN9 2019-08-08 00:55:00 9 12978.82900 NaN10 2019-08-08 02:37:00 9 6355.12600 NaN11 2019-08-09 01:13:00 9 2730.69340 NaN12 2019-08-09 01:20:00 9 3374.97400 NaN13 2019-08-09 02:04:00 9 1615.59420 NaN14 2019-08-09 02:19:00 9 1416.48070 NaN
References
| Links | Site |
|---|---|
| pandas.merge_asof | pandas.pydata.org |
| pandas.DataFrame.dtypes | pandas.pydata.org |
| pandas.to_datetime | pandas.pydata.org |
| pandas.DataFrame.sort_values | pandas.pydata.org |
Williams Flats Fire Datasets
VIIRS Data
viirs_data = {'datetime': {134: '2019-08-03 08:49:00',746: '2019-08-03 09:40:00',138: '2019-08-03 10:29:00',750: '2019-08-03 11:20:00',753: '2019-08-03 19:26:00',146: '2019-08-03 20:16:00',755: '2019-08-03 21:06:00',147: '2019-08-03 21:57:00',759: '2019-08-04 09:21:00',151: '2019-08-04 10:11:00',763: '2019-08-04 11:01:00',156: '2019-08-04 19:57:00',769: '2019-08-04 20:47:00',772: '2019-08-04 20:48:00',158: '2019-08-04 21:37:00',774: '2019-08-05 09:02:00',163: '2019-08-05 09:52:00',776: '2019-08-05 10:42:00',165: '2019-08-05 19:38:00',778: '2019-08-05 20:29:00',167: '2019-08-05 21:18:00',780: '2019-08-06 08:43:00',169: '2019-08-06 09:33:00',784: '2019-08-06 10:23:00',787: '2019-08-06 10:24:00',173: '2019-08-06 11:13:00',175: '2019-08-06 19:19:00',791: '2019-08-06 20:10:00',179: '2019-08-06 20:59:00',794: '2019-08-06 21:51:00',183: '2019-08-07 09:14:00',798: '2019-08-07 10:05:00',189: '2019-08-07 10:54:00',804: '2019-08-07 19:52:00',197: '2019-08-07 20:40:00',809: '2019-08-07 21:31:00',203: '2019-08-08 08:55:00',815: '2019-08-08 09:47:00',207: '2019-08-08 10:35:00',819: '2019-08-08 11:26:00',821: '2019-08-08 19:33:00',211: '2019-08-08 20:21:00',825: '2019-08-08 21:12:00',215: '2019-08-09 08:35:00',829: '2019-08-09 09:28:00',216: '2019-08-09 10:16:00',830: '2019-08-09 11:07:00',831: '2019-08-09 19:14:00',217: '2019-08-09 20:03:00',832: '2019-08-09 20:53:00',218: '2019-08-09 21:43:00',833: '2019-08-10 09:09:00',219: '2019-08-10 09:57:00',220: '2019-08-10 09:58:00',834: '2019-08-10 10:48:00',221: '2019-08-10 19:45:00',835: '2019-08-10 20:34:00',836: '2019-08-10 20:37:00',837: '2019-08-10 20:39:00',222: '2019-08-10 21:24:00',983: '2019-08-23 06:47:00',342: '2019-08-23 07:36:00',990: '2019-08-23 08:28:00',998: '2019-08-23 18:05:00',350: '2019-08-23 18:55:00',1006: '2019-08-23 19:46:00',358: '2019-08-24 07:17:00',1014: '2019-08-24 08:09:00',370: '2019-08-24 18:36:00',1022: '2019-08-24 19:27:00',378: '2019-08-25 06:58:00',1030: '2019-08-25 07:50:00',380: '2019-08-25 08:38:00',382: '2019-08-25 18:17:00',1032: '2019-08-25 19:08:00',384: '2019-08-25 19:58:00',1034: '2019-08-26 07:31:00',387: '2019-08-26 08:21:00',388: '2019-08-26 17:58:00',1036: '2019-08-26 18:49:00',390: '2019-08-26 19:39:00',1038: '2019-08-27 07:12:00',392: '2019-08-27 08:02:00',1041: '2019-08-27 08:53:00',1042: '2019-08-27 18:30:00',394: '2019-08-27 19:20:00',1044: '2019-08-28 06:53:00',396: '2019-08-28 07:43:00',1046: '2019-08-28 08:34:00',1048: '2019-08-28 18:11:00',398: '2019-08-28 19:01:00',1050: '2019-08-28 19:52:00',407: '2019-08-29 07:24:00',1059: '2019-08-29 08:15:00',416: '2019-08-29 18:42:00',1061: '2019-08-29 19:34:00',433: '2019-08-30 07:05:00',1079: '2019-08-30 07:56:00',448: '2019-08-30 08:44:00',460: '2019-08-30 18:23:00',1100: '2019-08-30 19:15:00',473: '2019-08-30 20:04:00',485: '2019-08-31 06:46:00',1119: '2019-08-31 07:37:00',501: '2019-08-31 08:26:00',520: '2019-08-31 18:04:00',1142: '2019-08-31 18:56:00',543: '2019-08-31 19:45:00',1161: '2019-09-01 07:18:00',560: '2019-09-01 08:08:00',1171: '2019-09-01 08:58:00',1179: '2019-09-01 18:37:00',566: '2019-09-01 19:27:00'},'fire_event_id': {134: 9,746: 9,138: 9,750: 9,753: 9,146: 9,755: 9,147: 9,759: 9,151: 9,763: 9,156: 9,769: 9,772: 9,158: 9,774: 9,163: 9,776: 9,165: 9,778: 9,167: 9,780: 9,169: 9,784: 9,787: 9,173: 9,175: 9,791: 9,179: 9,794: 9,183: 9,798: 9,189: 9,804: 9,197: 9,809: 9,203: 9,815: 9,207: 9,819: 9,821: 9,211: 9,825: 9,215: 9,829: 9,216: 9,830: 9,831: 9,217: 9,832: 9,218: 9,833: 9,219: 9,220: 9,834: 9,221: 9,835: 9,836: 9,837: 9,222: 9,983: 1,342: 1,990: 1,998: 1,350: 1,1006: 1,358: 1,1014: 1,370: 1,1022: 1,378: 1,1030: 1,380: 1,382: 1,1032: 1,384: 1,1034: 1,387: 1,388: 1,1036: 1,390: 1,1038: 1,392: 1,1041: 1,1042: 1,394: 1,1044: 1,396: 1,1046: 1,1048: 1,398: 1,1050: 1,407: 1,1059: 1,416: 1,1061: 1,433: 1,1079: 1,448: 1,460: 1,1100: 1,473: 1,485: 1,1119: 1,501: 1,520: 1,1142: 1,543: 1,1161: 1,560: 1,1171: 1,1179: 1,566: 1},'FRP_agg': {134: 667.6372680664062,746: 647.29248046875,138: 567.4707641601562,750: 849.0169677734375,753: 657.1494750976562,146: 1278.6209716796875,755: 1419.724365234375,147: 2185.70751953125,759: 577.462890625,151: 566.60205078125,763: 223.16781616210938,156: 809.7249145507812,769: 0.0,772: 1269.47705078125,158: 3092.656494140625,774: 266.54608154296875,163: 254.5827178955078,776: 190.95777893066406,165: 814.1732788085938,778: 4077.22509765625,167: 7335.8505859375,780: 120.2669677734375,169: 321.4586181640625,784: 223.98211669921875,787: 5.1728901863098145,173: 126.02513885498047,175: 111.62561798095703,791: 472.8033447265625,179: 655.9718017578125,794: 689.3959350585938,183: 1468.4600830078125,798: 1195.763671875,189: 408.263916015625,804: 1739.2486572265625,197: 3219.11474609375,809: 6170.54150390625,203: 4521.958984375,815: 4032.3623046875,207: 2357.6123046875,819: 675.562744140625,821: 1633.270751953125,211: 1258.2154541015625,825: 1986.114501953125,215: 21.015361785888672,829: 150.6968231201172,216: 0.0,830: 0.0,831: 0.0,217: 0.0,832: 0.0,218: 0.0,833: 0.0,219: 18.392784118652344,220: 0.0,834: 0.0,221: 0.0,835: 0.0,836: 0.0,837: 0.0,222: 39.869178771972656,983: 0.0,342: 0.0,990: 0.0,998: 0.0,350: 55.66621398925781,1006: 36.34450149536133,358: 0.5279630422592163,1014: 0.0,370: 0.0,1022: 0.0,378: 0.0,1030: 0.0,380: 0.0,382: 0.0,1032: 0.0,384: 0.0,1034: 0.0,387: 0.0,388: 0.0,1036: 0.0,390: 0.0,1038: 0.0,392: 0.0,1041: 0.0,1042: 0.0,394: 0.0,1044: 0.0,396: 0.0,1046: 0.0,1048: 0.0,398: 2.0939106941223145,1050: 0.0,407: 0.0,1059: 0.0,416: 0.0,1061: 0.0,433: 0.0,1079: 0.0,448: 0.0,460: 10.144418716430664,1100: 117.81967163085938,473: 81.00775909423828,485: 0.0,1119: 0.0,501: 0.0,520: 270.2600402832031,1142: 116.3390121459961,543: 226.1159210205078,1161: 0.0,560: 0.0,1171: 0.0,1179: 0.0,566: 0.0}}
MASTER Data
master_data = {'datetime': {43: '2019-08-03 21:51:00',44: '2019-08-03 22:20:00',46: '2019-08-04 00:20:00',47: '2019-08-04 00:34:00',48: '2019-08-04 02:40:00',49: '2019-08-06 18:51:00',51: '2019-08-06 20:33:00',52: '2019-08-06 21:53:00',58: '2019-08-07 23:15:00',59: '2019-08-08 00:55:00',60: '2019-08-08 02:37:00',61: '2019-08-09 01:13:00',62: '2019-08-09 01:20:00',63: '2019-08-09 02:04:00',64: '2019-08-09 02:19:00'},'fire_event_id': {43: 9,44: 9,46: 9,47: 9,48: 9,49: 9,51: 9,52: 9,58: 9,59: 9,60: 9,61: 9,62: 9,63: 9,64: 9},'frp_total': {43: 2191.162,44: 896.1034,46: 3383.2078,47: 993.8557,48: 1438.9114,49: 186.78564,51: 415.35806,52: 767.1019,58: 9265.268,59: 12978.829,60: 6355.126,61: 2730.6934,62: 3374.974,63: 1615.5942,64: 1416.4807}}
