How to merge two Pandas DataFrames based on the nearest DateTime ?


Introduction

Data manipulation is one of the most common tasks in data analysis. In many cases, we need to combine two or more datasets to gain new insights into our data. This process is known as merging or joining data frames.

In this tutorial, we will explore how to merge two Pandas DataFrames based on the nearest DateTime using merge_asof() function. This functionality allows us to perform a "fuzzy" merge between two DataFrames on non-exact match keys, by matching the closest key values.

Case study (the Williams Flats Fire)

Consider a real-life scenario: imagine you are employed at organizations like NASA or NOAA. Let's suppose you have gathered observations of a fire event, such as the Williams Flats Fire, using different instruments. These observations are stored in two separate databases with varying dates of observation.:

import pandas as pd

df_master = pd.DataFrame(master_data)

print( df_master )

Output

               datetime  fire_event_id    frp_total
43  2019-08-03 21:51:00              9   2191.16200
44  2019-08-03 22:20:00              9    896.10340
46  2019-08-04 00:20:00              9   3383.20780
47  2019-08-04 00:34:00              9    993.85570
48  2019-08-04 02:40:00              9   1438.91140
49  2019-08-06 18:51:00              9    186.78564
51  2019-08-06 20:33:00              9    415.35806
52  2019-08-06 21:53:00              9    767.10190
58  2019-08-07 23:15:00              9   9265.26800
59  2019-08-08 00:55:00              9  12978.82900
60  2019-08-08 02:37:00              9   6355.12600
61  2019-08-09 01:13:00              9   2730.69340
62  2019-08-09 01:20:00              9   3374.97400
63  2019-08-09 02:04:00              9   1615.59420
64  2019-08-09 02:19:00              9   1416.48070

df_viirs = pd.DataFrame(viirs_data)

print(df_viirs)

Output

                 datetime  fire_event_id     FRP_agg
134   2019-08-03 08:49:00              9  667.637268
746   2019-08-03 09:40:00              9  647.292480
138   2019-08-03 10:29:00              9  567.470764
750   2019-08-03 11:20:00              9  849.016968
753   2019-08-03 19:26:00              9  657.149475
...                   ...            ...         ...
1161  2019-09-01 07:18:00              1    0.000000
560   2019-09-01 08:08:00              1    0.000000
1171  2019-09-01 08:58:00              1    0.000000
1179  2019-09-01 18:37:00              1    0.000000
566   2019-09-01 19:27:00              1    0.000000

[113 rows x 3 columns]

Now, your goal is to merge these databases, specifically focusing on aligning observations that are closest to each other within a given time frame.

If you would like to replicate this example, I have included the data at the end of this article.

Merging Pandas DataFrames based on the nearest DateTime using merge_asof()

Preparing the data

Before manipulating a dataframe, it's advisable to check the data types of each column using dtypes

df_viirs.dtypes

We observe that the column named "datetime" is not a datetime Python object:

datetime          object
fire_event_id      int64
FRP_agg          float64
dtype: object

To convert it into a datetime, simply use the following code:

df_master['datetime'] = pd.to_datetime( df_master['datetime'] )
df_viirs['datetime'] = pd.to_datetime( df_viirs['datetime'] )

If we check the types again

df_viirs.dtypes

we get

datetime         datetime64[ns]
fire_event_id             int64
FRP_agg                 float64
dtype: object

Before merging the two dataframes, it can also be helpful to sort your data by datetime. In our example, the dates are already sorted, but this is not always the case. Failure to do so may result in the following error message:

ValueError: left keys must be sorted

To prevent this, simply sort your data using the following code:

df_master.sort_values(by=['datetime'],inplace=True)
df_viirs.sort_values(by=['datetime'],inplace=True)

Merging the two Pandas DataFrames based on the nearest DateTime

Now that our data is prepared, we can merge our two dataframes. For instance, if we aim to identify observations within a 30-minute range, we can use the following code:

pd.merge_asof(
    df_master, 
    df_viirs, on="datetime", by="fire_event_id", tolerance=pd.Timedelta("30m")
)

This code merges the two dataframes based on the nearest datetime, with the VIIRS observation coming before the master observation:

              datetime  fire_event_id    frp_total      FRP_agg
0  2019-08-03 21:51:00              9   2191.16200          NaN
1  2019-08-03 22:20:00              9    896.10340  2185.707520
2  2019-08-04 00:20:00              9   3383.20780          NaN
3  2019-08-04 00:34:00              9    993.85570          NaN
4  2019-08-04 02:40:00              9   1438.91140          NaN
5  2019-08-06 18:51:00              9    186.78564          NaN
6  2019-08-06 20:33:00              9    415.35806   472.803345
7  2019-08-06 21:53:00              9    767.10190   689.395935
8  2019-08-07 23:15:00              9   9265.26800          NaN
9  2019-08-08 00:55:00              9  12978.82900          NaN
10 2019-08-08 02:37:00              9   6355.12600          NaN
11 2019-08-09 01:13:00              9   2730.69340          NaN
12 2019-08-09 01:20:00              9   3374.97400          NaN
13 2019-08-09 02:04:00              9   1615.59420          NaN
14 2019-08-09 02:19:00              9   1416.48070          NaN

Note that:

  • We can specify the direction of the merge (forward or backward) using direction argument in merge_asof(). By default, it performs a forward merge.
  • If the nearest key value is not found, the corresponding row will have NaN values for all columns from the other DataFrame.

To find an observation within 30 minutes, regardless of whether it's before or after, simply use the "direction" argument with "nearest."

pd.merge_asof(
    df_master, 
    df_viirs, on="datetime", by="fire_event_id", tolerance=pd.Timedelta("30m"), direction="nearest"
)

This code merges the two dataframes based on the nearest datetime:

              datetime  fire_event_id    frp_total      FRP_agg
0  2019-08-03 21:51:00              9   2191.16200  2185.707520
1  2019-08-03 22:20:00              9    896.10340  2185.707520
2  2019-08-04 00:20:00              9   3383.20780          NaN
3  2019-08-04 00:34:00              9    993.85570          NaN
4  2019-08-04 02:40:00              9   1438.91140          NaN
5  2019-08-06 18:51:00              9    186.78564   111.625618
6  2019-08-06 20:33:00              9    415.35806   472.803345
7  2019-08-06 21:53:00              9    767.10190   689.395935
8  2019-08-07 23:15:00              9   9265.26800          NaN
9  2019-08-08 00:55:00              9  12978.82900          NaN
10 2019-08-08 02:37:00              9   6355.12600          NaN
11 2019-08-09 01:13:00              9   2730.69340          NaN
12 2019-08-09 01:20:00              9   3374.97400          NaN
13 2019-08-09 02:04:00              9   1615.59420          NaN
14 2019-08-09 02:19:00              9   1416.48070          NaN

References

Links Site
pandas.merge_asof pandas.pydata.org
pandas.DataFrame.dtypes pandas.pydata.org
pandas.to_datetime pandas.pydata.org
pandas.DataFrame.sort_values pandas.pydata.org

Williams Flats Fire Datasets

VIIRS Data

viirs_data = {'datetime': {134: '2019-08-03 08:49:00',
  746: '2019-08-03 09:40:00',
  138: '2019-08-03 10:29:00',
  750: '2019-08-03 11:20:00',
  753: '2019-08-03 19:26:00',
  146: '2019-08-03 20:16:00',
  755: '2019-08-03 21:06:00',
  147: '2019-08-03 21:57:00',
  759: '2019-08-04 09:21:00',
  151: '2019-08-04 10:11:00',
  763: '2019-08-04 11:01:00',
  156: '2019-08-04 19:57:00',
  769: '2019-08-04 20:47:00',
  772: '2019-08-04 20:48:00',
  158: '2019-08-04 21:37:00',
  774: '2019-08-05 09:02:00',
  163: '2019-08-05 09:52:00',
  776: '2019-08-05 10:42:00',
  165: '2019-08-05 19:38:00',
  778: '2019-08-05 20:29:00',
  167: '2019-08-05 21:18:00',
  780: '2019-08-06 08:43:00',
  169: '2019-08-06 09:33:00',
  784: '2019-08-06 10:23:00',
  787: '2019-08-06 10:24:00',
  173: '2019-08-06 11:13:00',
  175: '2019-08-06 19:19:00',
  791: '2019-08-06 20:10:00',
  179: '2019-08-06 20:59:00',
  794: '2019-08-06 21:51:00',
  183: '2019-08-07 09:14:00',
  798: '2019-08-07 10:05:00',
  189: '2019-08-07 10:54:00',
  804: '2019-08-07 19:52:00',
  197: '2019-08-07 20:40:00',
  809: '2019-08-07 21:31:00',
  203: '2019-08-08 08:55:00',
  815: '2019-08-08 09:47:00',
  207: '2019-08-08 10:35:00',
  819: '2019-08-08 11:26:00',
  821: '2019-08-08 19:33:00',
  211: '2019-08-08 20:21:00',
  825: '2019-08-08 21:12:00',
  215: '2019-08-09 08:35:00',
  829: '2019-08-09 09:28:00',
  216: '2019-08-09 10:16:00',
  830: '2019-08-09 11:07:00',
  831: '2019-08-09 19:14:00',
  217: '2019-08-09 20:03:00',
  832: '2019-08-09 20:53:00',
  218: '2019-08-09 21:43:00',
  833: '2019-08-10 09:09:00',
  219: '2019-08-10 09:57:00',
  220: '2019-08-10 09:58:00',
  834: '2019-08-10 10:48:00',
  221: '2019-08-10 19:45:00',
  835: '2019-08-10 20:34:00',
  836: '2019-08-10 20:37:00',
  837: '2019-08-10 20:39:00',
  222: '2019-08-10 21:24:00',
  983: '2019-08-23 06:47:00',
  342: '2019-08-23 07:36:00',
  990: '2019-08-23 08:28:00',
  998: '2019-08-23 18:05:00',
  350: '2019-08-23 18:55:00',
  1006: '2019-08-23 19:46:00',
  358: '2019-08-24 07:17:00',
  1014: '2019-08-24 08:09:00',
  370: '2019-08-24 18:36:00',
  1022: '2019-08-24 19:27:00',
  378: '2019-08-25 06:58:00',
  1030: '2019-08-25 07:50:00',
  380: '2019-08-25 08:38:00',
  382: '2019-08-25 18:17:00',
  1032: '2019-08-25 19:08:00',
  384: '2019-08-25 19:58:00',
  1034: '2019-08-26 07:31:00',
  387: '2019-08-26 08:21:00',
  388: '2019-08-26 17:58:00',
  1036: '2019-08-26 18:49:00',
  390: '2019-08-26 19:39:00',
  1038: '2019-08-27 07:12:00',
  392: '2019-08-27 08:02:00',
  1041: '2019-08-27 08:53:00',
  1042: '2019-08-27 18:30:00',
  394: '2019-08-27 19:20:00',
  1044: '2019-08-28 06:53:00',
  396: '2019-08-28 07:43:00',
  1046: '2019-08-28 08:34:00',
  1048: '2019-08-28 18:11:00',
  398: '2019-08-28 19:01:00',
  1050: '2019-08-28 19:52:00',
  407: '2019-08-29 07:24:00',
  1059: '2019-08-29 08:15:00',
  416: '2019-08-29 18:42:00',
  1061: '2019-08-29 19:34:00',
  433: '2019-08-30 07:05:00',
  1079: '2019-08-30 07:56:00',
  448: '2019-08-30 08:44:00',
  460: '2019-08-30 18:23:00',
  1100: '2019-08-30 19:15:00',
  473: '2019-08-30 20:04:00',
  485: '2019-08-31 06:46:00',
  1119: '2019-08-31 07:37:00',
  501: '2019-08-31 08:26:00',
  520: '2019-08-31 18:04:00',
  1142: '2019-08-31 18:56:00',
  543: '2019-08-31 19:45:00',
  1161: '2019-09-01 07:18:00',
  560: '2019-09-01 08:08:00',
  1171: '2019-09-01 08:58:00',
  1179: '2019-09-01 18:37:00',
  566: '2019-09-01 19:27:00'},
 'fire_event_id': {134: 9,
  746: 9,
  138: 9,
  750: 9,
  753: 9,
  146: 9,
  755: 9,
  147: 9,
  759: 9,
  151: 9,
  763: 9,
  156: 9,
  769: 9,
  772: 9,
  158: 9,
  774: 9,
  163: 9,
  776: 9,
  165: 9,
  778: 9,
  167: 9,
  780: 9,
  169: 9,
  784: 9,
  787: 9,
  173: 9,
  175: 9,
  791: 9,
  179: 9,
  794: 9,
  183: 9,
  798: 9,
  189: 9,
  804: 9,
  197: 9,
  809: 9,
  203: 9,
  815: 9,
  207: 9,
  819: 9,
  821: 9,
  211: 9,
  825: 9,
  215: 9,
  829: 9,
  216: 9,
  830: 9,
  831: 9,
  217: 9,
  832: 9,
  218: 9,
  833: 9,
  219: 9,
  220: 9,
  834: 9,
  221: 9,
  835: 9,
  836: 9,
  837: 9,
  222: 9,
  983: 1,
  342: 1,
  990: 1,
  998: 1,
  350: 1,
  1006: 1,
  358: 1,
  1014: 1,
  370: 1,
  1022: 1,
  378: 1,
  1030: 1,
  380: 1,
  382: 1,
  1032: 1,
  384: 1,
  1034: 1,
  387: 1,
  388: 1,
  1036: 1,
  390: 1,
  1038: 1,
  392: 1,
  1041: 1,
  1042: 1,
  394: 1,
  1044: 1,
  396: 1,
  1046: 1,
  1048: 1,
  398: 1,
  1050: 1,
  407: 1,
  1059: 1,
  416: 1,
  1061: 1,
  433: 1,
  1079: 1,
  448: 1,
  460: 1,
  1100: 1,
  473: 1,
  485: 1,
  1119: 1,
  501: 1,
  520: 1,
  1142: 1,
  543: 1,
  1161: 1,
  560: 1,
  1171: 1,
  1179: 1,
  566: 1},
 'FRP_agg': {134: 667.6372680664062,
  746: 647.29248046875,
  138: 567.4707641601562,
  750: 849.0169677734375,
  753: 657.1494750976562,
  146: 1278.6209716796875,
  755: 1419.724365234375,
  147: 2185.70751953125,
  759: 577.462890625,
  151: 566.60205078125,
  763: 223.16781616210938,
  156: 809.7249145507812,
  769: 0.0,
  772: 1269.47705078125,
  158: 3092.656494140625,
  774: 266.54608154296875,
  163: 254.5827178955078,
  776: 190.95777893066406,
  165: 814.1732788085938,
  778: 4077.22509765625,
  167: 7335.8505859375,
  780: 120.2669677734375,
  169: 321.4586181640625,
  784: 223.98211669921875,
  787: 5.1728901863098145,
  173: 126.02513885498047,
  175: 111.62561798095703,
  791: 472.8033447265625,
  179: 655.9718017578125,
  794: 689.3959350585938,
  183: 1468.4600830078125,
  798: 1195.763671875,
  189: 408.263916015625,
  804: 1739.2486572265625,
  197: 3219.11474609375,
  809: 6170.54150390625,
  203: 4521.958984375,
  815: 4032.3623046875,
  207: 2357.6123046875,
  819: 675.562744140625,
  821: 1633.270751953125,
  211: 1258.2154541015625,
  825: 1986.114501953125,
  215: 21.015361785888672,
  829: 150.6968231201172,
  216: 0.0,
  830: 0.0,
  831: 0.0,
  217: 0.0,
  832: 0.0,
  218: 0.0,
  833: 0.0,
  219: 18.392784118652344,
  220: 0.0,
  834: 0.0,
  221: 0.0,
  835: 0.0,
  836: 0.0,
  837: 0.0,
  222: 39.869178771972656,
  983: 0.0,
  342: 0.0,
  990: 0.0,
  998: 0.0,
  350: 55.66621398925781,
  1006: 36.34450149536133,
  358: 0.5279630422592163,
  1014: 0.0,
  370: 0.0,
  1022: 0.0,
  378: 0.0,
  1030: 0.0,
  380: 0.0,
  382: 0.0,
  1032: 0.0,
  384: 0.0,
  1034: 0.0,
  387: 0.0,
  388: 0.0,
  1036: 0.0,
  390: 0.0,
  1038: 0.0,
  392: 0.0,
  1041: 0.0,
  1042: 0.0,
  394: 0.0,
  1044: 0.0,
  396: 0.0,
  1046: 0.0,
  1048: 0.0,
  398: 2.0939106941223145,
  1050: 0.0,
  407: 0.0,
  1059: 0.0,
  416: 0.0,
  1061: 0.0,
  433: 0.0,
  1079: 0.0,
  448: 0.0,
  460: 10.144418716430664,
  1100: 117.81967163085938,
  473: 81.00775909423828,
  485: 0.0,
  1119: 0.0,
  501: 0.0,
  520: 270.2600402832031,
  1142: 116.3390121459961,
  543: 226.1159210205078,
  1161: 0.0,
  560: 0.0,
  1171: 0.0,
  1179: 0.0,
  566: 0.0}}

MASTER Data

master_data = {'datetime': {43: '2019-08-03 21:51:00',
  44: '2019-08-03 22:20:00',
  46: '2019-08-04 00:20:00',
  47: '2019-08-04 00:34:00',
  48: '2019-08-04 02:40:00',
  49: '2019-08-06 18:51:00',
  51: '2019-08-06 20:33:00',
  52: '2019-08-06 21:53:00',
  58: '2019-08-07 23:15:00',
  59: '2019-08-08 00:55:00',
  60: '2019-08-08 02:37:00',
  61: '2019-08-09 01:13:00',
  62: '2019-08-09 01:20:00',
  63: '2019-08-09 02:04:00',
  64: '2019-08-09 02:19:00'},
 'fire_event_id': {43: 9,
  44: 9,
  46: 9,
  47: 9,
  48: 9,
  49: 9,
  51: 9,
  52: 9,
  58: 9,
  59: 9,
  60: 9,
  61: 9,
  62: 9,
  63: 9,
  64: 9},
 'frp_total': {43: 2191.162,
  44: 896.1034,
  46: 3383.2078,
  47: 993.8557,
  48: 1438.9114,
  49: 186.78564,
  51: 415.35806,
  52: 767.1019,
  58: 9265.268,
  59: 12978.829,
  60: 6355.126,
  61: 2730.6934,
  62: 3374.974,
  63: 1615.5942,
  64: 1416.4807}}