Introduction
Data manipulation is one of the most common tasks in data analysis. In many cases, we need to combine two or more datasets to gain new insights into our data. This process is known as merging or joining data frames.
In this tutorial, we will explore how to merge two Pandas DataFrames based on the nearest DateTime using merge_asof() function. This functionality allows us to perform a "fuzzy" merge between two DataFrames on non-exact match keys, by matching the closest key values.
Case study (the Williams Flats Fire)
Consider a real-life scenario: imagine you are employed at organizations like NASA or NOAA. Let's suppose you have gathered observations of a fire event, such as the Williams Flats Fire, using different instruments. These observations are stored in two separate databases with varying dates of observation.:
import pandas as pd
df_master = pd.DataFrame(master_data)
print( df_master )
Output
datetime fire_event_id frp_total
43 2019-08-03 21:51:00 9 2191.16200
44 2019-08-03 22:20:00 9 896.10340
46 2019-08-04 00:20:00 9 3383.20780
47 2019-08-04 00:34:00 9 993.85570
48 2019-08-04 02:40:00 9 1438.91140
49 2019-08-06 18:51:00 9 186.78564
51 2019-08-06 20:33:00 9 415.35806
52 2019-08-06 21:53:00 9 767.10190
58 2019-08-07 23:15:00 9 9265.26800
59 2019-08-08 00:55:00 9 12978.82900
60 2019-08-08 02:37:00 9 6355.12600
61 2019-08-09 01:13:00 9 2730.69340
62 2019-08-09 01:20:00 9 3374.97400
63 2019-08-09 02:04:00 9 1615.59420
64 2019-08-09 02:19:00 9 1416.48070
df_viirs = pd.DataFrame(viirs_data)
print(df_viirs)
Output
datetime fire_event_id FRP_agg
134 2019-08-03 08:49:00 9 667.637268
746 2019-08-03 09:40:00 9 647.292480
138 2019-08-03 10:29:00 9 567.470764
750 2019-08-03 11:20:00 9 849.016968
753 2019-08-03 19:26:00 9 657.149475
... ... ... ...
1161 2019-09-01 07:18:00 1 0.000000
560 2019-09-01 08:08:00 1 0.000000
1171 2019-09-01 08:58:00 1 0.000000
1179 2019-09-01 18:37:00 1 0.000000
566 2019-09-01 19:27:00 1 0.000000
[113 rows x 3 columns]
Now, your goal is to merge these databases, specifically focusing on aligning observations that are closest to each other within a given time frame.
If you would like to replicate this example, I have included the data at the end of this article.
Merging Pandas DataFrames based on the nearest DateTime using merge_asof()
Preparing the data
Before manipulating a dataframe, it's advisable to check the data types of each column using dtypes
df_viirs.dtypes
We observe that the column named "datetime" is not a datetime Python object:
datetime object
fire_event_id int64
FRP_agg float64
dtype: object
To convert it into a datetime, simply use the following code:
df_master['datetime'] = pd.to_datetime( df_master['datetime'] )
df_viirs['datetime'] = pd.to_datetime( df_viirs['datetime'] )
If we check the types again
df_viirs.dtypes
we get
datetime datetime64[ns]
fire_event_id int64
FRP_agg float64
dtype: object
Before merging the two dataframes, it can also be helpful to sort your data by datetime. In our example, the dates are already sorted, but this is not always the case. Failure to do so may result in the following error message:
ValueError: left keys must be sorted
To prevent this, simply sort your data using the following code:
df_master.sort_values(by=['datetime'],inplace=True)
df_viirs.sort_values(by=['datetime'],inplace=True)
Merging the two Pandas DataFrames based on the nearest DateTime
Now that our data is prepared, we can merge our two dataframes. For instance, if we aim to identify observations within a 30-minute range, we can use the following code:
pd.merge_asof(
df_master,
df_viirs, on="datetime", by="fire_event_id", tolerance=pd.Timedelta("30m")
)
This code merges the two dataframes based on the nearest datetime, with the VIIRS observation coming before the master observation:
datetime fire_event_id frp_total FRP_agg
0 2019-08-03 21:51:00 9 2191.16200 NaN
1 2019-08-03 22:20:00 9 896.10340 2185.707520
2 2019-08-04 00:20:00 9 3383.20780 NaN
3 2019-08-04 00:34:00 9 993.85570 NaN
4 2019-08-04 02:40:00 9 1438.91140 NaN
5 2019-08-06 18:51:00 9 186.78564 NaN
6 2019-08-06 20:33:00 9 415.35806 472.803345
7 2019-08-06 21:53:00 9 767.10190 689.395935
8 2019-08-07 23:15:00 9 9265.26800 NaN
9 2019-08-08 00:55:00 9 12978.82900 NaN
10 2019-08-08 02:37:00 9 6355.12600 NaN
11 2019-08-09 01:13:00 9 2730.69340 NaN
12 2019-08-09 01:20:00 9 3374.97400 NaN
13 2019-08-09 02:04:00 9 1615.59420 NaN
14 2019-08-09 02:19:00 9 1416.48070 NaN
Note that:
- We can specify the direction of the merge (forward or backward) using
direction
argument inmerge_asof()
. By default, it performs a forward merge. - If the nearest key value is not found, the corresponding row will have NaN values for all columns from the other DataFrame.
To find an observation within 30 minutes, regardless of whether it's before or after, simply use the "direction" argument with "nearest."
pd.merge_asof(
df_master,
df_viirs, on="datetime", by="fire_event_id", tolerance=pd.Timedelta("30m"), direction="nearest"
)
This code merges the two dataframes based on the nearest datetime:
datetime fire_event_id frp_total FRP_agg
0 2019-08-03 21:51:00 9 2191.16200 2185.707520
1 2019-08-03 22:20:00 9 896.10340 2185.707520
2 2019-08-04 00:20:00 9 3383.20780 NaN
3 2019-08-04 00:34:00 9 993.85570 NaN
4 2019-08-04 02:40:00 9 1438.91140 NaN
5 2019-08-06 18:51:00 9 186.78564 111.625618
6 2019-08-06 20:33:00 9 415.35806 472.803345
7 2019-08-06 21:53:00 9 767.10190 689.395935
8 2019-08-07 23:15:00 9 9265.26800 NaN
9 2019-08-08 00:55:00 9 12978.82900 NaN
10 2019-08-08 02:37:00 9 6355.12600 NaN
11 2019-08-09 01:13:00 9 2730.69340 NaN
12 2019-08-09 01:20:00 9 3374.97400 NaN
13 2019-08-09 02:04:00 9 1615.59420 NaN
14 2019-08-09 02:19:00 9 1416.48070 NaN
References
Links | Site |
---|---|
pandas.merge_asof | pandas.pydata.org |
pandas.DataFrame.dtypes | pandas.pydata.org |
pandas.to_datetime | pandas.pydata.org |
pandas.DataFrame.sort_values | pandas.pydata.org |
Williams Flats Fire Datasets
VIIRS Data
viirs_data = {'datetime': {134: '2019-08-03 08:49:00',
746: '2019-08-03 09:40:00',
138: '2019-08-03 10:29:00',
750: '2019-08-03 11:20:00',
753: '2019-08-03 19:26:00',
146: '2019-08-03 20:16:00',
755: '2019-08-03 21:06:00',
147: '2019-08-03 21:57:00',
759: '2019-08-04 09:21:00',
151: '2019-08-04 10:11:00',
763: '2019-08-04 11:01:00',
156: '2019-08-04 19:57:00',
769: '2019-08-04 20:47:00',
772: '2019-08-04 20:48:00',
158: '2019-08-04 21:37:00',
774: '2019-08-05 09:02:00',
163: '2019-08-05 09:52:00',
776: '2019-08-05 10:42:00',
165: '2019-08-05 19:38:00',
778: '2019-08-05 20:29:00',
167: '2019-08-05 21:18:00',
780: '2019-08-06 08:43:00',
169: '2019-08-06 09:33:00',
784: '2019-08-06 10:23:00',
787: '2019-08-06 10:24:00',
173: '2019-08-06 11:13:00',
175: '2019-08-06 19:19:00',
791: '2019-08-06 20:10:00',
179: '2019-08-06 20:59:00',
794: '2019-08-06 21:51:00',
183: '2019-08-07 09:14:00',
798: '2019-08-07 10:05:00',
189: '2019-08-07 10:54:00',
804: '2019-08-07 19:52:00',
197: '2019-08-07 20:40:00',
809: '2019-08-07 21:31:00',
203: '2019-08-08 08:55:00',
815: '2019-08-08 09:47:00',
207: '2019-08-08 10:35:00',
819: '2019-08-08 11:26:00',
821: '2019-08-08 19:33:00',
211: '2019-08-08 20:21:00',
825: '2019-08-08 21:12:00',
215: '2019-08-09 08:35:00',
829: '2019-08-09 09:28:00',
216: '2019-08-09 10:16:00',
830: '2019-08-09 11:07:00',
831: '2019-08-09 19:14:00',
217: '2019-08-09 20:03:00',
832: '2019-08-09 20:53:00',
218: '2019-08-09 21:43:00',
833: '2019-08-10 09:09:00',
219: '2019-08-10 09:57:00',
220: '2019-08-10 09:58:00',
834: '2019-08-10 10:48:00',
221: '2019-08-10 19:45:00',
835: '2019-08-10 20:34:00',
836: '2019-08-10 20:37:00',
837: '2019-08-10 20:39:00',
222: '2019-08-10 21:24:00',
983: '2019-08-23 06:47:00',
342: '2019-08-23 07:36:00',
990: '2019-08-23 08:28:00',
998: '2019-08-23 18:05:00',
350: '2019-08-23 18:55:00',
1006: '2019-08-23 19:46:00',
358: '2019-08-24 07:17:00',
1014: '2019-08-24 08:09:00',
370: '2019-08-24 18:36:00',
1022: '2019-08-24 19:27:00',
378: '2019-08-25 06:58:00',
1030: '2019-08-25 07:50:00',
380: '2019-08-25 08:38:00',
382: '2019-08-25 18:17:00',
1032: '2019-08-25 19:08:00',
384: '2019-08-25 19:58:00',
1034: '2019-08-26 07:31:00',
387: '2019-08-26 08:21:00',
388: '2019-08-26 17:58:00',
1036: '2019-08-26 18:49:00',
390: '2019-08-26 19:39:00',
1038: '2019-08-27 07:12:00',
392: '2019-08-27 08:02:00',
1041: '2019-08-27 08:53:00',
1042: '2019-08-27 18:30:00',
394: '2019-08-27 19:20:00',
1044: '2019-08-28 06:53:00',
396: '2019-08-28 07:43:00',
1046: '2019-08-28 08:34:00',
1048: '2019-08-28 18:11:00',
398: '2019-08-28 19:01:00',
1050: '2019-08-28 19:52:00',
407: '2019-08-29 07:24:00',
1059: '2019-08-29 08:15:00',
416: '2019-08-29 18:42:00',
1061: '2019-08-29 19:34:00',
433: '2019-08-30 07:05:00',
1079: '2019-08-30 07:56:00',
448: '2019-08-30 08:44:00',
460: '2019-08-30 18:23:00',
1100: '2019-08-30 19:15:00',
473: '2019-08-30 20:04:00',
485: '2019-08-31 06:46:00',
1119: '2019-08-31 07:37:00',
501: '2019-08-31 08:26:00',
520: '2019-08-31 18:04:00',
1142: '2019-08-31 18:56:00',
543: '2019-08-31 19:45:00',
1161: '2019-09-01 07:18:00',
560: '2019-09-01 08:08:00',
1171: '2019-09-01 08:58:00',
1179: '2019-09-01 18:37:00',
566: '2019-09-01 19:27:00'},
'fire_event_id': {134: 9,
746: 9,
138: 9,
750: 9,
753: 9,
146: 9,
755: 9,
147: 9,
759: 9,
151: 9,
763: 9,
156: 9,
769: 9,
772: 9,
158: 9,
774: 9,
163: 9,
776: 9,
165: 9,
778: 9,
167: 9,
780: 9,
169: 9,
784: 9,
787: 9,
173: 9,
175: 9,
791: 9,
179: 9,
794: 9,
183: 9,
798: 9,
189: 9,
804: 9,
197: 9,
809: 9,
203: 9,
815: 9,
207: 9,
819: 9,
821: 9,
211: 9,
825: 9,
215: 9,
829: 9,
216: 9,
830: 9,
831: 9,
217: 9,
832: 9,
218: 9,
833: 9,
219: 9,
220: 9,
834: 9,
221: 9,
835: 9,
836: 9,
837: 9,
222: 9,
983: 1,
342: 1,
990: 1,
998: 1,
350: 1,
1006: 1,
358: 1,
1014: 1,
370: 1,
1022: 1,
378: 1,
1030: 1,
380: 1,
382: 1,
1032: 1,
384: 1,
1034: 1,
387: 1,
388: 1,
1036: 1,
390: 1,
1038: 1,
392: 1,
1041: 1,
1042: 1,
394: 1,
1044: 1,
396: 1,
1046: 1,
1048: 1,
398: 1,
1050: 1,
407: 1,
1059: 1,
416: 1,
1061: 1,
433: 1,
1079: 1,
448: 1,
460: 1,
1100: 1,
473: 1,
485: 1,
1119: 1,
501: 1,
520: 1,
1142: 1,
543: 1,
1161: 1,
560: 1,
1171: 1,
1179: 1,
566: 1},
'FRP_agg': {134: 667.6372680664062,
746: 647.29248046875,
138: 567.4707641601562,
750: 849.0169677734375,
753: 657.1494750976562,
146: 1278.6209716796875,
755: 1419.724365234375,
147: 2185.70751953125,
759: 577.462890625,
151: 566.60205078125,
763: 223.16781616210938,
156: 809.7249145507812,
769: 0.0,
772: 1269.47705078125,
158: 3092.656494140625,
774: 266.54608154296875,
163: 254.5827178955078,
776: 190.95777893066406,
165: 814.1732788085938,
778: 4077.22509765625,
167: 7335.8505859375,
780: 120.2669677734375,
169: 321.4586181640625,
784: 223.98211669921875,
787: 5.1728901863098145,
173: 126.02513885498047,
175: 111.62561798095703,
791: 472.8033447265625,
179: 655.9718017578125,
794: 689.3959350585938,
183: 1468.4600830078125,
798: 1195.763671875,
189: 408.263916015625,
804: 1739.2486572265625,
197: 3219.11474609375,
809: 6170.54150390625,
203: 4521.958984375,
815: 4032.3623046875,
207: 2357.6123046875,
819: 675.562744140625,
821: 1633.270751953125,
211: 1258.2154541015625,
825: 1986.114501953125,
215: 21.015361785888672,
829: 150.6968231201172,
216: 0.0,
830: 0.0,
831: 0.0,
217: 0.0,
832: 0.0,
218: 0.0,
833: 0.0,
219: 18.392784118652344,
220: 0.0,
834: 0.0,
221: 0.0,
835: 0.0,
836: 0.0,
837: 0.0,
222: 39.869178771972656,
983: 0.0,
342: 0.0,
990: 0.0,
998: 0.0,
350: 55.66621398925781,
1006: 36.34450149536133,
358: 0.5279630422592163,
1014: 0.0,
370: 0.0,
1022: 0.0,
378: 0.0,
1030: 0.0,
380: 0.0,
382: 0.0,
1032: 0.0,
384: 0.0,
1034: 0.0,
387: 0.0,
388: 0.0,
1036: 0.0,
390: 0.0,
1038: 0.0,
392: 0.0,
1041: 0.0,
1042: 0.0,
394: 0.0,
1044: 0.0,
396: 0.0,
1046: 0.0,
1048: 0.0,
398: 2.0939106941223145,
1050: 0.0,
407: 0.0,
1059: 0.0,
416: 0.0,
1061: 0.0,
433: 0.0,
1079: 0.0,
448: 0.0,
460: 10.144418716430664,
1100: 117.81967163085938,
473: 81.00775909423828,
485: 0.0,
1119: 0.0,
501: 0.0,
520: 270.2600402832031,
1142: 116.3390121459961,
543: 226.1159210205078,
1161: 0.0,
560: 0.0,
1171: 0.0,
1179: 0.0,
566: 0.0}}
MASTER Data
master_data = {'datetime': {43: '2019-08-03 21:51:00',
44: '2019-08-03 22:20:00',
46: '2019-08-04 00:20:00',
47: '2019-08-04 00:34:00',
48: '2019-08-04 02:40:00',
49: '2019-08-06 18:51:00',
51: '2019-08-06 20:33:00',
52: '2019-08-06 21:53:00',
58: '2019-08-07 23:15:00',
59: '2019-08-08 00:55:00',
60: '2019-08-08 02:37:00',
61: '2019-08-09 01:13:00',
62: '2019-08-09 01:20:00',
63: '2019-08-09 02:04:00',
64: '2019-08-09 02:19:00'},
'fire_event_id': {43: 9,
44: 9,
46: 9,
47: 9,
48: 9,
49: 9,
51: 9,
52: 9,
58: 9,
59: 9,
60: 9,
61: 9,
62: 9,
63: 9,
64: 9},
'frp_total': {43: 2191.162,
44: 896.1034,
46: 3383.2078,
47: 993.8557,
48: 1438.9114,
49: 186.78564,
51: 415.35806,
52: 767.1019,
58: 9265.268,
59: 12978.829,
60: 6355.126,
61: 2730.6934,
62: 3374.974,
63: 1615.5942,
64: 1416.4807}}