How to merge multiple Pandas DataFrames based on multiple columns ?

Published: February 21, 2023

Tags: Python; Pandas; Dataframe;

DMCA.com Protection Status

The most common way to carry out a merge on multiple columns is by using the “merge” function in Python. This function allows you to specify the columns that should be used for the merge, as well as any additional parameters that may be required. To use this function, you must provide the DataFrames that need to be merged and a list of the columns used for the merge. Once this is done, the “merge” function will combine the DataFrames based on those columns. Examples:

Case study

Let's look at a real example. We have multiple DataFrames that provide two columns detailing the respective longitude and latitude, as well as a column with FRP (Fire Radiative Power) measurements for each day. We want to merge all these DataFrames in order to analyze the variation of FRP over time:

Let's generate 3 DataFrame using synthetic data:

import pandas as pd
import numpy as np

d = {'longitude':[-120,-80,-20,40,60], 
     'latitude':[-10,10,20,40,60],
     'FRP_2023_02_01':[random.uniform(0,1) * (100) for i in range(5)]}

df1 = pd.DataFrame(data=d)

print(df1)

d = {'longitude':[-120,-80,-20,40,60], 
     'latitude':[-10,10,20,-40,-10],
     'FRP_2023_02_02':[random.uniform(0,1) * (100) for i in range(5)]}

df2 = pd.DataFrame(data=d)

print(df2)

d = {'longitude':[-20,-80,-20,40,60], 
     'latitude':[-10,10,20,-40,-10],
     'FRP_2023_02_03':[random.uniform(0,1) * (100) for i in range(5)]}

df3 = pd.DataFrame(data=d)

print(df3)

The code displayed above will generate for example

DataFrame df1:

   longitude  latitude  FRP_2023_02_01
0       -120       -10       21.685477
1        -80        10       51.935383
2        -20        20       91.883848
3         40        40       80.616483
4         60        60        3.592461

DataFrame df2:

   longitude  latitude  FRP_2023_02_02
0       -120       -10        6.559510
1        -80        10       64.989729
2        -20        20       77.417787
3         40       -40       67.909856
4         60       -10       14.007092

DataFrame df3:

   longitude  latitude  FRP_2023_02_03
0        -20       -10       21.228505
1        -80        10       45.973927
2        -20        20       42.744729
3         40       -40        3.814573
4         60       -10       34.757743

Merge two DataFrames based on two columns

To merge for example Dataframe df1 and df2 based on longitude and latitude columns:

pd.merge(df1,df2, on=['longitude','latitude'], how='outer')

output

   longitude  latitude  FRP_2023_02_01  FRP_2023_02_02
0       -120       -10       21.685477        6.559510
1        -80        10       51.935383       64.989729
2        -20        20       91.883848       77.417787
3         40        40       80.616483             NaN
4         60        60        3.592461             NaN
5         40       -40             NaN       67.909856
6         60       -10             NaN       14.007092

Note if we use the option 'inner' instead

pd.merge(df1,df2, on=['longitude','latitude'], how='inner')

the code above will generate the following DataFrame

   longitude  latitude  FRP_2023_02_01  FRP_2023_02_02
0       -120       -10       21.685477        6.559510
1        -80        10       51.935383       64.989729
2        -20        20       91.883848       77.417787

Merge multiple DataFrames base on two columns

Now what if we want to merge not only two but multiple DataFrames. We can repeat the code above but another more efficient way to do that is to use the python reduce function from the functools module:

from functools import reduce

df_list = [df1,df2,df3]
columns = ['longitude','latitude']

df_aggregated = reduce(lambda x,y: pd.merge(x,y, on=columns, how='outer'), df_list)

The code above will then generate the following DataFrame:

   longitude  latitude  FRP_2023_02_01  FRP_2023_02_02  FRP_2023_02_03
0       -120       -10       21.685477        6.559510             NaN
1        -80        10       51.935383       64.989729       45.973927
2        -20        20       91.883848       77.417787       42.744729
3         40        40       80.616483             NaN             NaN
4         60        60        3.592461             NaN             NaN
5         40       -40             NaN       67.909856        3.814573
6         60       -10             NaN       14.007092       34.757743
7        -20       -10             NaN             NaN       21.228505

Result if we use the option 'inner':

df_aggregated = reduce(lambda x,y: pd.merge(x,y, on=columns, how='inner'), df_list)

it will generate:

   longitude  latitude  FRP_2023_02_01  FRP_2023_02_02  FRP_2023_02_03
0        -80        10       51.935383       64.989729       45.973927
1        -20        20       91.883848       77.417787       42.744729

References

Links Site
merge() pandas.pydata.org
reduce() docs.python.org