The most common way to carry out a merge on multiple columns is by using the “merge” function in Python. This function allows you to specify the columns that should be used for the merge, as well as any additional parameters that may be required. To use this function, you must provide the DataFrames that need to be merged and a list of the columns used for the merge. Once this is done, the “merge” function will combine the DataFrames based on those columns. Examples:
Case study
Let's look at a real example. We have multiple DataFrames that provide two columns detailing the respective longitude and latitude, as well as a column with FRP (Fire Radiative Power) measurements for each day. We want to merge all these DataFrames in order to analyze the variation of FRP over time:
Let's generate 3 DataFrame using synthetic data:
import pandas as pd
import numpy as np
d = {'longitude':[-120,-80,-20,40,60],
'latitude':[-10,10,20,40,60],
'FRP_2023_02_01':[random.uniform(0,1) * (100) for i in range(5)]}
df1 = pd.DataFrame(data=d)
print(df1)
d = {'longitude':[-120,-80,-20,40,60],
'latitude':[-10,10,20,-40,-10],
'FRP_2023_02_02':[random.uniform(0,1) * (100) for i in range(5)]}
df2 = pd.DataFrame(data=d)
print(df2)
d = {'longitude':[-20,-80,-20,40,60],
'latitude':[-10,10,20,-40,-10],
'FRP_2023_02_03':[random.uniform(0,1) * (100) for i in range(5)]}
df3 = pd.DataFrame(data=d)
print(df3)
The code displayed above will generate for example
DataFrame df1:
longitude latitude FRP_2023_02_01
0 -120 -10 21.685477
1 -80 10 51.935383
2 -20 20 91.883848
3 40 40 80.616483
4 60 60 3.592461
DataFrame df2:
longitude latitude FRP_2023_02_02
0 -120 -10 6.559510
1 -80 10 64.989729
2 -20 20 77.417787
3 40 -40 67.909856
4 60 -10 14.007092
DataFrame df3:
longitude latitude FRP_2023_02_03
0 -20 -10 21.228505
1 -80 10 45.973927
2 -20 20 42.744729
3 40 -40 3.814573
4 60 -10 34.757743
Merge two DataFrames based on two columns
To merge for example Dataframe df1 and df2 based on longitude and latitude columns:
pd.merge(df1,df2, on=['longitude','latitude'], how='outer')
output
longitude latitude FRP_2023_02_01 FRP_2023_02_02
0 -120 -10 21.685477 6.559510
1 -80 10 51.935383 64.989729
2 -20 20 91.883848 77.417787
3 40 40 80.616483 NaN
4 60 60 3.592461 NaN
5 40 -40 NaN 67.909856
6 60 -10 NaN 14.007092
Note if we use the option 'inner' instead
pd.merge(df1,df2, on=['longitude','latitude'], how='inner')
the code above will generate the following DataFrame
longitude latitude FRP_2023_02_01 FRP_2023_02_02
0 -120 -10 21.685477 6.559510
1 -80 10 51.935383 64.989729
2 -20 20 91.883848 77.417787
Merge multiple DataFrames base on two columns
Now what if we want to merge not only two but multiple DataFrames. We can repeat the code above but another more efficient way to do that is to use the python reduce function from the functools module:
from functools import reduce
df_list = [df1,df2,df3]
columns = ['longitude','latitude']
df_aggregated = reduce(lambda x,y: pd.merge(x,y, on=columns, how='outer'), df_list)
The code above will then generate the following DataFrame:
longitude latitude FRP_2023_02_01 FRP_2023_02_02 FRP_2023_02_03
0 -120 -10 21.685477 6.559510 NaN
1 -80 10 51.935383 64.989729 45.973927
2 -20 20 91.883848 77.417787 42.744729
3 40 40 80.616483 NaN NaN
4 60 60 3.592461 NaN NaN
5 40 -40 NaN 67.909856 3.814573
6 60 -10 NaN 14.007092 34.757743
7 -20 -10 NaN NaN 21.228505
Result if we use the option 'inner':
df_aggregated = reduce(lambda x,y: pd.merge(x,y, on=columns, how='inner'), df_list)
it will generate:
longitude latitude FRP_2023_02_01 FRP_2023_02_02 FRP_2023_02_03
0 -80 10 51.935383 64.989729 45.973927
1 -20 20 91.883848 77.417787 42.744729