The most common way to carry out a merge on multiple columns is by using the “merge” function in Python. This function allows you to specify the columns that should be used for the merge, as well as any additional parameters that may be required. To use this function, you must provide the DataFrames that need to be merged and a list of the columns used for the merge. Once this is done, the “merge” function will combine the DataFrames based on those columns. Examples:
Case study
Let's look at a real example. We have multiple DataFrames that provide two columns detailing the respective longitude and latitude, as well as a column with FRP (Fire Radiative Power) measurements for each day. We want to merge all these DataFrames in order to analyze the variation of FRP over time:
Let's generate 3 DataFrame using synthetic data:
import pandas as pdimport numpy as npd = {'longitude':[-120,-80,-20,40,60],'latitude':[-10,10,20,40,60],'FRP_2023_02_01':[random.uniform(0,1) * (100) for i in range(5)]}df1 = pd.DataFrame(data=d)print(df1)d = {'longitude':[-120,-80,-20,40,60],'latitude':[-10,10,20,-40,-10],'FRP_2023_02_02':[random.uniform(0,1) * (100) for i in range(5)]}df2 = pd.DataFrame(data=d)print(df2)d = {'longitude':[-20,-80,-20,40,60],'latitude':[-10,10,20,-40,-10],'FRP_2023_02_03':[random.uniform(0,1) * (100) for i in range(5)]}df3 = pd.DataFrame(data=d)print(df3)
The code displayed above will generate for example
DataFrame df1:
longitude latitude FRP_2023_02_010 -120 -10 21.6854771 -80 10 51.9353832 -20 20 91.8838483 40 40 80.6164834 60 60 3.592461
DataFrame df2:
longitude latitude FRP_2023_02_020 -120 -10 6.5595101 -80 10 64.9897292 -20 20 77.4177873 40 -40 67.9098564 60 -10 14.007092
DataFrame df3:
longitude latitude FRP_2023_02_030 -20 -10 21.2285051 -80 10 45.9739272 -20 20 42.7447293 40 -40 3.8145734 60 -10 34.757743
Merge two DataFrames based on two columns
To merge for example Dataframe df1 and df2 based on longitude and latitude columns:
pd.merge(df1,df2, on=['longitude','latitude'], how='outer')
output
longitude latitude FRP_2023_02_01 FRP_2023_02_020 -120 -10 21.685477 6.5595101 -80 10 51.935383 64.9897292 -20 20 91.883848 77.4177873 40 40 80.616483 NaN4 60 60 3.592461 NaN5 40 -40 NaN 67.9098566 60 -10 NaN 14.007092
Note if we use the option 'inner' instead
pd.merge(df1,df2, on=['longitude','latitude'], how='inner')
the code above will generate the following DataFrame
longitude latitude FRP_2023_02_01 FRP_2023_02_020 -120 -10 21.685477 6.5595101 -80 10 51.935383 64.9897292 -20 20 91.883848 77.417787
Merge multiple DataFrames base on two columns
Now what if we want to merge not only two but multiple DataFrames. We can repeat the code above but another more efficient way to do that is to use the python reduce function from the functools module:
from functools import reducedf_list = [df1,df2,df3]columns = ['longitude','latitude']df_aggregated = reduce(lambda x,y: pd.merge(x,y, on=columns, how='outer'), df_list)
The code above will then generate the following DataFrame:
longitude latitude FRP_2023_02_01 FRP_2023_02_02 FRP_2023_02_030 -120 -10 21.685477 6.559510 NaN1 -80 10 51.935383 64.989729 45.9739272 -20 20 91.883848 77.417787 42.7447293 40 40 80.616483 NaN NaN4 60 60 3.592461 NaN NaN5 40 -40 NaN 67.909856 3.8145736 60 -10 NaN 14.007092 34.7577437 -20 -10 NaN NaN 21.228505
Result if we use the option 'inner':
df_aggregated = reduce(lambda x,y: pd.merge(x,y, on=columns, how='inner'), df_list)
it will generate:
longitude latitude FRP_2023_02_01 FRP_2023_02_02 FRP_2023_02_030 -80 10 51.935383 64.989729 45.9739271 -20 20 91.883848 77.417787 42.744729
