How to Merge Pandas DataFrames

Data merge is a common data processing activity. Learn how Pandas provide various ways to merge our data.



How to Merge Pandas DataFrames
Image by catalyststuff on Freepik

 

Data merging between two datasets or more is typical during data processing. In this blog, we will learn how data merging with Pandas is done and various tips to improve our data merging skills. Let’s explore the data merge technique.

 

Merge Pandas DataFrame

 

First; we need to import the Pandas Python package.

import pandas as pd

 

Merging two Pandas DataFrames would require the merge method from the Pandas package. This function would merge two DataFrame by the variable or columns we intended to join. Let’s try the Pandas merging method with an example DataFrame.

# Create Population DataFrame
df1 = pd.DataFrame({
    'Country': ['America', 'Indonesia', 'France'],
    'Location': ['New York', 'Jakarta', 'Paris'],
    'Population': [731800, 575030, 183305]
})

# Create Income DataFrame
df2 = pd.DataFrame({
    'Country': ['America', 'America', 'Indonesia', 'India', 'France', 'Greece'],
    'Location': ['New York', 'Chicago', 'Jakarta', 'Mumbai', 'Paris', 'Yunani'],
    'Income': [1000, 1500, 1400, 1100, 900, 1200]
    
})

# Merge Dataframe
merged_df = pd.merge(df1, df2, on='Country')
merged_df

 

How to Merge Pandas DataFrames
Image by Author

 

In the example above, we create two different DataFrame with slightly different columns, and we merge them on the ‘Country’ column. The result is the rows from both DataFrame with similar values were merged. With one line, we manage to merge two different DataFrame. 

 

Applying Optional Parameters 

 

Pandas .merge method has various optional parameters we could take advantage of. Let’s take a look at some of the useful ones.

 

Renaming merged columns with the same name

 

In our example above, we can see a column called ‘Location’ which contains suffixes _x and _y after the merge. If we want to change the column name when the merge happens, we can do that with the following code.

merged_df = pd.merge(df1, df2, on='Country', suffixes = ('_Population', '_Income'))
merged_df

 

How to Merge Pandas DataFrames
Image by Author

 

In this code, we pass the suffixes parameter with tuple contain two values; the first and second DataFrame name. In my example, we named the first DataFrame Population and the second Income.

 

Merge based on different column names

 

What if we have two DataFrame with two different column names that refer to the same definition? We can still merge them, but we need to specify which DataFrame and column we want to merge. 

df2 = pd.DataFrame({
    'Index': ['America', 'America', 'Indonesia', 'India', 'France', 'Greece'],
    'Location': ['New York', 'Chicago', 'Jakarta', 'Mumbai', 'Paris', 'Yunani'],
    'Income': [1000, 1500, 1400, 1100, 900, 1200]
    
})

merged_df = pd.merge(df1, df2, left_on='Country', right_on = 'Index')
merged_df

 

How to Merge Pandas DataFrames
Image by Author

 

In the example above, we change the second DataFrame ‘Country’ column as ‘Index’ then we merge the dataset by specify the column name on each DataFrame. Left_on parameter is for the first DataFrame and the right_on for the second DataFrame.

 

Change the merge type

 

There are five different types merged in the Pandas merge method. By default, the merge is an Inner merge which only includes rows with matching values in both columns.  However, we can change the merge type by passing the values on the how parameter:

  1. Left

Left merge uses only the values from the first DataFrame.

merged_df = pd.merge(df1, df2, on='Country', how = 'left')
merged_df

 

How to Merge Pandas DataFrames
Image by Author

 

  1. Right

Right merge uses only the values from the second DataFrame.

merged_df = pd.merge(df1, df2, on='Country', how = 'right')
merged_df

 

How to Merge Pandas DataFrames
Image by Author

 

  1. Outer

Include all rows from both DataFrame by using the union of both DataFrame keys.

How to Merge Pandas DataFrames
Image by Author

 

  1. Cross

Create a cartesian product from both DataFrame

merged_df = pd.merge(df1, df2, how = 'cross')
merged_df

 

How to Merge Pandas DataFrames
Image by Author

 

Merge by two or more different columns

 

It is possible with Pandas merge to join two DataFrame with two or more different columns. We need to specify within the ‘on’ parameter by passing the list of columns we want to merge.

merged_df = pd.merge(df1, df2, on = ['Country', 'Location'])
merged_df

 

How to Merge Pandas DataFrames
Image by Author

 

Conclusion

 

During data processing, it’s a common activity to merge two different DataFrame. To do that, we can use the Pandas method called merge. There are various optional parameters we can access within the Pandas merge to perform specific tasks, including changing the merged column name, merging DataFrame based on the different column name, changing the merge type, and merging by two other columns or more.

 
 
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and Data tips via social media and writing media.