Categories: pandasPython

Merging and Joining in Pandas

Pandas provide various functionalities for combining separate datasets. In this article, we will look at methods for merging, joining, and concatenating datasets.

Merge Data

For merging data, we use merge() operation. It combines data on common columns or indices. Both the initial datasets need to have a common key column on which the DataFrames would be joined. Let us look at the following examples of merging the DataFrames:

import pandas as pd

record1 = [[‘S1′,’John’,14],[‘S2′,’Maria’,12],[‘S3′,’Tom’,13],[‘S4′,’Adam’,15]]
df1 = pd.DataFrame(record1,columns=[‘S_Id’,’Name’,’Age’])
print(df1)
print()
record2 = [[‘S1’,82.5],[‘S2’,90.0],[‘S3’,77.0],[‘S4’,87.0]]
df2 = pd.DataFrame(record2,columns=[‘S_Id’,’Marks’])
print(df2)

The two DataFrames are:

S_Id Name Age
0 S1 John 14
1 S2 Maria 12
2 S3 Tom 13
3 S4 Adam 15

S_Id Marks
0 S1 82.5
1 S2 90.0
2 S3 77.0
3 S4 87.0

Now to merge these two DataFrames on the common key “S_Id”, we do:

df = pd.merge(df1, df2, on=’S_Id’)

The following merged DataFrame is obtained:

S_Id Name Age Marks
0 S1 John 14 82.5
1 S2 Maria 12 90.0
2 S3 Tom 13 77.0
3 S4 Adam 15 87.0

We can also merge DataFrames on multiple keys. For example, consider the following DataFrames:

record1 = [[‘S1′,’John’,14],[‘S2′,’Maria’,12],[‘S3′,’Tom’,13],[‘S4′,’Adam’,15]]
df1 = pd.DataFrame(record1,columns=[‘S_Id’,’Name’,’Age’])
print(df1)

record2 = [[‘S1’,14,82.5],[‘S2’,13,90.0],[‘S3’,14,77.0],[‘S4’,15,87.0]]
df2 = pd.DataFrame(record2,columns=[‘S_Id’,’Age’,’Marks’])
print(df2)

The DataFrames are:

S_Id Name Age
0 S1 John 14
1 S2 Maria 12
2 S3 Tom 13
3 S4 Adam 15

S_Id Age Marks
0 S1 14 82.5
1 S2 13 90.0
2 S3 14 77.0
3 S4 15 87.0

They have two columns common, i.e., S_Id and Age, but not all the combinations are common. Let us now merge these two DataFrames:

df = pd.merge(df1, df2, on=[‘S_Id’, ‘Age’])

We get:

S_Id Name Age Marks
0 S1 John 14 82.5
1 S4 Adam 15 87.0

Notice that only the S_Id-Age records which are common for both DataFrames have been merged.

The above is an example of Inner Join. Using Pandas, we can also merge datasets using other different types of Joins, i.e., Left Join, Right Join, and Outer Join along with the Inner Join.

Let us visually understand the different forms of Joins:

The first circle, i.e., (a+b) refers to the keys of the 1st DataFrame (df1). The second circle, i.e., (b+c) refers to those of 2nd DataFrame (df2).

The various joins are:

  • Left: (a+b) Use keys from left DataFrame only (i.e., only df1)
  • Right: (b+c) Use keys from right DataFrame only (i.e., only df2)
  • Inner: (b) Use the intersection of keys of both DataFrame only
  • Outer: (a+b+c) Use the union of keys of both DataFrame only

So, if we want the above example to be an Outer join, then modify the code as:

df = pd.merge(df1, df2, how=’outer’, on=[‘S_Id’, ‘Age’])

We get:

S_Id Name Age Marks
0 S1 John 14 82.5
1 S2 Maria 12 NaN
2 S3 Tom 13 NaN
3 S4 Adam 15 87.0
4 S2 NaN 13 90.0
5 S3 NaN 14 77.0

Notice that the key here is the S_Id-Age column combination (i.e., a union of the two) and those values are unique. The missing cells are filled with NaN, as apparent above.

Join Data

Similar to merge, we can also join one DataFrame to other DataFrame which can be differently indexed. The join() function is used for this purpose. By default, this function will attempt to left join the two DataFrames. Let’s look at an example to join the following two DataFrames:

import pandas as pd

record1 = [[‘John’,14],[‘Maria’,12],[‘Tom’,13],[‘Adam’,15]]
df1 = pd.DataFrame(record1,columns=[‘Name’,’Age’])
print(df1)
record2 = [[‘M’,82.5],[‘F’,90.0],[‘M’,77.0],[‘M’,87.0]]
df2 = pd.DataFrame(record2,columns=[‘Gender’,’Marks’])
print(df2)

Following are the two DataFrames:

Name Age
0 John 14
1 Maria 12
2 Tom 13
3 Adam 15

Gender Marks
0 M 82.5
1 F 90.0
2 M 77.0
3 M 87.0

Now, to join df1 with df2, we do:

df = df1.join(df2)

The following joined DataFrame is obtained:

Name Age Gender Marks
0 John 14 M 82.5
1 Maria 12 F 90.0
2 Tom 13 M 77.0
3 Adam 15 M 87.0

We can also change the type of join using the “how = <join type>” similar to that in merge() operation.

Summary

This article focused on Merging and Joining data in Pandas. The next article will focus on concatenating dataframes.

Pallavi Pandey

Recent Posts

MapReduce Algorithm

In this tutorial, we will focus on MapReduce Algorithm, its working, example, Word Count Problem,…

9 months ago

Linear Programming using Pyomo

Learn how to use Pyomo Packare to solve linear programming problems. In recent years, with…

1 year ago

Networking and Professional Development for Machine Learning Careers in the USA

In today's rapidly evolving technological landscape, machine learning has emerged as a transformative discipline, revolutionizing…

1 year ago

Predicting Employee Churn in Python

Analyze employee churn, Why employees are leaving the company, and How to predict, who will…

2 years ago

Airflow Operators

Airflow operators are core components of any workflow defined in airflow. The operator represents a…

2 years ago

MLOps Tutorial

Machine Learning Operations (MLOps) is a multi-disciplinary field that combines machine learning and software development…

2 years ago