Join Two Pandas Dataframe On A Common Column in Python

Pandas dataframes are an effective tool in the Python programming language for handling and manipulating structured data. Moreover, when data is distributed throughout several dataframes and they all share a common column that acts as a key or identifier, merging the dataframes becomes essential. This process allows you to produce a single dataset that offers a comprehensive picture of the data. The purpose of joining two pandas dataframes on a common column in Python is to consolidate and combine related information from different data sources. In this article, we will discuss different methods that can be used to join two pandas dataframe on a common column in python.

If you want to learn more about Python Programming, visit Python Programming Tutorials.

Methods To Join Two Pandas Dataframe

Python offers a variety of methods for integrating dataframes, providing you the option to select the strategy that most closely matches your objectives for data processing. In this section, we will discuss the following methods in detail:

  • Join two pandas dataframe using the merge method
  • Join Two Common Columns Using Concatenate Method
  • using the join method for merging dataframes

To achieve this, we can use the merge() function. The merge function joins a dataframe by using a specified row as the key and an additional column as the value. Additionally, for more complex scenarios involving dataframes with corresponding column types, the concatenate method proves to be a valuable. Another effective method is the join() method which provides further options for dataframe integration based on common columns. Lets discuss these three methods in detail along with examples.

1) Join two Pandas Dataframe using the merge method

One effective method to accomplish this task is to use the merge function. This function allows you to combine the two dataframes into one unified structure. In this way, the shared column is populated with corresponding values from both dataframes.

Syntax: Dataframe.merge(right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)

Following table gives a brief explanation of all the parameters:

ParameterDefault ValueDescription
rightIt specifies the dataframes that you want to merge.
howinnerIt determines the type of join – whether it’s an inner, outer, left, or right join.
onNoneIt’s the common column name where the two DataFrames will be joined. Both DataFrames must have this column.
left_onFalseThis is the column or multiple columns in the left DataFrame used for joining.
right_onFalseIt specifies the column used for joining from the right DataFrame. It can also be an array of multiple columns.
left_indexFalseIf set to True, the function will use the index of the left DataFrame for the join operation. This means that instead of specifying a column to join on, the index values of the left DataFrame will be used.
right_indexFalseSimilarly, if set to True, the function will use the index of the right DataFrame as the basis for the join.
sortFalseIt determines whether the values used to combine two DataFrames will be arranged in alphabetical order. If you set it to True, the combined data will be sorted. If you set it to False, the order will depend on the specific type of join you’re performing.
suffixes ‘_x’, ‘_y’It is a way to label or tag columns with similar names to differentiate them when merging dataframes.
copyTrueIf set to False, the function avoids copying data if possible.
IndicatorFalseIf you set it to True, it creates a new column in the merged DataFrame called “_merge”. This column tells you where each row comes from.
validateNoneIt ensures that the merging process meets certain conditions.

The output of a merge function is a new dataframe which consists of merged values. It does not modify the original dataframes, but create a new dataframe.

Merging Pandas DataFrames Based on Common ID Columns

We can use merge method to create a new combined DataFrame that retains data from both original DataFrames but only for the rows where the IDs match. This process helps consolidate information from different sources while ensuring that only relevant data is included in the merged result. In this example code, we’re using the pd.merge() function to perform this operation and generate a new DataFrame that includes details from both DataFrame 1 and DataFrame 2, where the IDs are the same.

import pandas as pd

# Create DataFrame 1
data1 = {'ID': [1, 2, 3],
         'Name': ['John', 'Mary', 'Alice'],
         'Age': [25, 30, 28]}

df1 = pd.DataFrame(data1)

# Create DataFrame 2
data2 = {'ID': [1, 2, 4],
         'City': ['New York', 'London', 'Berlin'],
         'Salary': [50000, 60000, 45000]}

df2 = pd.DataFrame(data2)

df3=pd.merge(df1, df2)
print("\nDataFrames:")
print(df3)

Output:

DataFrames:
   ID  Name  Age      City  Salary
0   1  John   25  New York   50000
1   2  Mary   30    London   60000

Using the ‘how’ Parameter for Customized DataFrame Merging

When merging multiple Dataframes, you can alter how the data is combined based on the relationship between the common columns by using the ‘how’ parameter. This parameter specifies the kind of join to be used, such as an inner, outer, left, or right join. Depending on which “how” value you choose, you can decide whether to include only the rows that match, include all rows from both DataFrames, or prioritize one DataFrame’s information over the other.

As discussed above in the table, the how parameter describes the type of join operation to be performed between the dataframes. Now, let’s see how changing the value of the how parameter affects the output.

Consider an example in which we have two pandas DataFrames representing student records and course enrollments. The two dataframes have a common column, “Student_ID,” which can be used for merging and analyzing the data.

import pandas as pd

data1 = {'Student_ID': [1, 2, 3, 4, 5],
         'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
         'Age': [18, 19, 20, 18, 21],
         'Grade': ['A', 'B', 'C', 'B', 'A']}

df1 = pd.DataFrame(data1)

data2 = {'Student_ID': [2, 4, 6, 1, 3],
         'Course': ['Math', 'Physics', 'History', 'Chemistry', 'Biology'],
         'Credits': [3, 4, 3, 4, 3]}

df2 = pd.DataFrame(data2)

Now, we will merge these two dataframes and see how the value of ‘how’ parameter affects the output.

Outer Join

All rows from both dataframes are included, and are aligned based on the specified common column(s).

#Outer Join
outer_merging = pd.merge(df1, df2, how="outer")

print(outer_merging)
Output:
Student_ID     Name   Age Grade     Course  Credits
0           1    Alice  18.0     A  Chemistry      4.0
1           2      Bob  19.0     B       Math      3.0
2           3  Charlie  20.0     C    Biology      3.0
3           4    David  18.0     B    Physics      4.0
4           5     Emma  21.0     A        NaN      NaN
5           6      NaN   NaN   NaN    History      3.0

If any data is missing in the common column(s), NaN values are used to fill those gaps in the merged DataFrame.

Inner Join

In inner join, only the rows with matching values in the specified common column(s) will be included in the resulting merged DataFrame.

#Inner Join
inner_merging = pd.merge(df1, df2, how="inner")

print(inner_merging)
Output:
   Student_ID     Name  Age Grade     Course  Credits
0           1    Alice   18     A  Chemistry        4
1           2      Bob   19     B       Math        3
2           3  Charlie   20     C    Biology        3
3           4    David   18     B    Physics        4

Right Join

The right join retains all rows from the right DataFrame and combines them with matching rows from the left DataFrame. If there’s no match in the left DataFrame, it fills the resulting DataFrame with NaN or specified fill values.

#Right Join
right_merging = pd.merge(df1, df2, how="right")

print(right_merging)
Output:
   Student_ID     Name   Age Grade     Course  Credits
0           2      Bob  19.0     B       Math        3
1           4    David  18.0     B    Physics        4
2           6      NaN   NaN   NaN    History        3
3           1    Alice  18.0     A  Chemistry        4
4           3  Charlie  20.0     C    Biology        3

Left Join

This retains all rows from the left DataFrame and combines them with matching rows from the right DataFrame. If there’s no match in the right DataFrame, it fills the resulting DataFrame with NaN or specified fill values.

#Left Join
left_merging = pd.merge(df1, df2, how="left")

print(left_merging)
Output:
   Student_ID     Name  Age Grade     Course  Credits
0           1    Alice   18     A  Chemistry      4.0
1           2      Bob   19     B       Math      3.0
2           3  Charlie   20     C    Biology      3.0
3           4    David   18     B    Physics      4.0
4           5     Emma   21     A        NaN      NaN

Using the ‘on’ parameter to specify column for merging dataframes

When merging two DataFrames, the ‘on’ parameter defines which column(s) will be used as the key to match and merge the two DataFrames. The data from both DataFrames are then combined by aligning rows based on the values in the selected column(s). Lets understand it through an example.

In this example, two Pandas DataFrames, ‘flower’ and ‘test’, are created. The ‘flower’ DataFrame contains information about different types of flowers and their corresponding test attributes. The ‘test’ DataFrame includes flower names and their associated clusters. The goal is to merge these two DataFrames based on the common column ‘flower’.

The pd.merge() function is used to perform the merging operation. The parameters used in the merge function are as follows:

  • flower and test: The DataFrames being merged.
  • on='flower': Specifies that the merging operation should be based on the ‘flower’ column, which is the common key between the DataFrames.
  • how='outer': Indicates that an outer join should be performed.

The resulting merged DataFrame ‘analysis’ contains information from both original DataFrames, where rows are aligned based on the ‘flower’ column. The code is given below:

import pandas as pd

# Create the first DataFrame 'flower'
flower = pd.DataFrame({
    'flower': ['Red Ginger', 'Tree Poppy', 'passion flower', 'water lily'],
    'test': ['similarities', 'accuracy', 'correctness', 'classification']
}, index=[0, 1, 2, 3])

# Create the second DataFrame 'test'
test = pd.DataFrame({
    'flower': ['Red Ginger', 'Tree Poppy', 'rose flower', 'sun flower'],
    'cluster': ['cluster_1', 'cluster_2', 'cluster_3', 'cluster_4']
}, index=[4, 5, 6, 7])

# Merge the two DataFrames using the 'flower' column as the key
analysis = pd.merge(flower, test, on='flower', how='outer')

# Display the merged DataFrame 'analysis'
print(analysis)
Output:
           flower            test    cluster
0      Red Ginger    similarities  cluster_1
1      Tree Poppy        accuracy  cluster_2
2  passion flower     correctness        NaN
3      water lily  classification        NaN
4     rose flower             NaN  cluster_3
5      sun flower             NaN  cluster_4

The main function of pd.merge() is to combine DataFrames based on common columns or keys. It’s particularly useful when you want to merge data from different sources while aligning them based on specific columns.

2) Join two common columns using the concatenate method

The pd.concat() method in pandas is used to concatenate multiple DataFrames along a specified axis. It allows you to combine DataFrames vertically (stacking) or horizontally (side-by-side) based on their indexes or columns. This method is helpful when you want to merge data from multiple sources or create a larger DataFrame by combining smaller ones.

The pd.concat() has an axis parameter which specify the axis along which the concatenation occurs (0 for rows, 1 for columns). By default, the axis parameter is set to 0. Lets see through an example how pd.concat() function works.

import pandas as pd

# Create DataFrame 1: Employee Data
data_employee = {'EmployeeID': [101, 102, 103],
                 'Name': ['John', 'Alice', 'Bob'],
                 'Position': ['Manager', 'Engineer', 'Analyst']}

df_employee = pd.DataFrame(data_employee)

# Create DataFrame 2: Employee Projects
data_projects = {
                 'Project': ['Website Redesign', 'Financial Analysis', 'Product Development'],
                 'StartYear': [2022, 2023, 2022],
                 'Status': ['Completed', 'Ongoing', 'Pending']}

df_projects = pd.DataFrame(data_projects)

# Concatenate the two DataFrames horizontally
concatenated_df = pd.concat([df_employee, df_projects], axis=1)

print(concatenated_df)
Output:
   EmployeeID   Name  Position              Project  StartYear     Status
0         101   John   Manager     Website Redesign       2022  Completed
1         102  Alice  Engineer   Financial Analysis       2023    Ongoing
2         103    Bob   Analyst  Product Development       2022    Pending

In this example, pd.concat() is used to horizontally concatenate the two DataFrames (df_employee and df_projects) along the columns. The resulting DataFrame, concatenated_df, combines employee information and project details, creating a more comprehensive dataset for analysis and reporting. The axis=1 parameter indicates that concatenation should be performed along the columns.

You can also specify additional options like ignore_index to reset the index of the resulting DataFrame or keys to create a hierarchical index. Here’s an example of how you can use the pd.concat() method to combine two DataFrames vertically (along rows):

import pandas as pd

# Create two DataFrames
data1 = {'Name': ['John', 'Mary', 'Alice'],
         'Age': [25, 30, 28]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Bob', 'Emma', 'David'],
         'Age': [22, 27, 29]}
df2 = pd.DataFrame(data2)

# Concatenate DataFrames vertically
combined_df = pd.concat([df1, df2])

print("Combined DataFrame:")
print(combined_df)
Output:
Combined DataFrame:
    Name  Age
0   John   25
1   Mary   30
2  Alice   28
0    Bob   22
1   Emma   27
2  David   29

From the above example, you can see that the indices are repeated in the concatenated DataFrame. Each row retains its original index value from the source DataFrames, resulting in repeated index values in the concatenated DataFrame. This behavior might not be desirable in cases where a consistent and unique index is preferred for the concatenated DataFrame. To address this, you can set ignore_index to True, which will reset the index and provide a new index sequence to the concatenated DataFrame, ensuring index consistency and uniqueness. Here’s the code:

# Concatenate DataFrames vertically
combined_df = pd.concat([df1, df2], ignore_index=True)

print("Combined DataFrame:")
print(combined_df)
Output:
Combined DataFrame:
    Name  Age
0   John   25
1   Mary   30
2  Alice   28
3    Bob   22
4   Emma   27
5  David   29

The concatenation method is particularly useful when you want to combine datasets from multiple sources, create comprehensive reports, or conduct analysis on diverse datasets.

3) Using Join() Method To Concatenate A Common Column In Python

The join() method allows you to combine data from different DataFrames by aligning rows with matching values in the specified column. It simplifies the process of merging data and can be especially useful when working with related datasets that share a common identifier. The following example demonstrates how you can use join() function:

import pandas as pd
col_1 = pd.read_csv("ICC MEN'S CRICKET WORLD CUP HIGH SCORES.csv")
col_2 = pd.read_csv("ICC MEN'S HIGH SCORES.csv")
print ('col_1:', '\n', col_1)
print('\n ')
print ('col_2:', '\n', col_2)
Output:
col_1: 
            Player  Runs   4s  6s  Unnamed: 4          Team      Opposition
0    Fakhar Zaman   193  155  10         NaN      Pakistan  v South Africa
1        JN Malan  177*  169  16         NaN  South Africa       v Ireland
2      JC Buttler  162*   70  14         NaN       England   v Netherlands
3  Ibrahim Zadran   162  138   4         NaN   Afghanistan     v Sri Lanka
4      Babar Azam   158  139   4         NaN      Pakistan       v England

 
col_2: 
            Player  Runs   Balls
0    Fakhar Zaman   193     155
1        JN Malan  177*     169
2      JC Buttler  162*      70
3  Ibrahim Zadran   162     138
4      Babar Azam   158     139
result = col_2.join(col_1,lsuffix="_x")
result
Output:

Conclusion

In the context of this tutorial, we have explored how to merge two pandas dataframes based on a shared column using the merge() function. This approach enables the combination of data from separate dataframes by utilizing a specified column as a key and an accompanying column as a value. For more complex scenarios involving dataframes with corresponding column types, the concat() method proves to be an essential tool. It’s worth noting that while the versatile concat() method can also combine dataframes by columns, the merge() function stands as the recommended and optimal choice for joining dataframes on a common column in Python.

Leave a Comment

Your email address will not be published. Required fields are marked *