Merge in Python Pandas with examples

Joins in Python Pandas

In Pandas, merge is a function that allows you to combine two or more data frames based on one or more common columns, similar to a SQL join.

Here’s an example of how to use merge in Pandas:

import pandas as pd

# create two data frames to merge
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# merge data frames based on the 'key' column
merged_df = pd.merge(df1, df2, on='key')

# print the merged data frame
print(merged_df)

Output:

key  value_x  value_y
0   B        2        5
1   D        4        6

In this example, we create two data frames df1 and df2 with a common column key. We then use the merge function to merge the two data frames based on the common column key. The resulting data frame merged_df contains only the rows where the key appears in both data frames, and includes columns from both data frames (value_x from df1 and value_y from df2).

Here are a few more examples to illustrate the different ways in which merge can be used:

  1. Merge two data frames with different column names:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

Output:

key  value1  value2
0   B       2       5
1   D       4       6
  1. Merge two data frames using multiple columns:
df1 = pd.DataFrame({'key1': ['A', 'B', 'C', 'D'], 'key2': ['W', 'X', 'Y', 'Z'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key1': ['B', 'D', 'E', 'F'], 'key2': ['X', 'Z', 'Y', 'W'], 'value': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, on=['key1', 'key2'])
print(merged_df)

Output:

key1 key2  value_x  value_y
0    D    Z        4        6
  1. Merge two data frames with different types of join (inner join, left join, right join, and outer join):
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})

# inner join
inner_join_df = pd.merge(df1, df2, on='key', how='inner')
left_join_df = pd.merge(df1, df2, on='key', how='left')
right_join_df = pd.merge(df1, df2, on='key', how='right')
outer_join_df = pd.merge(df1, df2, on='key', how='outer')

Here are a few more examples of using the merge function in Pandas:

  1. Merge two data frames with different column names and fill missing values with zeros:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, left_on='key', right_on='key', how='outer').fillna(0)
print(merged_df)

Output:

key  value1  value2
0   A     1.0     0.0
1   B     2.0     5.0
2   C     3.0     0.0
3   D     4.0     6.0
4   E     0.0     7.0
5   F     0.0     8.0

In this example, we use left_on and right_on parameters to specify the column names for merging instead of using the on parameter. We also use the outer join type to include all rows from both data frames in the merged result, and fill missing values with zeros using the fillna method.

  1. Merge two data frames with different column names and rename the columns:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, left_on='key', right_on='key').rename(columns={'value1': 'df1_value', 'value2': 'df2_value'})
print(merged_df)

Output:

key  df1_value  df2_value
0   B          2          5
1   D          4          6

In this example, we use the rename method to rename the columns of the merged result.

Merge Complete Explanation with Example ( Jupyter notebook Code )

Basic Merge Using a Dataframe Column

In [28]:

import pandas as pd
df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando"],
    "temperature": [21,14,35],
})
df1

Out[28]:

citytemperature
0new york21
1chicago14
2orlando35

In [29]:

df2 = pd.DataFrame({
    "city": ["chicago","new york","orlando"],
    "humidity": [65,68,75],
})
df2

Out[29]:

cityhumidity
0chicago65
1new york68
2orlando75

In [30]:

df3 = pd.merge(df1, df2, on="city")
df3

Out[30]:

citytemperaturehumidity
0new york2168
1chicago1465
2orlando3575

Type Of DataBase Joins

<img src=”db_joins.jpg” height=”800″, width=”800″>

In [31]:

df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando", "baltimore"],
    "temperature": [21,14,35, 38],
})
df1

Out[31]:

citytemperature
0new york21
1chicago14
2orlando35
3baltimore38

In [32]:

df2 = pd.DataFrame({
    "city": ["chicago","new york","san diego"],
    "humidity": [65,68,71],
})
df2

Out[32]:

cityhumidity
0chicago65
1new york68
2san diego71

In [33]:

df3=pd.merge(df1,df2,on="city",how="inner")
df3

Out[33]:

citytemperaturehumidity
0new york2168
1chicago1465

In [34]:

df3=pd.merge(df1,df2,on="city",how="outer")
df3

Out[34]:

citytemperaturehumidity
0new york21.068.0
1chicago14.065.0
2orlando35.0NaN
3baltimore38.0NaN
4san diegoNaN71.0

In [35]:

df3=pd.merge(df1,df2,on="city",how="left")
df3

Out[35]:

citytemperaturehumidity
0new york2168.0
1chicago1465.0
2orlando35NaN
3baltimore38NaN

In [36]:

df3=pd.merge(df1,df2,on="city",how="right")
df3

Out[36]:

citytemperaturehumidity
0new york21.068
1chicago14.065
2san diegoNaN71

indicator flag

In [37]:

df3=pd.merge(df1,df2,on="city",how="outer",indicator=True)
df3

Out[37]:

citytemperaturehumidity_merge
0new york21.068.0both
1chicago14.065.0both
2orlando35.0NaNleft_only
3baltimore38.0NaNleft_only
4san diegoNaN71.0right_only

suffixes

In [38]:

df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando", "baltimore"],
    "temperature": [21,14,35,38],
    "humidity": [65,68,71, 75]
})
df1

Out[38]:

cityhumiditytemperature
0new york6521
1chicago6814
2orlando7135
3baltimore7538

In [39]:

df2 = pd.DataFrame({
    "city": ["chicago","new york","san diego"],
    "temperature": [21,14,35],
    "humidity": [65,68,71]
})
df2

Out[39]:

cityhumiditytemperature
0chicago6521
1new york6814
2san diego7135

In [40]:

df3= pd.merge(df1,df2,on="city",how="outer", suffixes=('_first','_second'))
df3

Out[40]:

cityhumidity_firsttemperature_firsthumidity_secondtemperature_second
0new york65.021.068.014.0
1chicago68.014.065.021.0
2orlando71.035.0NaNNaN
3baltimore75.038.0NaNNaN
4san diegoNaNNaN71.035.0

join

In [58]:

df1 = pd.DataFrame({
    "city": ["new york","chicago","orlando"],
    "temperature": [21,14,35],
})
df1.set_index('city',inplace=True)
df1

Out[58]:

temperature
city
new york21
chicago14
orlando35

In [59]:

df2 = pd.DataFrame({
    "city": ["chicago","new york","orlando"],
    "humidity": [65,68,75],
})
df2.set_index('city',inplace=True)
df2

Out[59]:

humidity
city
chicago65
new york68
orlando75

In [60]:

df1.join(df2,lsuffix='_l', rsuffix='_r')

Out[60]:

temperaturehumidity
city
new york2168
chicago1465
orlando3575

Leave a Comment