In Pandas, merge
is a function that allows you to combine two or more data frames based on one or more common columns, similar to a SQL join.
Here’s an example of how to use merge
in Pandas:
import pandas as pd
# create two data frames to merge
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})
# merge data frames based on the 'key' column
merged_df = pd.merge(df1, df2, on='key')
# print the merged data frame
print(merged_df)
Output:
key value_x value_y
0 B 2 5
1 D 4 6
In this example, we create two data frames df1
and df2
with a common column key
. We then use the merge
function to merge the two data frames based on the common column key
. The resulting data frame merged_df
contains only the rows where the key appears in both data frames, and includes columns from both data frames (value_x
from df1
and value_y
from df2
).
Here are a few more examples to illustrate the different ways in which merge
can be used:
- Merge two data frames with different column names:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)
Output:
key value1 value2
0 B 2 5
1 D 4 6
- Merge two data frames using multiple columns:
df1 = pd.DataFrame({'key1': ['A', 'B', 'C', 'D'], 'key2': ['W', 'X', 'Y', 'Z'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key1': ['B', 'D', 'E', 'F'], 'key2': ['X', 'Z', 'Y', 'W'], 'value': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, on=['key1', 'key2'])
print(merged_df)
Output:
key1 key2 value_x value_y
0 D Z 4 6
- Merge two data frames with different types of join (inner join, left join, right join, and outer join):
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]}) df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]}) # inner join inner_join_df = pd.merge(df1, df2, on='key', how='inner')
left_join_df = pd.merge(df1, df2, on='key', how='left')
right_join_df = pd.merge(df1, df2, on='key', how='right')
outer_join_df = pd.merge(df1, df2, on='key', how='outer')
Here are a few more examples of using the merge
function in Pandas:
- Merge two data frames with different column names and fill missing values with zeros:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, left_on='key', right_on='key', how='outer').fillna(0)
print(merged_df)
Output:
key value1 value2
0 A 1.0 0.0
1 B 2.0 5.0
2 C 3.0 0.0
3 D 4.0 6.0
4 E 0.0 7.0
5 F 0.0 8.0
In this example, we use left_on
and right_on
parameters to specify the column names for merging instead of using the on
parameter. We also use the outer
join type to include all rows from both data frames in the merged result, and fill missing values with zeros using the fillna
method.
- Merge two data frames with different column names and rename the columns:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, left_on='key', right_on='key').rename(columns={'value1': 'df1_value', 'value2': 'df2_value'})
print(merged_df)
Output:
key df1_value df2_value
0 B 2 5
1 D 4 6
In this example, we use the rename
method to rename the columns of the merged result.
Merge Complete Explanation with Example ( Jupyter notebook Code )
Basic Merge Using a Dataframe Column
In [28]:
import pandas as pd df1 = pd.DataFrame({ "city": ["new york","chicago","orlando"], "temperature": [21,14,35], }) df1
Out[28]:
city | temperature | |
---|---|---|
0 | new york | 21 |
1 | chicago | 14 |
2 | orlando | 35 |
In [29]:
df2 = pd.DataFrame({ "city": ["chicago","new york","orlando"], "humidity": [65,68,75], }) df2
Out[29]:
city | humidity | |
---|---|---|
0 | chicago | 65 |
1 | new york | 68 |
2 | orlando | 75 |
In [30]:
df3 = pd.merge(df1, df2, on="city") df3
Out[30]:
city | temperature | humidity | |
---|---|---|---|
0 | new york | 21 | 68 |
1 | chicago | 14 | 65 |
2 | orlando | 35 | 75 |
Type Of DataBase Joins
<img src=”db_joins.jpg” height=”800″, width=”800″>
In [31]:
df1 = pd.DataFrame({ "city": ["new york","chicago","orlando", "baltimore"], "temperature": [21,14,35, 38], }) df1
Out[31]:
city | temperature | |
---|---|---|
0 | new york | 21 |
1 | chicago | 14 |
2 | orlando | 35 |
3 | baltimore | 38 |
In [32]:
df2 = pd.DataFrame({ "city": ["chicago","new york","san diego"], "humidity": [65,68,71], }) df2
Out[32]:
city | humidity | |
---|---|---|
0 | chicago | 65 |
1 | new york | 68 |
2 | san diego | 71 |
In [33]:
df3=pd.merge(df1,df2,on="city",how="inner") df3
Out[33]:
city | temperature | humidity | |
---|---|---|---|
0 | new york | 21 | 68 |
1 | chicago | 14 | 65 |
In [34]:
df3=pd.merge(df1,df2,on="city",how="outer") df3
Out[34]:
city | temperature | humidity | |
---|---|---|---|
0 | new york | 21.0 | 68.0 |
1 | chicago | 14.0 | 65.0 |
2 | orlando | 35.0 | NaN |
3 | baltimore | 38.0 | NaN |
4 | san diego | NaN | 71.0 |
In [35]:
df3=pd.merge(df1,df2,on="city",how="left") df3
Out[35]:
city | temperature | humidity | |
---|---|---|---|
0 | new york | 21 | 68.0 |
1 | chicago | 14 | 65.0 |
2 | orlando | 35 | NaN |
3 | baltimore | 38 | NaN |
In [36]:
df3=pd.merge(df1,df2,on="city",how="right") df3
Out[36]:
city | temperature | humidity | |
---|---|---|---|
0 | new york | 21.0 | 68 |
1 | chicago | 14.0 | 65 |
2 | san diego | NaN | 71 |
indicator flag
In [37]:
df3=pd.merge(df1,df2,on="city",how="outer",indicator=True) df3
Out[37]:
city | temperature | humidity | _merge | |
---|---|---|---|---|
0 | new york | 21.0 | 68.0 | both |
1 | chicago | 14.0 | 65.0 | both |
2 | orlando | 35.0 | NaN | left_only |
3 | baltimore | 38.0 | NaN | left_only |
4 | san diego | NaN | 71.0 | right_only |
suffixes
In [38]:
df1 = pd.DataFrame({ "city": ["new york","chicago","orlando", "baltimore"], "temperature": [21,14,35,38], "humidity": [65,68,71, 75] }) df1
Out[38]:
city | humidity | temperature | |
---|---|---|---|
0 | new york | 65 | 21 |
1 | chicago | 68 | 14 |
2 | orlando | 71 | 35 |
3 | baltimore | 75 | 38 |
In [39]:
df2 = pd.DataFrame({ "city": ["chicago","new york","san diego"], "temperature": [21,14,35], "humidity": [65,68,71] }) df2
Out[39]:
city | humidity | temperature | |
---|---|---|---|
0 | chicago | 65 | 21 |
1 | new york | 68 | 14 |
2 | san diego | 71 | 35 |
In [40]:
df3= pd.merge(df1,df2,on="city",how="outer", suffixes=('_first','_second')) df3
Out[40]:
city | humidity_first | temperature_first | humidity_second | temperature_second | |
---|---|---|---|---|---|
0 | new york | 65.0 | 21.0 | 68.0 | 14.0 |
1 | chicago | 68.0 | 14.0 | 65.0 | 21.0 |
2 | orlando | 71.0 | 35.0 | NaN | NaN |
3 | baltimore | 75.0 | 38.0 | NaN | NaN |
4 | san diego | NaN | NaN | 71.0 | 35.0 |
join
In [58]:
df1 = pd.DataFrame({ "city": ["new york","chicago","orlando"], "temperature": [21,14,35], }) df1.set_index('city',inplace=True) df1
Out[58]:
temperature | |
---|---|
city | |
new york | 21 |
chicago | 14 |
orlando | 35 |
In [59]:
df2 = pd.DataFrame({ "city": ["chicago","new york","orlando"], "humidity": [65,68,75], }) df2.set_index('city',inplace=True) df2
Out[59]:
humidity | |
---|---|
city | |
chicago | 65 |
new york | 68 |
orlando | 75 |
In [60]:
df1.join(df2,lsuffix='_l', rsuffix='_r')
Out[60]:
temperature | humidity | |
---|---|---|
city | ||
new york | 21 | 68 |
chicago | 14 | 65 |
orlando | 35 | 75 |