How to use merge in python

How to use merge in python

It doesn’t matter whether you’re a data scientist, data analyst, business analyst, or data engineer.

If you’ve been using Python in your work — especially for data preprocessing/cleaning — you’d have used Pandas in some ways.

Why “Merge”?

You’d have probably encountered multiple data tables that have various bits of information that you would like to see all in one place — one dataframe in this case.

And this is where the power of merge comes in to efficiently combine multiple data tables together in a nice and orderly fashion into a single dataframe for further analysis.

“Merging” two datasets is the process of bringing two datasets together into one, and aligning the rows from each based on common attributes or columns.

The words “merge” and “join” are used relatively interchangeably in Pandas and other languages. Despite the fact that Pandas has both “merge” and “join” functions, essentially they both do the similar things.

That said, we’ll focus solely on the “merge” function used in Pandas throughout this article.

What exactly is “Merge”?

How to use merge in python

(Source)

Looking at the documentation of the merge function (pandas version = 0.24.1), it looks foreign and not easily understandable to readers (at least to me) at the first glance.

After seeking more information and explanation from some of my friends and online resources, I started understanding how this concept — merge could actually be explained in a much simpler way and began to appreciate the beauty of this merge function in Pandas.

To understand pd.merge, let’s start with a simple line of code as below. What this line of code does is to merge two dataframes — left_dfand right_df — into one based on their values with the samecolumn_name available in both dataframes. With the how='inner', this will perform inner merge to only combine values in the column_name that match.

pd.merge(left_df, right_df, on='column_name', how='inner'

Since the method how has different parameters (by default Pandas uses inner), we’ll look into different parameters (left, right, inner, outer) and their use cases.

After that we’ll explain the method on and elaborate further on how to merge two dataframes if they have different column names (TIPS: method on will not be used).

The explanation below comes largely from the great tutorial by Shane Lynn and the data used (3 CSV files) comes from the KillBiller application.

You can get the data HERE.

Let’s get started!

A Quick Look at the Data

Let’s first understand the data sets used with the following explanation on each dataframe.

  • user_usage — A first dataset containing users monthly mobile usage statistics
  • user_device — A second dataset containing details of an individual “use” of the system, with dates and device information
  • android_device — A third dataset with device and manufacturer data, which lists all Android devices and their model code

How to use merge in python

user_usage

How to use merge in python

user_device

How to use merge in python

android_device

It’s important to note here that:

  1. The column name use_id is shared between the user_usage and user_device
  2. The device column of user_device and Model column of the android_device dataframe contain common codes

1. LEFT Merge

Keep every row in the left dataframe. Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.

How to use merge in python

LEFT Merge

With the operation above, left_merge has the same size as user_usage as we keep all the rows in the left dataframe using the left parameter for the method how .

How to use merge in python

LEFT Merge (Top 5 rows)

How to use merge in python

LEFT Merge (Last 5 rows)

As expected, the column use_id has already been merged together. We also see that the empty values are replaced by NaN in the right dataframe — user_device.

2. RIGHT Merge

To perform the right merge, we just repeat the code above by simply changing the parameter of how from left to right.

How to use merge in python

RIGHT Merge

With the operation above, right_merge has the same size as user_device as we keep all the rows in the right dataframe using the right parameter for the method how.

How to use merge in python

RIGHT Merge (Top 5 rows)

How to use merge in python

RIGHT Merge (Last 5 rows)

This time, we see that the empty values are replaced by NaN in the left dataframe — user_usage.

3. INNER Merge

Pandas uses “inner” merge by default. This keeps only the common values in both the left and right dataframes for the merged data.

In our case, only the rows that contain use_id values that are common between user_usage and user_device remain in the merged data — inner_merge.

How to use merge in python

INNER Merge

Although the “inner” merge is used by Pandas by default, the parameter inner is specified above to be explicit.

With the operation above, the merged data — inner_merge has different size compared to the original left and right dataframes (user_usage & user_device) as only common values are merged.

4. OUTER Merge

Finally, we have “outer” merge.

The “outer” merge combines all the rows for left and right dataframes with NaN when there are no matched values in the rows.

How to use merge in python

OUTER Merge

Notice that the method indicator is set to True in order to indicate where each row originates from in the merge data — outer_merge.

How to use merge in python

Specified rows in outer_merge

To further illustrate how the “outer” merge works, we purposely specify certain rows of the outer_merge to understand where the rows originate from.

For the 1st and 2th rows, the rows come from both the dataframes as they have the same values of use_id to be merged.

For the 3rd and 4th rows, the rows come from the left dataframe as the right dataframe doesn’t have the common values of use_id.

For the 5th and 6th rows, the rows come from the right dataframe as the left dataframe doesn’t have the common values of use_id.

💡 Merge Dataframes with Different Column Names

So we’ve talked about how to merge data using different ways — left, right, inner, and outer.

But the method on only works for the same column name in the left and right dataframes.

Therefore, we use left_on and right_on to replace the method on as shown below.

How to use merge in python

LEFT Merge for dataframes with different columns names

Here we’ve merged user_device with android_device since they both contain common codes in their columns — device and Model respectively.

How to use merge in python

LEFT Merge for dataframes with different column names

Final Thoughts

(Source)

Thank you for reading.

Merging data from different data tables is one of the most common tasks that we as a data professional have to do.

I hope this sharing is beneficial to you in some ways.

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn. Till then, see you in the next post! 😄

About the Author

Admond Lee is currently the Co-Founder/CTO of Staqthe #1 business banking API platform for Southeast Asia.

Want to get free weekly data science and startup insights?

Join Admond’s email newsletter — Hustle Hub, where every week he shares actionable data science career tips, mistakes & learnings from building his startup — Staq.

You can connect with him on LinkedIn, Medium, Twitter, and Facebook.

How do you use merge function in Python?

Must be found in both the left and right DataFrame objects. left_on − Columns from the left DataFrame to use as keys. ... Merge Using 'how' Argument..

How do you merge rows in Python?

To merge rows within a group together in Pandas we can use the agg(~) method together with the join(~) method to concatenate the row values.

How do you merge series in Python?

concat() method you can combine/merge two or more series into a DataFrame (create DataFrame from multiple series). Besides this you can also use Series. append() , pandas..
Using pandas. concat() to Combine Two Series. ... .
Combine Two Series Using pandas. merge() ... .
Combine Two Series Using DataFrame. join() ... .
Using Series..

How do you merge two columns in Python?

By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.