patternpythonCriticalCanonical
Pandas Merging 101
Viewed 0 times
pandasmerging101
Problem
- How can I perform a (
INNER| (LEFT|RIGHT|FULL)OUTER)JOINwith pandas?
- How do I add NaNs for missing rows after a merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
- Cross join with pandas
merge?join?concat?update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
-
Merging basics - basic types of joins (read this first)
-
Index-based joins
-
Generalizing to multiple DataFrames
-
Cross join
Solution
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
-
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
What this post (and other posts by me on this thread) will not go through:
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use
Setup & Basics
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
To perform an INNER JOIN, call
This returns only rows from
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying
Carefully note the placement of NaNs here. If you specify
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify
Here, keys from
Finally, for the FULL OUTER JOIN, given by
specify
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from
Where,
And similarly, for a RIGHT-Excluding JOIN,
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
You can do this in similar fashion—
```
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', axis=1))
key value_x value_y
In particular, here's what this post will go through:
-
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- merging with multiple columns
- avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use
merge!Setup & Basics
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
- blue indicates rows that are present in the merge result
- red indicates rows that are excluded from the result (i.e., removed)
- green indicates missing values that are replaced with
NaNs in the result
To perform an INNER JOIN, call
merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278This returns only rows from
left and right which share a common key (in this example, "B" and "D).A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying
how='left'.left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278Carefully note the placement of NaNs here. If you specify
how='left', then only keys from left are used, and missing data from right is replaced by NaN.And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify
how='right':left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357Here, keys from
right are used, and missing data from left is replaced by NaN.Finally, for the FULL OUTER JOIN, given by
specify
how='outer'.left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from
left only (excluding everything from the right),(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', axis=1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaNWhere,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 bothAnd similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', axis=1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
You can do this in similar fashion—
```
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', axis=1))
key value_x value_y
Code Snippets
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357Context
Stack Overflow Q#53645882, score: 1303
Revisions (0)
No revisions yet.