When the names are different, use the xxx_on parameters instead of on=:
pd.merge(df1, df2, left_on= ['userid', 'column1'],
right_on= ['username', 'column1'],
how = 'left')
Answer from Zeugma on Stack Overflow
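For a quick feel of what this returns, here is a minimal sketch. The data, and the score and city columns, are made up for illustration; only the key column names follow the answer above. Note that column1, which has the same name on both sides, comes back once, while userid and username are both kept.

```python
import pandas as pd

# Hypothetical data: only the key column names come from the answer above.
df1 = pd.DataFrame({"userid": ["ann", "bob", "cy"],
                    "column1": [1, 2, 3],
                    "score": [10, 20, 30]})
df2 = pd.DataFrame({"username": ["ann", "bob", "dee"],
                    "column1": [1, 2, 4],
                    "city": ["Oslo", "Rome", "Lima"]})

# The i-th name in left_on is paired with the i-th name in right_on.
merged = pd.merge(df1, df2,
                  left_on=["userid", "column1"],
                  right_on=["username", "column1"],
                  how="left")

# Left join: every row of df1 is kept; unmatched rows get NaN on the right side.
print(merged)
```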
An alternative approach is to use join setting the index of the right hand side DataFrame to the columns ['username', 'column1']:
df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')
The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid; whereas the output of the merge maintains the two as separate columns. To illustrate, consider the following example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'pID' : [21,22,23,24,25,26], 'Values' : [435,33,45,np.nan,np.nan,12]})
## ID Values pID
## 0 1 435.0 21
## 1 2 33.0 22
## 2 3 45.0 23
## 3 4 NaN 24
## 4 5 NaN 25
## 5 6 12.0 26
df2 = pd.DataFrame({'ID' : [4,4,5], 'pid' : [24,25,25], 'Values' : [544, 545, 676]})
## ID Values pid
## 0 4 544 24
## 1 4 545 25
## 2 5 676 25
pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## ID Values_x pID Values_y pid
## 0 1 435.0 21 NaN NaN
## 1 2 33.0 22 NaN NaN
## 2 3 45.0 23 NaN NaN
## 3 4 NaN 24 544.0 24.0
## 4 5 NaN 25 676.0 25.0
## 5 6 12.0 26 NaN NaN
df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## ID Values_x pID Values_y
## 0 1 435.0 21 NaN
## 1 2 33.0 22 NaN
## 2 3 45.0 23 NaN
## 3 4 NaN 24 544.0
## 4 5 NaN 25 676.0
## 5 6 12.0 26 NaN
Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right hand side DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype for the pid column has changed to float64, which results from upcasting due to the NaNs introduced from the unmatched rows.
This aesthetic output is gained at a cost in performance, as the call to set_index on the right hand side DataFrame incurs some overhead. However, a quick and dirty profile shows that this is not too horrible, roughly 30% slower, which may be worth it:
sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz/2),np.arange(sz/2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
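If the set_index overhead matters, another option (not part of the answer above, just a sketch) is to keep merge and drop the redundant pid column afterwards. The snippet below also shows how the comparison could be run outside IPython with the standard timeit module. The frame size is reduced to keep the run short and the helper function names are made up, so the timings will not match the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

sz = 100_000  # assumption: smaller than the one million rows used above
rng = np.random.default_rng(0)
df1 = pd.DataFrame({"ID": np.arange(sz),
                    "pID": np.arange(0, 2 * sz, 2),
                    "Values": rng.random(sz)})
df2 = pd.DataFrame({"ID": np.concatenate([np.arange(sz // 2), np.arange(sz // 2)]),
                    "pid": np.arange(0, 2 * sz, 2),
                    "Values": rng.random(sz)})


def merge_then_drop():
    # merge keeps both pID and pid; drop the right-hand key afterwards
    merged = pd.merge(df1, df2, how="left",
                      left_on=["ID", "pID"], right_on=["ID", "pid"])
    return merged.drop(columns="pid")


def join_on_index():
    # join needs the right-hand keys in the index, hence set_index
    return df1.join(df2.set_index(["ID", "pid"]), how="left",
                    on=["ID", "pID"], lsuffix="_x", rsuffix="_y")


# Both approaches should end up with the same set of columns.
same_columns = set(merge_then_drop().columns) == set(join_on_index().columns)

runs = 3
t_merge = timeit.timeit(merge_then_drop, number=runs) / runs
t_join = timeit.timeit(join_on_index, number=runs) / runs
print(f"merge + drop: {t_merge:.3f} s per call")
print(f"join        : {t_join:.3f} s per call")
```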
Pandas merge multiple dataframes with different columns
Using pandas, I'm trying to merge approx 150 dataframes from a dictionary into one dataframe. Not all of the files have the same number of columns. 90% of the column names are the same. My intention is to have Null/NaN where the data is absent for a given dataframe.
I tried examples shown here: https://stackoverflow.com/questions/28097222/pandas-merge-two-dataframes-with-different-columns and here: https://www.geeksforgeeks.org/pandas-merge-two-dataframes-with-different-columns/
df = pd.concat(d.values(), axis=0, ignore_index=True)
df = pd.concat(d.values(), ignore_index=True, sort = False)
But I keep getting the error below. Any help would be appreciated.
pandas.errors.ParserError: Error tokenizing data. C error: Expected 50 fields in line 5, saw 51
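For what it's worth, that "Error tokenizing data" message is raised by the CSV parser (the pd.read_csv step), not by pd.concat, so the concat line itself is probably fine. A minimal sketch with a made-up two-entry dictionary standing in for the 150 frames shows concat filling absent columns with NaN:

```python
import pandas as pd

# Hypothetical stand-in for the dictionary of ~150 frames.
d = {
    "file_a": pd.DataFrame({"id": [1, 2], "name": ["x", "y"]}),
    "file_b": pd.DataFrame({"id": [3], "name": ["z"], "extra": [9.5]}),
}

# Rows are stacked; a column missing from a frame is filled with NaN for its rows.
combined = pd.concat(list(d.values()), ignore_index=True, sort=False)
print(combined)
```

If the frames in d were already built without errors, the ParserError most likely points at one input file that has a row with an extra delimiter.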
You can use the left_on and right_on options of pd.merge as follows:
pd.merge(frame_1, frame_2, left_on='county_ID', right_on='countyid')
Or equivalently with DataFrame.merge:
frame_1.merge(frame_2, left_on='county_ID', right_on='countyid')
I was not sure from the question whether you only wanted to keep the keys that are in the left hand DataFrame. If that is the case, then the following will do that (the calls above default to how='inner', so they only keep keys present in both frames):
pd.merge(frame_1, frame_2, how='left', left_on='county_ID', right_on='countyid')
Or
frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid')
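A small sketch of the difference, using made-up county data (the name and population columns are assumptions). Passing indicator=True adds a _merge column that shows where each row came from:

```python
import pandas as pd

# Hypothetical data: only the key column names come from the answer above.
frame_1 = pd.DataFrame({"county_ID": [1, 2, 3], "name": ["A", "B", "C"]})
frame_2 = pd.DataFrame({"countyid": [2, 3, 4], "population": [200, 300, 400]})

# Default (inner): only keys found in both frames survive.
inner = frame_1.merge(frame_2, left_on="county_ID", right_on="countyid")

# how='left': every row of frame_1 is kept; unmatched rows get NaN.
left = frame_1.merge(frame_2, how="left",
                     left_on="county_ID", right_on="countyid",
                     indicator=True)

print(inner)
print(left)
```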
You need to make county_ID the index of the right frame:
frame_2.join(frame_1.set_index(['county_ID'], verify_integrity=True),
on=['countyid'], how='left')
For your information, in pandas a left join breaks when the right frame has non-unique values in the joining column (see this bug).
So you need to verify integrity before joining, by passing verify_integrity=True.
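To see what verify_integrity=True actually does, here is a minimal sketch with a made-up frame that has a repeated county_ID. set_index raises a ValueError instead of silently building a non-unique index:

```python
import pandas as pd

# Hypothetical frame with a duplicated key (county_ID 1 appears twice).
frame_1 = pd.DataFrame({"county_ID": [1, 1, 2], "name": ["A", "A2", "B"]})

try:
    frame_1.set_index(["county_ID"], verify_integrity=True)
    duplicate_found = False
except ValueError as err:
    # pandas reports the duplicated key values in the error message
    duplicate_found = True
    print("cannot use county_ID as a unique index:", err)
```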
How about setting UserID as the index of the second data frame and then joining on that index?
pd.merge(df1, df2.set_index('UserID'), left_on='UserName', right_index=True)
# Col1 UserName Col2
# 0 a 1 d
# 1 b 2 e
# 2 c 3 f
There is no really nice way around it: merge keeps both key columns on purpose, because in the more general cases (left, right or outer joins) the two columns can carry different information. Don't try to over-engineer your merge line; be explicit, as you suggest.
Solution 1:
df2.columns = ['Col2', 'UserName']
pd.merge(df1, df2, on='UserName')
Out[67]:
Col1 UserName Col2
0 a 1 d
1 b 2 e
2 c 3 f
Solution 2:
pd.merge(df1, df2, left_on='UserName', right_on='UserID').drop('UserID', axis=1)
Out[71]:
Col1 UserName Col2
0 a 1 d
1 b 2 e
2 c 3 f
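A self-contained sketch of both solutions, with the frames reconstructed from the output shown above (so the exact input data is an assumption). It uses rename instead of overwriting df2.columns, which leaves df2 itself untouched:

```python
import pandas as pd

# Frames reconstructed from the output above (assumed, not given in the answer).
df1 = pd.DataFrame({"Col1": ["a", "b", "c"], "UserName": [1, 2, 3]})
df2 = pd.DataFrame({"Col2": ["d", "e", "f"], "UserID": [1, 2, 3]})

# Variant of solution 1: rename the key on a copy, then merge with on=.
renamed = df1.merge(df2.rename(columns={"UserID": "UserName"}), on="UserName")

# Solution 2: merge on differently named keys, then drop the redundant one.
dropped = df1.merge(df2, left_on="UserName", right_on="UserID").drop(columns="UserID")

print(renamed)
print(dropped)
```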
Try this
new_df = pd.merge(
left=A_df,
right=B_df,
how='left',
left_on=['A_c1', 'c2'],
right_on=['B_c1', 'c2'],
)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs
1. It merges according to the ordering of left_on and right_on, i.e., the i-th element of left_on will match with the i-th of right_on. In the example below, the code on the top matches A_col1 with B_col1 and A_col2 with B_col2, while the code on the bottom matches A_col1 with B_col2 and A_col2 with B_col1. Evidently, the results are different.

2. As can be seen from the above example, if the merge keys have different names, all keys will show up as their individual columns in the merged dataframe. In the example above, in the top dataframe, A_col1 and B_col1 are identical and A_col2 and B_col2 are identical. In the bottom dataframe, A_col1 and B_col2 are identical and A_col2 and B_col1 are identical. Since these are duplicate columns, they are most likely not needed. One way to avoid this problem is to make the merge keys identical from the beginning. See bullet point #3 below.

3. If left_on and right_on are the same col1 and col2, we can use on=['col1', 'col2']. In this case, no merge keys are duplicated.
df1.merge(df2, on=['col1', 'col2'])

4. You can also merge one side on column names and the other side on index. For example, in the example below, df1's columns are matched with df2's indices. If the indices are named, as in the example below, you can reference them by name, but if not, you can also use right_index=True (or left_index=True if the left dataframe is the one being merged on index).
df1.merge(df2, left_on=['A_col1', 'A_col2'], right_index=True)
# or
df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col1', 'B_col2'])

5. By using the how= parameter, you can perform LEFT JOIN (how='left'), FULL OUTER JOIN (how='outer') and RIGHT JOIN (how='right') as well. The default is INNER JOIN (how='inner'), as in the examples above.

6. If you have more than 2 dataframes to merge and the merge keys are the same across all of them, then the join method is more efficient than merge because you can pass a list of dataframes and join on indices. Note that the index names are the same across all dataframes in the example below (col1 and col2). Note that the indices don't have to have names; if the indices don't have names, then the number of index levels must match (in the case below there are 2 levels). Again, as in bullet point #1, the match occurs according to the ordering of the indices.
df1.join([df2, df3], how='inner').reset_index()
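To make the ordering rule from the first point concrete, here is a small sketch with made-up frames. The x and y columns and all the values are assumptions; only the key column names follow the answer. Swapping the order of right_on changes which rows match, and in both cases all four key columns end up in the result.

```python
import pandas as pd

# Hypothetical frames: only the key column names come from the answer above.
df1 = pd.DataFrame({"A_col1": [1, 2], "A_col2": [2, 1], "x": [10, 20]})
df2 = pd.DataFrame({"B_col1": [1, 2], "B_col2": [2, 3], "y": [100, 200]})

# A_col1 is paired with B_col1, A_col2 with B_col2.
same_order = df1.merge(df2,
                       left_on=["A_col1", "A_col2"],
                       right_on=["B_col1", "B_col2"])

# A_col1 is paired with B_col2, A_col2 with B_col1.
swapped = df1.merge(df2,
                    left_on=["A_col1", "A_col2"],
                    right_on=["B_col2", "B_col1"])

print(same_order)
print(swapped)
```

Renaming the keys on one side first and merging with on= (the third point) avoids the duplicated key columns.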