When the names are different, use the xxx_on parameters instead of on=:
pd.merge(df1, df2,
         left_on=['userid', 'column1'],
         right_on=['username', 'column1'],
         how='left')
Answer from Zeugma on Stack Overflow
An alternative approach is to use join setting the index of the right hand side DataFrame to the columns ['username', 'column1']:
df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')
The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid; whereas the output of the merge maintains the two as separate columns. To illustrate, consider the following example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'pID' : [21,22,23,24,25,26], 'Values' : [435,33,45,np.nan,np.nan,12]})
## ID Values pID
## 0 1 435.0 21
## 1 2 33.0 22
## 2 3 45.0 23
## 3 4 NaN 24
## 4 5 NaN 25
## 5 6 12.0 26
df2 = pd.DataFrame({'ID' : [4,4,5], 'pid' : [24,25,25], 'Values' : [544, 545, 676]})
## ID Values pid
## 0 4 544 24
## 1 4 545 25
## 2 5 676 25
pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## ID Values_x pID Values_y pid
## 0 1 435.0 21 NaN NaN
## 1 2 33.0 22 NaN NaN
## 2 3 45.0 23 NaN NaN
## 3 4 NaN 24 544.0 24.0
## 4 5 NaN 25 676.0 25.0
## 5 6 12.0 26 NaN NaN
df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## ID Values_x pID Values_y
## 0 1 435.0 21 NaN
## 1 2 33.0 22 NaN
## 2 3 45.0 23 NaN
## 3 4 NaN 24 544.0
## 4 5 NaN 25 676.0
## 5 6 12.0 26 NaN
Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right-hand side DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype of the pid column has changed to float64, which results from upcasting due to the NaNs introduced by the unmatched rows.
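If the extra pid column is not wanted, one option is to drop it after the merge. The following is only a sketch that reuses df1 and df2 from the example above; the names merged, trimmed and joined are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5], 'pid': [24, 25, 25], 'Values': [544, 545, 676]})

merged = pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])

# Unmatched rows hold NaN in pid, so the column is upcast to float64
print(merged['pid'].dtype)

# Dropping the right-hand key leaves the same columns that the join-based version produces
trimmed = merged.drop(columns='pid')
joined = df1.join(df2.set_index(['ID', 'pid']), on=['ID', 'pID'], how='left',
                  lsuffix='_x', rsuffix='_y')
print(sorted(trimmed.columns) == sorted(joined.columns))
```

This keeps the faster merge call and pays only for a cheap column drop, at the price of one extra step.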
This aesthetic output is gained at a cost in performance as the call to set_index on the right hand side DataFrame incurs some overhead. However, a quick and dirty profile shows that this is not too horrible, roughly 30%, which may be worth it:
sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz//2),np.arange(sz//2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
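%timeit is an IPython magic, so it only works in IPython or Jupyter. Outside of those, a rough equivalent can be put together with the standard timeit module. This is a sketch, not a careful benchmark: the size is reduced to a hypothetical 10,000 rows so it finishes quickly, and with_merge and with_join are illustrative helper names.

```python
import timeit

import numpy as np
import pandas as pd

sz = 10_000  # assumption: much smaller than the one million rows used above
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'ID': np.arange(sz),
                    'pID': np.arange(0, 2 * sz, 2),
                    'Values': rng.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz // 2), np.arange(sz // 2)]),
                    'pid': np.arange(0, 2 * sz, 2),
                    'Values': rng.random(sz)})

def with_merge():
    return pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])

def with_join():
    return df1.join(df2.set_index(['ID', 'pid']), how='left', on=['ID', 'pID'],
                    lsuffix='_x', rsuffix='_y')

# Best of 3 repeats, 5 calls each, reported as seconds per call
merge_s = min(timeit.repeat(with_merge, number=5, repeat=3)) / 5
join_s = min(timeit.repeat(with_join, number=5, repeat=3)) / 5
print(f"merge: {merge_s * 1e3:.2f} ms per call, join: {join_s * 1e3:.2f} ms per call")
```

The ratio between the two will vary with data size, key cardinality and pandas version, so treat any single number with caution.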
Try this
new_df = pd.merge(
left=A_df,
right=B_df,
how='left',
left_on=['A_c1', 'c2'],
right_on=['B_c1', 'c2'],
)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
left_on : label or list, or array-like. Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns.
right_on : label or list, or array-like. Field names to join on in right DataFrame or vector/list of vectors per left_on docs.
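The "vector" form mentioned in the docs can be handy when the key needs some cleaning before it matches. Here is a small sketch, assuming made-up frames left and right whose keys differ only in letter case; the lower-cased array is passed directly as left_on instead of a column name.

```python
import pandas as pd

# Hypothetical data: same users, but the key is spelled with different casing
left = pd.DataFrame({'userid': ['Alice', 'BOB', 'carol'], 'visits': [3, 5, 2]})
right = pd.DataFrame({'username': ['alice', 'bob', 'dave'], 'plan': ['free', 'pro', 'pro']})

# One normalised key per row of `left`; its length must equal len(left)
left_keys = left['userid'].str.lower().to_numpy()

merged = pd.merge(left, right, how='left', left_on=left_keys, right_on='username')
print(merged[['userid', 'visits', 'plan']])
```

An alternative with the same effect is to assign the cleaned key as a temporary column first and merge on that column name.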
1. It merges according to the ordering of left_on and right_on, i.e., the i-th element of left_on will match with the i-th of right_on. In the example below, the code on the top matches A_col1 with B_col1 and A_col2 with B_col2, while the code on the bottom matches A_col1 with B_col2 and A_col2 with B_col1. Evidently, the results are different.
2. As can be seen from the above example, if the merge keys have different names, all keys will show up as their individual columns in the merged dataframe. In the example above, in the top dataframe, A_col1 and B_col1 are identical and A_col2 and B_col2 are identical. In the bottom dataframe, A_col1 and B_col2 are identical and A_col2 and B_col1 are identical. Since these are duplicate columns, they are most likely not needed. One way to not have this problem from the beginning is to make the merge keys identical from the beginning. See bullet point #3 below.
3. If left_on and right_on are the same col1 and col2, we can use on=['col1', 'col2']. In this case, no merge keys are duplicated.
df1.merge(df2, on=['col1', 'col2'])
4. You can also merge one side on column names and the other side on index too. For example, in the example below, df1's columns are matched with df2's indices. If the indices are named, as in the example below, you can reference them by name but if not, you can also use right_index=True (or left_index=True if the left dataframe is the one being merged on index).
df1.merge(df2, left_on=['A_col1', 'A_col2'], right_index=True)
# or
df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col1', 'B_col2'])
5. By using the how= parameter, you can perform LEFT JOIN (how='left'), FULL OUTER JOIN (how='outer') and RIGHT JOIN (how='right') as well. The default is INNER JOIN (how='inner') as in the examples above.
6. If you have more than 2 dataframes to merge and the merge keys are the same across all of them, then the join method is more efficient than merge because you can pass a list of dataframes and join on indices. Note that the index names are the same across all dataframes in the example below (col1 and col2). Note that the indices don't have to have names; if the indices don't have names, then the number of the multi-indices must match (in the case below there are 2 multi-indices). Again, as in bullet point #1, the match occurs according to the ordering of the indices.
df1.join([df2, df3], how='inner').reset_index()
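To make the points above concrete, here is a small runnable sketch. The column names follow the text, but the values, the third frame df3 and the variable names are invented for illustration, so the output will not match the original answer's screenshots.

```python
import pandas as pd

# Hypothetical frames; only the column names come from the text above
df1 = pd.DataFrame({'A_col1': [1, 2, 3], 'A_col2': [2, 1, 3], 'A_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'B_col1': [1, 2, 3], 'B_col2': [2, 1, 4], 'B_val': ['x', 'y', 'z']})

# Keys are paired positionally, so swapping the order of right_on changes the matches
same_order = df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col1', 'B_col2'])
swapped = df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col2', 'B_col1'])

# Identical key names on both sides: on= keeps a single copy of each key column
renamed = df2.rename(columns={'B_col1': 'A_col1', 'B_col2': 'A_col2'})
no_dupes = df1.merge(renamed, on=['A_col1', 'A_col2'])

# Columns on the left matched against the index on the right
indexed = df2.set_index(['B_col1', 'B_col2'])
on_index = df1.merge(indexed, left_on=['A_col1', 'A_col2'], right_index=True)

# Several frames sharing the same index levels, joined in one call
df3 = pd.DataFrame({'A_col1': [1, 3], 'A_col2': [2, 3], 'C_val': [10, 30]})
frames = [f.set_index(['A_col1', 'A_col2']) for f in (df1, renamed, df3)]
joined = frames[0].join(frames[1:], how='inner').reset_index()

print(same_order, swapped, no_dupes, on_index, joined, sep='\n\n')
```

Note that join with a list of frames needs the non-key column names to be distinct across the frames, which is why each frame here has its own value column.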
You can use the left_on and right_on options of pd.merge as follows:
pd.merge(frame_1, frame_2, left_on='county_ID', right_on='countyid')
Or equivalently with DataFrame.merge:
frame_1.merge(frame_2, left_on='county_ID', right_on='countyid')
I was not sure from the question whether you only wanted to merge when the key is in the left-hand DataFrame. If that is the case, then the following will do that (the calls above use the default how='inner', which keeps only keys present in both frames):
pd.merge(frame_1, frame_2, how='left', left_on='county_ID', right_on='countyid')
Or
frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid')
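To see the difference between the default and how='left', here is a sketch with made-up data. Only the frame and key names come from the question; the population and area columns are hypothetical.

```python
import pandas as pd

# Hypothetical contents; county 1 exists only on the left, county 4 only on the right
frame_1 = pd.DataFrame({'county_ID': [1, 2, 3], 'population': [100, 200, 300]})
frame_2 = pd.DataFrame({'countyid': [2, 3, 4], 'area': [20.0, 30.0, 40.0]})

# Default how='inner': only keys present in both frames survive
inner = pd.merge(frame_1, frame_2, left_on='county_ID', right_on='countyid')

# how='left': every row of frame_1 is kept, unmatched rows get NaN from frame_2
left = pd.merge(frame_1, frame_2, how='left', left_on='county_ID', right_on='countyid')

print(inner)
print(left)
```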
You need to make county_ID the index of the right-hand frame:
frame_2.join(frame_1.set_index(['county_ID'], verify_integrity=True),
             on=['countyid'], how='left')
For your information, in pandas a left join breaks when the right frame has non-unique values in the joining column (see this bug), so you need to verify integrity before joining by passing verify_integrity=True to set_index.
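What verify_integrity=True actually does is make set_index raise a ValueError when the new index would contain duplicates. A sketch with invented data follows; the name column and the drop_duplicates step are assumptions about how one might resolve the duplicates, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: county_ID 1 appears twice in the frame that will be indexed
frame_1 = pd.DataFrame({'county_ID': [1, 1, 2], 'name': ['a', 'b', 'c']})
frame_2 = pd.DataFrame({'countyid': [1, 2]})

try:
    frame_2.join(frame_1.set_index(['county_ID'], verify_integrity=True),
                 on=['countyid'], how='left')
    duplicate_detected = False
except ValueError as exc:
    duplicate_detected = True
    print(f"set_index refused duplicate keys: {exc}")

# One possible fix (an assumption): keep the first row per key before indexing
unique_right = frame_1.drop_duplicates(subset='county_ID')
joined = frame_2.join(unique_right.set_index(['county_ID'], verify_integrity=True),
                      on=['countyid'], how='left')
print(joined)
```

Whether dropping duplicates is appropriate depends on the data; aggregating per key first may be the better choice.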
Using pandas, I'm trying to merge approx 150 dataframes from a dictionary into one dataframe. Not all of the files have the same number of columns. 90% of the column names are the same. My intention is to have Null/NaN where the data is absent for a given dataframe.
I tried examples shown here: https://stackoverflow.com/questions/28097222/pandas-merge-two-dataframes-with-different-columns and here: https://www.geeksforgeeks.org/pandas-merge-two-dataframes-with-different-columns/
df = pd.concat(d.values(), axis=0, ignore_index=True)
df = pd.concat(d.values(), ignore_index=True, sort = False)
But I keep getting the error below. Any help would be appreciated.
pandas.errors.ParserError: Error tokenizing data. C error: Expected 50 fields in line 5, saw 51
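For what it's worth, pd.concat by itself does fill in NaN where a frame lacks a column, as the sketch below shows with a made-up two-entry dictionary d. A ParserError of this kind is raised by the CSV parser (for example in read_csv) when a row has more fields than expected, so the failure is likely happening where the files are read rather than in the concat call.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary of ~150 frames
d = {
    'file_a': pd.DataFrame({'id': [1, 2], 'x': [0.1, 0.2]}),
    'file_b': pd.DataFrame({'id': [3], 'y': ['only in b']}),
}

# Columns are unioned; cells missing from a given frame become NaN
df = pd.concat(d.values(), ignore_index=True, sort=False)
print(df)
```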
pandas.merge() is a function oriented toward producing joins of tables with primary keys and foreign keys, as in SQL-style databases. See Difference Between Primary and Foreign Key.
The problem here is that you are trying to introduce values of a different dtype into an existing column (use df.dtypes to see the types of all columns in your DataFrames). That happens because pandas takes the left DataFrame passed to the function as the "base" and tries to add new records to it; since the dtypes differ, it raises an error.
In fact, the documentation presents it primarily as a pd.DataFrame method, because it behaves like (say) a "mother DataFrame that receives new rows". See the documentation for pd.DataFrame.merge.
The error also recommends using the pandas.concat method, since it sees that the dtypes are different and assumes you may just want to stack the two DataFrames. That can be preferable if no existing records share the same TrackName and Artist (for example); in that case you would join them with concat, because there is no additional information you can gain about a record from the other DataFrame.
My recommendation is: rename the columns in the 2019 DataFrame to match the 2018 DataFrame wherever they refer to the same attribute (you can use pd.DataFrame.rename), then change the dtype of the columns you want to merge on and make sure they are the same. Finally, try an outer join with the merge function, using the song name, for example. You will see whether there are matches or whether all the records are different.
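The recommendation can be sketched roughly as follows. The two tiny frames are invented stand-ins for the 2018 and 2019 data; only the column names are taken from the question, and the suffixes and indicator arguments are my own additions to make the result easier to read.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
df_2018 = pd.DataFrame({'id': ['1', '2'], 'name': ['Song A', 'Song B'], 'artists': ['X', 'Y']})
df_2019 = pd.DataFrame({'ID': [2, 3], 'TrackName': ['Song B', 'Song C'], 'ArtistName': ['Y', 'Z']})

# Step 1: give matching attributes the same names
df_2019 = df_2019.rename(columns={'ID': 'id', 'TrackName': 'name', 'ArtistName': 'artists'})

# Merging on id now would mix text keys with int64 keys; recent pandas versions
# are expected to reject that with a ValueError
try:
    pd.merge(df_2018, df_2019, on='id')
except ValueError as exc:
    print(f"merge refused mismatched key dtypes: {exc}")

# Step 2: align the dtypes of the key columns
df_2018['id'] = df_2018['id'].astype(str)
df_2019['id'] = df_2019['id'].astype(str)

# Step 3: outer join on the song attributes; _merge shows where each row came from
combined = pd.merge(df_2018, df_2019, on=['name', 'artists'], how='outer',
                    indicator=True, suffixes=('_2018', '_2019'))
print(combined)
```

Rows flagged left_only or right_only in the _merge column are songs that appear in just one of the two years.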
So you are not able to merge on ID as ID is of object datatype in one table and int in other table:
df_2018.dtypes
id object
name object
artists object
df_2019.dtypes
ID int64
TrackName object
ArtistName object
Now I tried merging two tables on 'name' and 'artists' and I was able to do that. Here is the code:
new_df = pd.merge(df_2018, df_2019, left_on=['name','artists'], right_on = ['TrackName','ArtistName'])
new_df.columns
Index(['id', 'name', 'artists', 'danceability', 'energy', 'key', 'loudness',
'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo', 'duration_ms', 'time_signature', 'ID', 'TrackName',
'ArtistName', 'Genre', 'BeatsPerMinute', 'Energy', 'Danceability',
'LoudnessdB', 'Liveness', 'Valence', 'Length', 'Acousticness',
'Speechiness', 'Popularity'],
dtype='object')
I could get all the columns as desired. Let me know if you are still facing any issues, and do share the columns for which you are facing an issue.
Well, if you declare column A as index, it works:
Both_DFs = pd.merge(df1.set_index('A'), df2.set_index('A'), how='left', left_index=True, right_index=True).dropna().reset_index()
This results in:
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
2 A3 146 K1 B3 345 D1
EDIT
You just needed:
Both_DFs = pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC']).dropna()
Which gives:
A B C BB CC DD
0 A1 121 K0 B0 121 D0
You can also use join (which defaults to a left join) or merge; finally, if necessary, remove rows with NaNs using dropna:
print (df1.join(df2.set_index('A'), on='A').dropna())
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
3 A3 146 K1 B3 345 D1
print (pd.merge(df1, df2, on='A', how='left').dropna())
A B C BB CC DD
0 A1 123 K0 B0 121 D0
1 A1 345 K1 B0 121 D0
3 A3 146 K1 B3 345 D1
EDIT:
I think you need an inner join (it is the default, so how='inner' can be omitted):
Both_DFs = pd.merge(df1,df2, left_on=['A','B'],right_on=['A','CC'])
print (Both_DFs)
A B C BB CC DD
0 A1 121 K0 B0 121 D0
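For comparison, here is a sketch of the inner join next to the left join followed by dropna. The question's df1 and df2 are not shown here, so the frames below are hypothetical and chosen only so that one row matches on both keys.

```python
import pandas as pd

# Hypothetical inputs; only the column names follow the outputs above
df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2'], 'B': [121, 345, 146], 'C': ['K0', 'K1', 'K1']})
df2 = pd.DataFrame({'A': ['A1', 'A3'], 'BB': ['B0', 'B3'], 'CC': [121, 345], 'DD': ['D0', 'D1']})

# Inner join on both key pairs: only rows where A matches and B equals CC
inner = pd.merge(df1, df2, left_on=['A', 'B'], right_on=['A', 'CC'])

# Left join then dropna keeps the same rows here, but CC has been upcast to float
left_then_drop = pd.merge(df1, df2, how='left',
                          left_on=['A', 'B'], right_on=['A', 'CC']).dropna()

print(inner)
print(left_then_drop)
```

Besides the dtype change, dropna would also discard rows that already contained NaN before the merge, so the inner join is usually the safer way to express "matches only".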