I think this is what you want to do (both CSVs I use are identical to what you have in your question):
import pandas as pd
df_1 = pd.read_csv('document1.csv')
df_2 = pd.read_csv('document2.csv')
key_cols = ['job_function', 'job_area', 'title']
merged_df = pd.merge(df_1, df_2, how='left', left_on=key_cols, right_on=key_cols)
Source: How to join two dataframes on multiple columns
Answer from Ignacio HM on Stack Overflow
According to the df.merge docs, the validate argument was added in version 0.21.0. You are using an older version, so you should update your pandas.
As @DeepSpace mentioned, you may need to upgrade your pandas.
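For reference, on pandas 0.21.0 or newer the built-in check would look something like this (a minimal sketch with made-up frames):
import pandas as pd
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'a', 'b'], 'y': [10, 11, 12]})
# Raises pandas.errors.MergeError because 'key' is duplicated on the right side.
pd.merge(left, right, on='key', validate='one_to_one')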
To replicate the check in earlier versions, you can do something like this:
import pandas as pd
# Every key in df2 that also appears in df1 occurs only once, so the keys pass the check.
df1 = pd.DataFrame(index=['a', 'a', 'b', 'b', 'c'])
df2 = pd.DataFrame(index=['a', 'b', 'c'])
x = [i for i in df2.index if i in set(df1.index)]
len(x) == len(set(x))  # True
# Here 'a' appears twice in df2, so the same check fails.
df1 = pd.DataFrame(index=['a', 'a', 'b', 'b', 'c'])
df2 = pd.DataFrame(index=['a', 'b', 'c', 'a'])
y = [i for i in df2.index if i in set(df1.index)]
len(y) == len(set(y))  # False
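If you want the same check as a reusable guard in older pandas (a hypothetical helper, not a pandas API), a minimal sketch could look like this:
def right_keys_unique(df_left, df_right):
    # Mimic the spirit of validate='one_to_one' on the right side:
    # every key in df_right that also exists in df_left must occur only once.
    matched = [i for i in df_right.index if i in set(df_left.index)]
    return len(matched) == len(set(matched))

right_keys_unique(df1, df2)  # False for the second pair of frames above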
Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name in your example, I'd do the following:
import functools as ft
import pandas as pd
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
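If you need non-default merge options, you can pre-bind them with functools.partial instead of a lambda; a small sketch (the outer join is just an illustrative choice):
import functools as ft
import pandas as pd
# Keep every name that appears in any frame by using an outer merge.
outer_on_name = ft.partial(pd.merge, on='name', how='outer')
df_final = ft.reduce(outer_on_name, dfs)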
You could try this if you have 3 dataframes:
# Merge multiple dataframes
import numpy as np
import pandas as pd

# Note: np.array coerces all values to strings here; that's fine for this illustration.
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2, on='name').merge(df3, on='name')
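As a quick sanity check, the same three frames can also be folded with the functools.reduce pattern from earlier; this sketch just reuses df1, df2, and df3 defined above:
import functools as ft
# Equivalent to the nested pd.merge calls, but scales to any number of frames.
df_all = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), [df1, df2, df3])
list(df_all.columns)
# ['name', 'attr11', 'attr12', 'attr21', 'attr22', 'attr31', 'attr32']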
You can create a new join key (i.e. helperkey) that uniquely identifies each row within your joining columns.
joincols = ['date', 'period', 'company']
# Number duplicate key rows within each dataframe (0, 1, 2, ...) so matching
# occurrences pair up one-to-one instead of multiplying in the merge.
df1m = df1.assign(helperkey=df1.groupby(joincols).cumcount())
df2m = df2.assign(helperkey=df2.groupby(joincols).cumcount())
df1m.merge(df2m, on=joincols + ['helperkey'], how='left').drop('helperkey', axis=1)
Output:
         date  period company  value
0  2025-03-01       1      aa    4.0
1  2025-03-01       1      aa    NaN
2  2025-03-02       2       b    8.0
Note: here, a helperkey column is created with groupby().cumcount(), the temporary column is added to each dataframe before the join, and it is dropped again afterwards.
Another option is to merge directly and then blank out the value on rows whose join keys are duplicated:
import numpy as np
merged = df1.merge(df2, on=['date', 'period', 'company'], how='left')
merged.loc[merged.duplicated(subset=['date', 'period', 'company']), 'value'] = np.nan
This worked for me, and it scales cleanly to large data sets as well. I replicated the code in Google Colab and got the result shown above.
