If you want to check equal values on a certain column, let's say Name, you can merge both DataFrames to a new one:
mergedStuff = pd.merge(df1, df2, on=['Name'], how='inner')
mergedStuff.head()
I think this is more efficient and faster than using where if you have a big data set.
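As a rough illustration of the inner merge described above, here is a minimal sketch. The sample rows and the extra columns Age and City are made up for the example; only the Name column comes from the answer.

```python
import pandas as pd

# Hypothetical sample data; only the "Name" column is taken from the answer.
df1 = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [31, 45, 22]})
df2 = pd.DataFrame({"Name": ["Bob", "Cy", "Dee"], "City": ["Oslo", "Rome", "Lima"]})

# An inner merge keeps only the rows whose "Name" value appears in both frames.
merged_stuff = pd.merge(df1, df2, on=["Name"], how="inner")
print(merged_stuff)
```

Rows for Ann and Dee are dropped because each appears in only one of the two frames.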
You can double-check the exact number of common and different values between two DataFrames by using isin and value_counts().
Like this:
df['your_column_name'].isin(df2['your_column_name']).value_counts()
Result:

True = common
False = different
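A small self-contained sketch of the isin / value_counts() check, assuming two frames that share a column called your_column_name (the values are invented for the example):

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"your_column_name": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"your_column_name": ["b", "d", "e"]})

# True where the value from df also occurs somewhere in df2's column.
is_common = df["your_column_name"].isin(df2["your_column_name"])

# Count how many values are common (True) and how many are not (False).
counts = is_common.value_counts()
print(counts)

# The same boolean mask can be reused to list the rows that have no match.
only_in_df = df[~is_common]
print(only_in_df)
```

Note that isin checks membership, not position, so a value counts as common wherever it occurs in the other column.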
Just do a simple merge followed by dropna:
df2.merge(df1, how='left').dropna().drop(columns='Security')
Out[318]:
userID ID Sex Date Month Year
4 John 45 Male 31 3 1975
5 Tom 22 Male 1 1 1990
7 Hary 56 Male 15 9 1970
Define the key columns which you want to merge on, and then perform an inner merge between df2 and only the key columns of df1. The default for merge is inner, so you don't need to specify it explicitly. Subsetting df1 to only these key columns ensures that you don't bring any of its columns over to df2 with the merge.
key_cols = ['userID', 'ID', 'Date', 'Month', 'Year']
df2.merge(df1.loc[:, df1.columns.isin(key_cols)])
Outputs:
userID ID Sex Date Month Year
0 John 45 Male 31 3 1975
1 Tom 22 Male 1 1 1990
2 Hary 56 Male 15 9 1970
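The key-column merge can be tried end to end with the sketch below. The input frames are reconstructed guesses: the three matching rows mirror the output shown above, while the non-matching rows and the Security values are invented.

```python
import pandas as pd

# Hypothetical inputs; df1 carries an extra "Security" column we do not want.
df1 = pd.DataFrame({
    "userID": ["John", "Tom", "Hary"],
    "ID": [45, 22, 56],
    "Date": [31, 1, 15],
    "Month": [3, 1, 9],
    "Year": [1975, 1990, 1970],
    "Security": ["A", "B", "C"],
})
df2 = pd.DataFrame({
    "userID": ["Mary", "John", "Tom", "Lena", "Hary"],
    "ID": [30, 45, 22, 41, 56],
    "Sex": ["Female", "Male", "Male", "Female", "Male"],
    "Date": [2, 31, 1, 9, 15],
    "Month": [7, 3, 1, 12, 9],
    "Year": [1985, 1975, 1990, 1981, 1970],
})

key_cols = ["userID", "ID", "Date", "Month", "Year"]

# Inner merge against only the key columns of df1, so "Security" never
# reaches the result; rows of df2 without a match in df1 are dropped.
matches = df2.merge(df1[key_cols], on=key_cols, how="inner")
print(matches)
```

Passing on=key_cols explicitly is optional here, since merge defaults to the columns the two frames share, but it makes the intent easier to read.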
Still trying to wrap my head around Pandas (and continuing to be blown away by its capabilities every day)...
Say I have 2 dataframes (let's call them left_test_df and right_test_df), and I want to see which rows are only present in one of them. Is something like this the best way to go about doing it?
merged_df=pd.merge(left_test_df, right_test_df, how='right', indicator=True)
final_df=merged_df[merged_df['_merge']=='right_only']
import pandas as pd
import numpy as np
old = pd.DataFrame({
    "ID": ["AA", "BB", "CC"],
    "Rating": ["High", "Low", "Medium"],
    "Status": ["On track", "Monitor", "On track"]
})
new = pd.DataFrame({
    "ID": ["AA", "BB", "CC", "DD"],
    "Rating": ["Medium", "High", "Medium", "Low"],
    "Status": ["On track", "On track", "On track", "Monitor"]
})
(
    old
    # join the two dataframes using the ID column as a key
    .merge(new, how="outer", on="ID", suffixes=("_old", "_new"))
    # compare columns between the old and new dataframes and assign new values
    .assign(
        Rating = lambda x: np.select(
            [x["Rating_new"].notna() & x["Rating_old"].isna(), x["Rating_new"] != x["Rating_old"]],
            ["New", "From '" + x["Rating_old"] + "' To '" + x["Rating_new"] + "'"],
            default=np.nan
        ),
        Status = lambda x: np.select(
            [x["Status_new"].notna() & x["Status_old"].isna(), x["Status_new"] != x["Status_old"]],
            ["New", "From '" + x["Status_old"] + "' To '" + x["Status_new"] + "'"],
            default=np.nan
        )
    )
    # select final columns
    .loc[:, ["ID", "Rating", "Status"]]
)
| ID | Rating | Status |
|---|---|---|
| AA | From 'High' To 'Medium' | nan |
| BB | From 'Low' To 'High' | From 'Monitor' To 'On track' |
| CC | nan | nan |
| DD | New | New |
Merge (left join) the current table onto the previous table; you will then have all 4 columns in one dataframe. You can then concatenate the columns to get the desired result.
Please share dataframe creation code if you need help with code creat
The first part is similar to Constantine's answer: you can get a boolean Series showing which rows have changed*:
In [21]: ne = (df1 != df2).any(axis=1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
Then we can see which entries have changed:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
Here the first index level is the row label and the second is the column that has changed.
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index) (older pandas versions also accepted df1.index & df2.index), but I think I'll leave that as an exercise.
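One possible way to handle the shared-labels exercise from the note is sketched below. The two frames and their values are hypothetical and only loosely follow the score / isEnrolled columns used above; the approach assumes both frames have the same columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frames whose indexes only partly overlap (labels 1 and 2).
df1 = pd.DataFrame(
    {"score": [1.11, 4.12, 2.50], "isEnrolled": [False, True, True]},
    index=pd.Index([0, 1, 2], name="id"),
)
df2 = pd.DataFrame(
    {"score": [4.12, 2.75, 9.99], "isEnrolled": [False, True, False]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Restrict both frames to the labels they have in common before comparing.
shared = df1.index.intersection(df2.index)
a = df1.loc[shared]
b = df2.loc[shared]

# Same steps as in the answer: stack the boolean mask and keep the True cells.
ne = a != b
ne_stacked = ne.stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ["id", "col"]

# Pull the old and new values at the positions that differ.
rows, cols = np.where(ne.to_numpy())
report = pd.DataFrame(
    {"from": a.to_numpy()[rows, cols], "to": b.to_numpy()[rows, cols]},
    index=changed.index,
)
print(report)
```

Rows that exist in only one frame (labels 0 and 3 here) are ignored by this sketch; they would need separate handling, for example with a merge that uses indicator=True.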
Highlighting the difference between two DataFrames
It is possible to use the DataFrame style property to highlight the background color of the cells where there is a difference.
This uses the example data from the original question.
The first step is to concatenate the DataFrames horizontally with the concat function and distinguish each frame with the keys parameter:
df_all = pd.concat([df.set_index('id'), df2.set_index('id')],
                   axis='columns', keys=['First', 'Second'])
df_all

It's probably easier to swap the column levels and put the same column names next to each other:
df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

Now it's much easier to spot the differences in the frames. But we can go further and use the style property to highlight the cells that are different. We define a custom function to do this, following the approach described in the pandas styling documentation.
def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)
df_final.style.apply(highlight_diff, axis=None)

This will also highlight cells where both frames have missing values. You can either fill them or provide extra logic so that they don't get highlighted.
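As one possible form of that extra logic, the sketch below skips cells that are missing in both frames. The function name highlight_diff_skip_nan and the sample data are hypothetical, and unlike the function above it marks only the 'Second' column of each pair.

```python
import numpy as np
import pandas as pd

def highlight_diff_skip_nan(data, color="yellow"):
    """Return CSS strings marking 'Second' cells that differ from 'First'."""
    attr = "background-color: {}".format(color)
    first = data.xs("First", axis="columns", level=-1)
    second = data.xs("Second", axis="columns", level=-1)
    # NaN != NaN is True, so explicitly exclude cells missing on both sides.
    differs = first.ne(second) & ~(first.isna() & second.isna())
    styles = pd.DataFrame("", index=data.index, columns=data.columns)
    for col in differs.columns:
        styles[(col, "Second")] = np.where(differs[col], attr, "")
    return styles

# Hypothetical sample data: id 2 is missing in both frames for "score".
df = pd.DataFrame({"id": [1, 2, 3],
                   "score": [1.0, np.nan, 3.0],
                   "comment": ["a", None, "c"]})
df2 = pd.DataFrame({"id": [1, 2, 3],
                    "score": [1.0, np.nan, 3.5],
                    "comment": ["a", "late", "c"]})

df_all = pd.concat([df.set_index("id"), df2.set_index("id")],
                   axis="columns", keys=["First", "Second"])
df_final = df_all.swaplevel(axis="columns")[df.columns[1:]]

styles = highlight_diff_skip_nan(df_final)
print(styles)
```

To render the result, the function can be passed to the styler in the same way as before: df_final.style.apply(highlight_diff_skip_nan, axis=None).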