Simply add reset_index() to realign the aggregates into a new dataframe.
Additionally, the size() function creates an unnamed column labeled 0, which you can use to filter for duplicate rows. Then just take the length of the resulting dataframe to get a count of duplicates, as you would with the other approaches: drop_duplicates(), duplicated()==True.
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()
size[size[0] > 1] # DATAFRAME OF DUPLICATES
len(size[size[0] > 1]) # NUMBER OF DUPLICATES
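For example, a minimal runnable sketch with hypothetical data (the columns a and b are assumptions; the last row duplicates the first):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': ['x', 'y', 'x']})
size = df.groupby(df.columns.tolist()).size().reset_index()
print(size[size[0] > 1])       # the duplicated row group with its count: a=1, b='x', 0=2
print(len(size[size[0] > 1]))  # 1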
Answer from Parfait on Stack Overflow
You don't need groupby to drop duplicates based on a few columns; you can specify a subset instead:
df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]:
date
2005 3
2006 10
2007 227
2008 52
2009 142
2010 57
2011 219
2012 99
2013 238
2014 146
dtype: int64
1. groupby.head(1)
The relevant groupby method to drop duplicates in each group is groupby.head(1). Note that it is important to pass 1 to select the first row of each date-cid pair.
df1 = df.groupby(['date', 'cid']).head(1)
2. duplicated() is more flexible
Another method is to use duplicated() to create a boolean mask and filter.
df3 = df[~df.duplicated(['date', 'cid'])]
An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:
df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
3. groupby.sample(1)
Another method to select a unique row from each group is to use groupby.sample(). Unlike the previous methods, it selects a row from each group randomly (whereas the others only keep the first row of each group).
df4 = df.groupby(['date', 'cid']).sample(n=1)
You can verify that df1, df2 (ayhan's output) and df3 all produce the same output, and that df4 produces an output where the size and nunique of cid match for each date (as required in the OP). In short, the following returns True.
w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z) # True
and w, x, y, z all look like the following:
size nunique
date
2005 7 3
2006 237 10
2007 3610 227
2008 1318 52
2009 2664 142
2010 997 57
2011 6390 219
2012 2904 99
2013 7875 238
2014 3979 146
You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
userid itemid
4 2 1
5 2 2
6 2 3
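For concreteness, here is a minimal sketch with an assumed input (the question's data is not shown) that reproduces the output above; userid 1 has duplicated (userid, itemid) pairs while userid 2 does not:
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 2, 2, 1, 2, 3]})

# Drop every userid that has any duplicated (userid, itemid) pair
out = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
print(out)   # rows 4-6: userid 2 with itemid 1, 2, 3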
filter
Was made for this. You can pass a function that returns a boolean that determines if the group passed the filter or not.
filter and value_counts
Most generalizable and intuitive
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)
filter and is_unique
special case when looking for n < 2
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
userid itemid
4 2 1
5 2 2
6 2 3
The easiest way to "get around" this odd Pandas functionality is to generate a mask using df.duplicated(col_name) | df.duplicated(col_name, take_last=True). The bitwise or means that the series you generate is True for all duplicates.
Follow this by using the indexes to set the values you want, either from the original name or a new name with the number in front.
In your case below:
# Generating your DataFrame
df_attachment = pd.DataFrame(index=range(5))
df_attachment['ID'] = [1, 2, 3, 4, 5]
df_attachment['File Name'] = ['Text.csv', 'TEXT.csv', 'unique.csv',
'unique2.csv', 'text.csv']
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()
# Answer from here, mask generation over two lines for readability
mask = df_attachment.duplicated('LowerFileName')
mask = mask | df_attachment.duplicated('LowerFileName', take_last=True)
df_attachment['Duplicate'] = mask
# New column names if possible
df_attachment['number_name'] = df_attachment['ID'].astype(str) + df_attachment['File Name']
# Set the final unique name column using the mask already generated
df_attachment.loc[mask, 'UniqueFileName'] = df_attachment.loc[mask, 'number_name']
df_attachment.loc[~mask, 'UniqueFileName'] = df_attachment.loc[~mask, 'File Name']
# Drop the intermediate column used
del df_attachment['number_name']
And the final df_attachment:
ID File Name LowerFileName Duplicate UniqueFileName
0 1 Text.csv text.csv True 1Text.csv
1 2 TEXT.csv text.csv True 2TEXT.csv
2 3 unique.csv unique.csv False unique.csv
3 4 unique2.csv unique2.csv False unique2.csv
4 5 text.csv text.csv True 5text.csv
This method uses vectorised pandas operations and indexing so should be quick for any size DataFrame.
EDIT: 2017-03-28
Someone gave this a vote yesterday so I thought I would edit this to say that this has been supported natively by pandas since 0.17.0, see the changes here: http://pandas.pydata.org/pandas-docs/version/0.19.2/whatsnew.html#v0-17-0-october-9-2015
Now you can use the keep argument of drop_duplicates and duplicated and set it to False to mark all duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
So the lines above that generate the Duplicate column become:
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)
Perhaps the use of groupby together with a lambda expression can achieve your objective:
gb = df.groupby('Lower File Name')['Lower File Name'].count()
duplicates = gb[gb > 1].index.tolist()
df['UniqueFileName'] = df.apply(
    lambda x: '{0}{1}'.format(x.ID if x['Lower File Name'] in duplicates else "",
                              x['File Name']),
    axis=1)
>>> df
ID File Name Lower File Name Duplicate UniqueFileName
0 1 Text.csv text.csv False 1Text.csv
1 2 TEXT.csv text.csv True 2TEXT.csv
2 3 unique.csv unique.csv False 3unique.csv
3 4 unique2.csv unique2.csv False Noneunique2.csv
4 5 text.csv text.csv True 5text.csv
5 6 uniquE.csv unique.csv True 6uniquE.csv
The lambda expression generates a unique filename per the OP's requirements by prepending File Name with the relevant ID only in the event that the Lower File Name is duplicated (i.e. there is more than one file with the same lower-case file name). Otherwise, it just uses the original File Name without an ID.
Note that this solution does not use the Duplicate column in the above DataFrame.
Also, wouldn't it be simpler to just append the ID to the Lower File Name in order to generate a unique name? You wouldn't need the solution above and wouldn't even need to check for duplicates, assuming the ID is unique.
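A one-line sketch of that simpler alternative (assuming ID is unique):
df['UniqueFileName'] = df['ID'].astype(str) + df['Lower File Name']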
This should be what you are looking for, but I'm not sure if there's an easier way:
In [5]: df.groupby(['locID','userId']).last().groupby(level='locID').size()
Out[5]:
locID
loc1 3
loc2 1
loc3 2
loc4 1
dtype: int64
Taking the last of each group will remove duplicates.
There's a Series (groupby) method just for this: nunique.
In [11]: df # Note the duplicated row I appended at the end
Out[11]:
userID locationID
0 1 loc1
1 1 loc2
2 1 loc3
3 2 loc1
4 3 loc4
5 3 loc3
6 3 loc1
7 3 loc1
In [12]: g = df.groupby('locationID')
In [13]: g['userID'].nunique()
Out[13]:
locationID
loc1 3
loc2 1
loc3 2
loc4 1
dtype: int64
You need duplicated with the subset parameter to specify the columns to check and keep=False to mark all duplicates, then filter with the resulting boolean mask:
df = df[df.duplicated(subset=['val1','val2'], keep=False)]
print (df)
id val1 val2
0 1 1.1 2.2
1 1 1.1 2.2
3 3 8.8 6.2
4 4 1.1 2.2
5 5 8.8 6.2
Detail:
print (df.duplicated(subset=['val1','val2'], keep=False))
0 True
1 True
2 False
3 True
4 True
5 True
dtype: bool
Another method is to compute the size of each group and only keep the rows whose group size is greater than 1.
msk = df.groupby(['val1', 'val2'])['val1'].transform('size') > 1
df1 = df[msk]
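Both variants can be checked against an assumed input like the following (row 2 holds hypothetical non-duplicated values, since it does not appear in the output above):
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 2, 3, 4, 5],
                   'val1': [1.1, 1.1, 2.1, 8.8, 1.1, 8.8],
                   'val2': [2.2, 2.2, 3.3, 6.2, 2.2, 6.2]})

print(df[df.duplicated(subset=['val1', 'val2'], keep=False)])
print(df[df.groupby(['val1', 'val2'])['val1'].transform('size') > 1])  # same rows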

The best way would be to use drop_duplicates(). If you have a larger DataFrame and only want those two columns checked, set subset to the list of columns you want checked.
df = df.drop_duplicates()
or
df = df.drop_duplicates(subset=['userid', 'itemid'])
To avoid reassignment, use inplace=True:
df.drop_duplicates(inplace=True)
This is the same as
df = df.drop_duplicates()
Using groupby.agg
yourdf=df.groupby('id',as_index=False).agg({'interest':','.join,'location':'first'})
yourdf
Out[140]:
id interest location
0 1 A,B X
1 2 A,D Y
2 3 C Z
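With an assumed input such as the following (not shown in the original answer), the call above reproduces that result:
import pandas as pd

df = pd.DataFrame({'id':       [1, 1, 2, 2, 3],
                   'interest': ['A', 'B', 'A', 'D', 'C'],
                   'location': ['X', 'X', 'Y', 'Y', 'Z']})

yourdf = df.groupby('id', as_index=False).agg({'interest': ','.join, 'location': 'first'})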
A somewhat clumsy but working solution. Quite similar to what Wen-Ben proposed, except that it works with an arbitrary number of columns, sorts the items before aggregation, and also aggregates locations.
result = df.groupby('id').apply(
    lambda x: pd.Series({name: ','.join(sorted(set(x[name]))) for name in x})
).reset_index()
# id interest location
#0 1 A,B X
#1 2 A,D Y
#2 3 C Z
You're looking for groupby and nunique:
df.groupby('cuisine', sort=False).name.nunique().to_frame('count')
count
cuisine
Chinese 1
Indian 2
French 2
Will return the count of unique items per group.
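For reference, an assumed dataset that would produce these counts (the names are hypothetical; what matters is one distinct name for Chinese and two each for Indian and French). The same frame also works with the crosstab approach below:
import pandas as pd

df = pd.DataFrame({'name':    ['Ping', 'Raj', 'Priya', 'Marie', 'Jean', 'Raj'],
                   'cuisine': ['Chinese', 'Indian', 'Indian', 'French', 'French', 'Indian']})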
Using crosstab
pd.crosstab(df.name,df.cuisine).ne(0).sum()
Out[550]:
cuisine
Chinese 1
French 2
Indian 2
dtype: int64
In another case, when you have a dataset with several duplicated column names and you don't want to select them separately, use:
df.groupby(by=df.columns, axis=1).sum()
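A small sketch of what that looks like, using a hypothetical frame with two columns that share the name 'sales'. Note that axis=1 in groupby is deprecated in recent pandas, where df.T.groupby(level=0).sum().T gives the same result:
import pandas as pd

df = pd.DataFrame([[1, 2, 10], [3, 4, 20]], columns=['sales', 'sales', 'qty'])
summed = df.groupby(by=df.columns, axis=1).sum()
# summed has a single 'sales' column holding 1+2 and 3+4, plus 'qty' unchanged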
You may use
df2 = df.groupby(['address']).sum()
or
df2 = df.groupby(['address']).agg('sum')
If there are columns other than balances for which you want to take only the first or the max value, or compute the mean instead of the sum, you can do the following:
d = {'address': ["A", "A", "B"], 'balances': [30, 40, 50], 'sessions': [2, 3, 4]}
df = pd.DataFrame(data=d)
df2 = df.groupby(['address']).agg({'balances': 'sum', 'sessions': 'mean'})
That outputs
balances sessions
address
A 70 2.5
B 50 4.0
You may add as_index=False to the groupby arguments to get:
address balances sessions
0 A 70 2.5
1 B 50 4.0
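For reference, that call would be:
df2 = df.groupby(['address'], as_index=False).agg({'balances': 'sum', 'sessions': 'mean'})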