Simply call reset_index() to realign the aggregates into a new DataFrame.

Additionally, size() produces an unnamed column labeled 0, which you can use to filter for duplicate rows. Then take the length of the resulting DataFrame to get a count of duplicates, just as you would with the other functions: drop_duplicates() or duplicated() == True.

# group on every column so only fully identical rows fall together
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()   # counts land in the unnamed column 0
size[size[0] > 1]        # DATAFRAME OF DUPLICATES

len(size[size[0] > 1])   # NUMBER OF DUPLICATES
Answer from Parfait on Stack Overflow
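
As a self-contained illustration of the same recipe, here is a minimal sketch on a hypothetical toy DataFrame (the key/val columns are invented for the example):

import pandas as pd

# toy data: the ('a', 1) row appears twice
df = pd.DataFrame({'key': ['a', 'a', 'b', 'c'],
                   'val': [1, 1, 2, 3]})

size = df.groupby(df.columns.tolist()).size().reset_index()
dupes = size[size[0] > 1]   # combinations occurring more than once

print(dupes)        #   key  val  0
                    # 0   a    1  2
print(len(dupes))   # 1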
Joel McCune (joelmccune.com): GroupBy and Aggregate Using Pandas
April 12, 2021 -

df_alias = df.drop(columns=aggregate_column).set_index(groupby_column)
agg_df.join(df_alias).reset_index(groupby_column).drop_duplicates(groupby_column).reset_index(drop=True)

The output, initially 21 rows, is now only 10. If we want to consolidate this into a succinct function, one easily copied and pasted, here is what it looks like.

import pandas as pd

def aggregate_column(df: pd.DataFrame,
                     groupby_column: str = 'name',
                     aggregate_column: str = 'data_collection') -> pd.DataFrame:
    # make sure the columns are in the dataframe
    assert groupby_column in df.columns, (f'"groupby_column", {groupby_column}, '
                                          'does not appear to be in the input '
                                          'DataFrame columns.')
    assert aggregate_column in df.columns, (f'"aggregate_column", {aggregate_column}, '
                                            'does not appear to be in the input '
                                            'DataFrame columns.')
pandas (pandas.pydata.org): Group by: split-apply-combine — pandas 2.3.0 documentation
In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'
   ....:

In [14]: grouped = df.T.groupby(get_letter_type)

pandas Index objects support duplicate values.
IncludeHelp (includehelp.com): Python - Pandas groupby.apply() method duplicates first group
October 21, 2022 - However, this problem has been fixed in newer releases: versions older than v0.25 commonly produced duplicate rows in the result, but in current versions the issue is resolved, so we can use the apply() method directly on groupby() and print the groupby object.
note.nkmk.me: pandas: Find, count, drop duplicates (duplicated, drop_duplicates)
January 26, 2024 - Use groupby() to aggregate values based on duplicates. In the following examples, the average values of the numeric columns (age and point) are calculated for duplicated values in the state column.

df = pd.read_csv('data/src/sample_pandas_n...
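
A minimal sketch of that idea, assuming a frame with the state, age, and point columns mentioned in the snippet:

import pandas as pd

df = pd.DataFrame({'state': ['NY', 'NY', 'CA'],
                   'age': [24, 30, 57],
                   'point': [64, 74, 56]})

# average the numeric columns across rows sharing the same state
df.groupby('state', as_index=False)[['age', 'point']].mean()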
GitHub (github.com/dask/dask): DataFrame groupby apply returns unexpected duplicated MultiIndex · Issue #4592
March 14, 2019 - Reported by bchu:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame([[1, 2], [2, 3], [3, 4]], columns=['a', 'b'])
# df = df.astype({'b': 'category'})
df = dd.from_pandas(df, npartitions=1).set_index('a')
df = df.groupby('a').apply(lambda d: d.copy())
df.compute()

returns a dataframe with a MultiIndex where both levels are just copies of the original index:

     b
a a
1 1  2
2 2  3
3 3  4
Top answer (1 of 2, score 55)

You don't need groupby to drop duplicates based on a few columns; you can specify a subset instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
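
If you want to see the pattern end to end, here is a minimal sketch on a small, hypothetical frame (the date/cid values are invented):

import pandas as pd

df = pd.DataFrame({'date': [2005, 2005, 2005, 2006],
                   'cid': [1, 1, 2, 1]})

df2 = df.drop_duplicates(['date', 'cid'])   # one row per (date, cid) pair
df2.groupby('date').cid.size()
# date
# 2005    2
# 2006    1
# Name: cid, dtype: int64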
Answer 2 of 2 (score 5)

1. groupby.head(1)

The relevant groupby method to drop duplicates in each group is groupby.head(1). Note that it is important to pass 1 to select the first row of each date-cid pair.

df1 = df.groupby(['date', 'cid']).head(1)
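
A point worth noting: head(1) returns the surviving rows in their original order in the frame. A minimal sketch with invented data:

import pandas as pd

df = pd.DataFrame({'date': [2006, 2005, 2006],
                   'cid': [1, 1, 1],
                   'val': ['x', 'y', 'z']})

# keeps row 0 for (2006, 1) and row 1 for (2005, 1), in original index order
df.groupby(['date', 'cid']).head(1)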

2. duplicated() is more flexible

Another method is to use duplicated() to create a boolean mask and filter.

df3 = df[~df.duplicated(['date', 'cid'])]

An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:

df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]

3. groupby.sample(1)

Another method to select a unique row from each group is to use groupby.sample(). Unlike the previous methods mentioned, it selects a row from each group randomly (whereas the others only keep the first row from each group).

df4 = df.groupby(['date', 'cid']).sample(n=1)
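
Since the draw is random, you may want to pin it with random_state when you need reproducible output; a minimal sketch:

# the same rows are sampled from each group on every run
df4 = df.groupby(['date', 'cid']).sample(n=1, random_state=0)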

You can verify that df1, df2 (ayhan's output) and df3 all produce exactly the same output, and that df4 produces an output where size and nunique of cid match for each date (as required in the OP). In short, the following returns True.

w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z)   # True

and w, x, y, z all look like the following:

       size  nunique
date        
2005      7        3
2006    237       10
2007   3610      227
2008   1318       52
2009   2664      142
2010    997       57
2011   6390      219
2012   2904       99
2013   7875      238
2014   3979      146
GitHub (github.com/pandas-dev/pandas): Groupby('colname').nth(0) results in index with duplicate ...
June 4, 2019 - A df.groupby('colname').nth(0) operation should return a new DataFrame with an index that contains no duplicates; each unique colname will become a single value in the resulting index.
Top answer (1 of 4, score 2)

The easiest way to "get around" this odd Pandas functionality is to generate a mask using df.duplicated(col_name) | df.duplicated(col_name, take_last=True). The bitwise or means that the series you generate is True for all duplicates.

Follow this by using the mask to set the values you want: either the original name, or a new name with the ID number in front.

In your case below:

# Generating your DataFrame
df_attachment = pd.DataFrame(index=range(5))
df_attachment['ID'] = [1, 2, 3, 4, 5]
df_attachment['File Name'] = ['Text.csv', 'TEXT.csv', 'unique.csv',
                             'unique2.csv', 'text.csv']
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()


# Answer from here, mask generation over two lines for readability
# (note: take_last= was later renamed to keep= in pandas 0.17.0; see edit below)
mask = df_attachment.duplicated('LowerFileName')
mask = mask | df_attachment.duplicated('LowerFileName', take_last=True)
df_attachment['Duplicate'] = mask

# New column names if possible
df_attachment['number_name'] = df_attachment['ID'].astype(str) + df_attachment['File Name']

# Set the final unique name column using the mask already generated
df_attachment.loc[mask, 'UniqueFileName'] = df_attachment.loc[mask, 'number_name']
df_attachment.loc[~mask, 'UniqueFileName'] = df_attachment.loc[~mask, 'File Name']

# Drop the intermediate column used
del df_attachment['number_name']

And the final df_attachment:

   ID    File Name LowerFileName  Duplicate UniqueFileName
0   1     Text.csv      text.csv       True      1Text.csv
1   2     TEXT.csv      text.csv       True      2TEXT.csv
2   3   unique.csv    unique.csv      False     unique.csv
3   4  unique2.csv   unique2.csv      False    unique2.csv
4   5     text.csv      text.csv       True      5text.csv

This method uses vectorised pandas operations and indexing so should be quick for any size DataFrame.

EDIT: 2017-03-28

Someone gave this a vote yesterday so I thought I would edit this to say that this has been supported natively by pandas since 0.17.0, see the changes here: http://pandas.pydata.org/pandas-docs/version/0.19.2/whatsnew.html#v0-17-0-october-9-2015

Now you can use the keep argument of drop_duplicates and duplicated and set it to False to mark all duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html

So the lines above generating the Duplicate column become:

df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)
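
For reference, here is a minimal sketch of how the three keep settings differ (toy data invented for the example):

import pandas as pd

df_demo = pd.DataFrame({'name': ['text.csv', 'text.csv', 'unique.csv']})

df_demo.duplicated('name')                 # keep='first': False, True, False
df_demo.duplicated('name', keep='last')    # True, False, False
df_demo.duplicated('name', keep=False)     # flags every copy: True, True, False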

Answer 2 of 4 (score 1)

Perhaps the use of groupby together with a lambda expression can achieve your objective:

gb = df.groupby('Lower File Name')['Lower File Name'].count()
duplicates = gb[gb > 1].index.tolist()
df['UniqueFileName'] = \
    df.apply(lambda x: '{0}{1}'.format(x.ID if x['Lower File Name'] in duplicates
                                       else "", x['File Name']), axis=1)

>>> df
   ID    File Name Lower File Name  Duplicate UniqueFileName
0   1     Text.csv        text.csv      False      1Text.csv
1   2     TEXT.csv        text.csv       True      2TEXT.csv
2   3   unique.csv      unique.csv      False    3unique.csv
3   4  unique2.csv     unique2.csv      False    unique2.csv
4   5     text.csv        text.csv       True      5text.csv
5   6   uniquE.csv      unique.csv       True    6uniquE.csv

The lambda expression generates a unique filename per the OP's requirements by prepending File Name with the relevant ID only in the event that the Lower File Name is duplicated (i.e. there is more than one file with the same lower case file name). Otherwise, it just uses the original filename without an ID.

Note that this solution does not use the Duplicate column in the above DataFrame.

Also, wouldn't it be simpler just to append the ID to the Lower File Name in order to generate a unique name? You wouldn't need the solution above and wouldn't even need to check for duplicates, assuming the ID is unique.
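
That simpler alternative would look roughly like this, assuming the ID column is unique as stated:

# a unique ID prefix guarantees a unique name without any duplicate check
df['UniqueFileName'] = df['ID'].astype(str) + df['Lower File Name']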

Marsja (marsja.se): Pandas drop_duplicates(): How to Drop Duplicated Rows
August 22, 2023 - Finally, before going on and deleting duplicate rows we can use Pandas groupby() and size() to count the duplicated rows:
Net Informations (net-informations.com): Finding and removing duplicate rows in Pandas DataFrame
Pandas drop_duplicates() returns only the DataFrame's unique values, optionally considering only certain columns, and can drop all duplicate rows across multiple columns in Python Pandas.
Stack Overflow (stackoverflow.com): python - Using groupby and duplicate in pandas
>>> from datar.all import f, tribble, duplicated, distinct, group_by, filter
>>>
>>> df = tribble(
...     f.product, f.trade, f.crop,
...     "Fungi",   "VIC",   "Grapes",
...     "ASH",     "CAN",   "APPLE",
...     "FUNGI",   "CAN",   "SEED",
...     "FUNGI",   "CAN",   "SEED",
... )
>>>
>>> df >> group_by(f.product, f.crop) >> filter(~duplicated(f.trade))
    product    trade      crop
   <object> <object>  <object>
0     Fungi      VIC    Grapes
1       ASH      CAN     APPLE
2     FUNGI      CAN      SEED
[Groups: product, crop (n=3)]