Simply add reset_index() to realign the aggregates into a new dataframe.
Additionally, the size() function creates an unnamed column labeled 0, which you can use to filter for duplicate rows. Then just take the length of the resulting dataframe to get a count of duplicates, as you would with the other approaches: drop_duplicates(), duplicated()==True.
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()
size[size[0] > 1] # DATAFRAME OF DUPLICATES
len(size[size[0] > 1]) # NUMBER OF DUPLICATES
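For example, a minimal runnable sketch with hypothetical data (the columns a and b are assumptions; the last row duplicates the first):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': ['x', 'y', 'x']})
size = df.groupby(df.columns.tolist()).size().reset_index()
print(size[size[0] > 1])       # the duplicated row group with its count: a=1, b='x', 0=2
print(len(size[size[0] > 1]))  # 1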
Answer from Parfait on Stack Overflow
You don't need groupby to drop duplicates based on a few columns; you can specify a subset instead:
df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]:
date
2005 3
2006 10
2007 227
2008 52
2009 142
2010 57
2011 219
2012 99
2013 238
2014 146
dtype: int64
1. groupby.head(1)
The relevant groupby method to drop duplicates in each group is groupby.head(1). Note that it is important to pass 1 to select the first row of each date-cid pair.
df1 = df.groupby(['date', 'cid']).head(1)
2. duplicated() is more flexible
Another method is to use duplicated() to create a boolean mask and filter.
df3 = df[~df.duplicated(['date', 'cid'])]
An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:
df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
3. groupby.sample(1)
Another method to select a unique row from each group is to use groupby.sample(). Unlike the previous methods, it selects a row from each group randomly (whereas the others only keep the first row of each group).
df4 = df.groupby(['date', 'cid']).sample(n=1)
You can verify that df1, df2 (ayhan's output) and df3 all produce the same output, and that df4 produces an output where the size and nunique of cid match for each date (as required in the OP). In short, the following returns True.
w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z) # True
and w, x, y, z all look like the following:
size nunique
date
2005 7 3
2006 237 10
2007 3610 227
2008 1318 52
2009 2664 142
2010 997 57
2011 6390 219
2012 2904 99
2013 7875 238
2014 3979 146
You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
userid itemid
4 2 1
5 2 2
6 2 3
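For concreteness, here is a minimal sketch with an assumed input (the question's data is not shown) that reproduces the output above; userid 1 has duplicated (userid, itemid) pairs while userid 2 does not:
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 2, 2, 1, 2, 3]})

# Drop every userid that has any duplicated (userid, itemid) pair
out = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
print(out)   # rows 4-6: userid 2 with itemid 1, 2, 3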
filter
Was made for this. You can pass a function that returns a boolean that determines if the group passed the filter or not.
filter and value_counts
Most generalizable and intuitive
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)
filter and is_unique
special case when looking for n < 2
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
userid itemid
4 2 1
5 2 2
6 2 3
The easiest way to "get around" this odd Pandas functionality is to generate a mask using df.duplicated(col_name) | df.duplicated(col_name, take_last=True). The bitwise or means that the series you generate is True for all duplicates.
Follow this by using the indexes to set the values you want, either from the original name or a new name with the number in front.
In your case below:
# Generating your DataFrame
df_attachment = pd.DataFrame(index=range(5))
df_attachment['ID'] = [1, 2, 3, 4, 5]
df_attachment['File Name'] = ['Text.csv', 'TEXT.csv', 'unique.csv',
'unique2.csv', 'text.csv']
df_attachment['LowerFileName'] = df_attachment['File Name'].str.lower()
# Answer from here, mask generation over two lines for readability
mask = df_attachment.duplicated('LowerFileName')
mask = mask | df_attachment.duplicated('LowerFileName', take_last=True)
df_attachment['Duplicate'] = mask
# New column names if possible
df_attachment['number_name'] = df_attachment['ID'].astype(str) + df_attachment['File Name']
# Set the final unique name column using the mask already generated
df_attachment.loc[mask, 'UniqueFileName'] = df_attachment.loc[mask, 'number_name']
df_attachment.loc[~mask, 'UniqueFileName'] = df_attachment.loc[~mask, 'File Name']
# Drop the intermediate column used
del df_attachment['number_name']
And the final df_attachment:
ID File Name LowerFileName Duplicate UniqueFileName
0 1 Text.csv text.csv True 1Text.csv
1 2 TEXT.csv text.csv True 2TEXT.csv
2 3 unique.csv unique.csv False unique.csv
3 4 unique2.csv unique2.csv False unique2.csv
4 5 text.csv text.csv True 5text.csv
This method uses vectorised pandas operations and indexing so should be quick for any size DataFrame.
EDIT: 2017-03-28
Someone gave this a vote yesterday so I thought I would edit this to say that this has been supported natively by pandas since 0.17.0, see the changes here: http://pandas.pydata.org/pandas-docs/version/0.19.2/whatsnew.html#v0-17-0-october-9-2015
Now you can use the keep argument of drop_duplicates and duplicated and set it to False to mark all duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
So the lines above that generate the Duplicate column become:
df_attachment['Duplicate'] = df_attachment.duplicated('LowerFileName', keep=False)
Perhaps the use of groupby together with a lambda expression can achieve your objective:
gb = df.groupby('Lower File Name')['Lower File Name'].count()
duplicates = gb[gb > 1].index.tolist()
df['UniqueFileName'] = df.apply(
    lambda x: '{0}{1}'.format(x.ID if x['Lower File Name'] in duplicates else "",
                              x['File Name']),
    axis=1)
>>> df
ID File Name Lower File Name Duplicate UniqueFileName
0 1 Text.csv text.csv False 1Text.csv
1 2 TEXT.csv text.csv True 2TEXT.csv
2 3 unique.csv unique.csv False 3unique.csv
3 4 unique2.csv unique2.csv False Noneunique2.csv
4 5 text.csv text.csv True 5text.csv
5 6 uniquE.csv unique.csv True 6uniquE.csv
The lambda expression generates a unique filename per the OP's requirements by prepending File Name with the relevant ID only in the event that the Lower File Name is duplicated (i.e. there is more than one file with the same lower-case file name). Otherwise, it just uses the original File Name without an ID.
Note that this solution does not use the Duplicate column in the above DataFrame.
Also, wouldn't it be simpler to just append the ID to the Lower File Name in order to generate a unique name? You wouldn't need the solution above and wouldn't even need to check for duplicates, assuming the ID is unique.
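A one-line sketch of that simpler alternative (assuming ID is unique):
df['UniqueFileName'] = df['ID'].astype(str) + df['Lower File Name']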
This should be what you are looking for, but I'm not sure if there's an easier way:
In [5]: df.groupby(['locID','userId']).last().groupby(level='locID').size()
Out[5]:
locID
loc1 3
loc2 1
loc3 2
loc4 1
dtype: int64
Taking the last of each group will remove duplicates.
There's a Series (groupby) method just for this: nunique.
In [11]: df # Note the duplicated row I appended at the end
Out[11]:
userID locationID
0 1 loc1
1 1 loc2
2 1 loc3
3 2 loc1
4 3 loc4
5 3 loc3
6 3 loc1
7 3 loc1
In [12]: g = df.groupby('locationID')
In [13]: g['userID'].nunique()
Out[13]:
locationID
loc1 3
loc2 1
loc3 2
loc4 1
dtype: int64
You need duplicated with the subset parameter to specify the columns to check and keep=False to mark all duplicates, then filter with the resulting boolean mask:
df = df[df.duplicated(subset=['val1','val2'], keep=False)]
print (df)
id val1 val2
0 1 1.1 2.2
1 1 1.1 2.2
3 3 8.8 6.2
4 4 1.1 2.2
5 5 8.8 6.2
Detail:
print (df.duplicated(subset=['val1','val2'], keep=False))
0 True
1 True
2 False
3 True
4 True
5 True
dtype: bool
Another method is to compute the size of each group and only keep the rows whose group size is greater than 1.
msk = df.groupby(['val1', 'val2'])['val1'].transform('size') > 1
df1 = df[msk]
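Both variants can be checked against an assumed input like the following (row 2 holds hypothetical non-duplicated values, since it does not appear in the output above):
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 2, 3, 4, 5],
                   'val1': [1.1, 1.1, 2.1, 8.8, 1.1, 8.8],
                   'val2': [2.2, 2.2, 3.3, 6.2, 2.2, 6.2]})

print(df[df.duplicated(subset=['val1', 'val2'], keep=False)])
print(df[df.groupby(['val1', 'val2'])['val1'].transform('size') > 1])  # same rows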

The best way would be to use drop_duplicates(). If you have a larger DataFrame and only want those two columns checked, set subset to the list of columns you want checked.
df = df.drop_duplicates()
or
df = df.drop_duplicates(subset=['userid', 'itemid'])
To avoid reassignment, use inplace=True:
df.drop_duplicates(inplace=True)
This is the same as
df = df.drop_duplicates()
Using groupby.agg
yourdf=df.groupby('id',as_index=False).agg({'interest':','.join,'location':'first'})
yourdf
Out[140]:
id interest location
0 1 A,B X
1 2 A,D Y
2 3 C Z
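With an assumed input such as the following (not shown in the original answer), the call above reproduces that result:
import pandas as pd

df = pd.DataFrame({'id':       [1, 1, 2, 2, 3],
                   'interest': ['A', 'B', 'A', 'D', 'C'],
                   'location': ['X', 'X', 'Y', 'Y', 'Z']})

yourdf = df.groupby('id', as_index=False).agg({'interest': ','.join, 'location': 'first'})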
A somewhat clumsy but working solution. Quite similar to what Wen-Ben proposed, except that it works with an arbitrary number of columns, sorts the items before aggregation, and also aggregates locations.
result = df.groupby('id').apply(
    lambda x: pd.Series({name: ','.join(sorted(set(x[name]))) for name in x})
).reset_index()
# id interest location
#0 1 A,B X
#1 2 A,D Y
#2 3 C Z
You're looking for groupby and nunique:
df.groupby('cuisine', sort=False).name.nunique().to_frame('count')
count
cuisine
Chinese 1
Indian 2
French 2
Will return the count of unique items per group.
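For reference, an assumed dataset that would produce these counts (the names are hypothetical; what matters is one distinct name for Chinese and two each for Indian and French). The same frame also works with the crosstab approach below:
import pandas as pd

df = pd.DataFrame({'name':    ['Ping', 'Raj', 'Priya', 'Marie', 'Jean', 'Raj'],
                   'cuisine': ['Chinese', 'Indian', 'Indian', 'French', 'French', 'Indian']})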
Using crosstab
pd.crosstab(df.name,df.cuisine).ne(0).sum()
Out[550]:
cuisine
Chinese 1
French 2
Indian 2
dtype: int64
In another case, when you have a dataset with several duplicated column names and you don't want to select them separately, use:
df.groupby(by=df.columns, axis=1).sum()
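A small sketch of what that looks like, using a hypothetical frame with two columns that share the name 'sales'. Note that axis=1 in groupby is deprecated in recent pandas, where df.T.groupby(level=0).sum().T gives the same result:
import pandas as pd

df = pd.DataFrame([[1, 2, 10], [3, 4, 20]], columns=['sales', 'sales', 'qty'])
summed = df.groupby(by=df.columns, axis=1).sum()
# summed has a single 'sales' column holding 1+2 and 3+4, plus 'qty' unchanged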
You may use
df2 = df.groupby(['address']).sum()
or
df2 = df.groupby(['address']).agg('sum')
If there are columns other than balances for which you want to take only the first or the max value, or compute the mean instead of the sum, you can do the following:
d = {'address': ["A", "A", "B"], 'balances': [30, 40, 50], 'sessions': [2, 3, 4]}
df = pd.DataFrame(data=d)
df2 = df.groupby(['address']).agg({'balances': 'sum', 'sessions': 'mean'})
That outputs
balances sessions
address
A 70 2.5
B 50 4.0
You may add as_index=False to the groupby arguments to get:
address balances sessions
0 A 70 2.5
1 B 50 4.0
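For reference, that call would be:
df2 = df.groupby(['address'], as_index=False).agg({'balances': 'sum', 'sessions': 'mean'})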