You don't need groupby to drop duplicates based on a few columns; you can specify a subset instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
Answer from user2285236 on Stack Overflow
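As a minimal, self-contained sketch of the same pattern (the date, cid, and val data below are made up for illustration):

```python
import pandas as pd

# Toy frame with repeated (date, cid) pairs; the values are invented.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid":  [1, 1, 2, 1, 3],
    "val":  [10, 11, 12, 13, 14],
})

# Keep the first row of each (date, cid) pair, then count unique cids per date.
df2 = df.drop_duplicates(["date", "cid"])
counts = df2.groupby("date").cid.size()
# counts: 2005 -> 2, 2006 -> 2
```

After deduplication, size() per date equals the number of distinct cids for that date, which is exactly what a groupby-based deduplication would have produced.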

1. groupby.head(1)

The relevant groupby method for dropping duplicates within each group is groupby.head(1). Passing 1 keeps only the first row of each date-cid pair.

df1 = df.groupby(['date', 'cid']).head(1)

2. duplicated() is more flexible

Another method is to use duplicated() to create a boolean mask and filter.

df3 = df[~df.duplicated(['date', 'cid'])]

An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:

df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
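A runnable sketch of this chaining, with a hypothetical state column and made-up rows:

```python
import pandas as pd

# Invented data: state is assumed to exist alongside date and cid.
df = pd.DataFrame({
    "date":  [2005, 2005, 2005, 2006],
    "cid":   [1, 1, 2, 2],
    "state": ["NV", "NV", "CA", "NV"],
})

# Combine the deduplication mask with any other boolean filter in one step.
df_nv = df[df["state"].eq("NV") & ~df.duplicated(["date", "cid"])]
# df_nv keeps the first (2005, 1) row and the (2006, 2) row, both in NV.
```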

3. groupby.sample(1)

Another way to select a unique row from each group is groupby.sample(). Unlike the previous methods, it selects a row from each group at random (whereas the others keep the first row of each group).

df4 = df.groupby(['date', 'cid']).sample(n=1)
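A small sketch with invented data (groupby.sample requires pandas 1.1 or later, as far as I know). Whichever rows are drawn, each (date, cid) pair appears exactly once afterwards:

```python
import pandas as pd

df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid":  [1, 1, 2, 1, 1],
    "val":  [0, 1, 2, 3, 4],
})

# One randomly chosen row per (date, cid) group: 3 groups -> 3 rows.
df4 = df.groupby(["date", "cid"]).sample(n=1)

# Every (date, cid) pair now occurs exactly once, regardless of which
# row was sampled from each group.
assert not df4.duplicated(["date", "cid"]).any()
```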

You can verify that df1, df2 (ayhan's output above) and df3 all produce the same output, and that df4 produces an output where size and nunique of cid match for each date (as required in the OP). In short, the following returns True.

w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z)   # True

and w, x, y, z all look like the following:

       size  nunique
date        
2005      7        3
2006    237       10
2007   3610      227
2008   1318       52
2009   2664      142
2010    997       57
2011   6390      219
2012   2904       99
2013   7875      238
2014   3979      146
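The equivalence of the three deterministic methods can also be checked directly on a toy frame (the data below is made up): all three keep the same first row of each (date, cid) pair, with the original index.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid":  [1, 1, 2, 3, 3],
    "val":  [0, 1, 2, 3, 4],
})

df1 = df.groupby(["date", "cid"]).head(1)   # first row per group
df2 = df.drop_duplicates(["date", "cid"])   # first occurrence per subset
df3 = df[~df.duplicated(["date", "cid"])]   # boolean-mask equivalent

# All three return identical frames, duplicates' first rows at index 0, 2, 3.
assert df1.equals(df2) and df2.equals(df3)
```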