You don't need groupby to drop duplicates based on a few columns, you can specify a subset instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
Answer from user2285236 on Stack Overflow
Top answer
1 of 2
55
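The question's df isn't shown, so here is a minimal runnable sketch of the accepted approach on made-up data (the column names date/cid are from the answer; the values are stand-ins):

```python
import pandas as pd

# Toy frame standing in for the question's df (not shown in the original).
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid":  ["a", "a", "b", "a", "a"],
})

# Keep one row per (date, cid) pair, then count unique cids per date.
df2 = df.drop_duplicates(["date", "cid"])
counts = df2.groupby("date").cid.size()
print(counts)
# date
# 2005    2
# 2006    1
```

Because every surviving row has a distinct (date, cid) pair, size per date equals the number of unique cids per date.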

2 of 2
5

1. groupby.head(1)

The relevant groupby method to drop duplicates in each group is groupby.head(1). Note that it is important to pass 1 explicitly (the default is n=5) so that only the first row of each date-cid pair is kept.

df1 = df.groupby(['date', 'cid']).head(1)
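On a small made-up frame (the original df isn't shown in the question), you can check that groupby.head(1) keeps exactly the rows that drop_duplicates keeps with its default keep="first":

```python
import pandas as pd

# Stand-in data; assumed columns date/cid plus a payload column.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006],
    "cid":  ["a", "a", "b", "a"],
    "val":  [1, 2, 3, 4],
})

# First row of each (date, cid) group, in the original row order.
df1 = df.groupby(["date", "cid"]).head(1)

# Same rows as drop_duplicates with keep="first" (the default).
assert df1.equals(df.drop_duplicates(["date", "cid"]))
```

Unlike most groupby reductions, head() does not aggregate: it returns a filtered view of the original rows with their index intact.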

2. duplicated() is more flexible

Another method is to use duplicated() to create a boolean mask and filter.

df3 = df[~df.duplicated(['date', 'cid'])]

An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:

df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
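A runnable sketch of the chained mask, on made-up data (the 'state' column and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with a hypothetical 'state' column to illustrate chaining.
df = pd.DataFrame({
    "date":  [2005, 2005, 2005, 2006],
    "cid":   ["a", "a", "b", "a"],
    "state": ["NV", "NV", "CA", "NV"],
})

# Unique (date, cid) rows restricted to Nevada, in one boolean expression.
# Note: duplicated() is evaluated on the FULL frame, so a pair whose first
# occurrence is in another state will not appear in the NV result.
df_nv = df[df["state"].eq("NV") & ~df.duplicated(["date", "cid"])]
print(df_nv)
```

Here rows 0 and 3 survive: they are in NV and each is the first occurrence of its (date, cid) pair.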

3. groupby.sample(1)

Another method to select a unique row from each group is to use groupby.sample(). Unlike the previous methods, it selects a row from each group randomly (whereas the others keep the first row of each group).

df4 = df.groupby(['date', 'cid']).sample(n=1)
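A minimal sketch on stand-in data; random_state (an optional parameter of sample) makes the random draw repeatable:

```python
import pandas as pd

# Stand-in data; the question's df is not shown.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006],
    "cid":  ["a", "a", "b", "a"],
    "val":  [1, 2, 3, 4],
})

# One randomly chosen row per (date, cid) group; seeded for repeatability.
df4 = df.groupby(["date", "cid"]).sample(n=1, random_state=0)

# Exactly one row per group, but not necessarily the first one.
assert df4.groupby(["date", "cid"]).size().eq(1).all()
```

Note that groupby.sample() requires pandas 1.1 or newer.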

You can verify that df1, df2 (ayhan's output) and df3 all produce the same output, and that df4 produces an output where size and nunique of cid match for each date (as required in the OP). In short, the following returns True.

w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z)   # True

and w, x, y, z all look like the following:

       size  nunique
date        
2005      7        3
2006    237       10
2007   3610      227
2008   1318       52
2009   2664      142
2010    997       57
2011   6390      219
2012   2904       99
2013   7875      238
2014   3979      146
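The equivalence check above can be run end to end on a toy frame (all variable names are from the answer; the data is a stand-in, so the per-date numbers differ from the table above):

```python
import pandas as pd

# Stand-in data; the original df is not shown in the question.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid":  ["a", "a", "b", "a", "b"],
})

df1 = df.groupby(["date", "cid"]).head(1)              # first row per group
df2 = df.drop_duplicates(["date", "cid"])              # accepted answer
df3 = df[~df.duplicated(["date", "cid"])]              # boolean mask
df4 = df.groupby(["date", "cid"]).sample(n=1, random_state=0)  # random row

# size and nunique of cid per date agree across all four results.
w, x, y, z = [d.groupby("date")["cid"].agg(["size", "nunique"])
              for d in (df1, df2, df3, df4)]
print(w.equals(x) and w.equals(y) and w.equals(z))  # True
```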