This indicates df is None when you called drop_duplicates(). Something in your code set it to None before the excerpt you've shown; a common cause is assigning the result of a call with inplace=True, which returns None. (Answer from RandomCodingStuff on reddit.com)
Top answer
1 of 2
55

You don't need groupby to drop duplicates based on a few columns, you can specify a subset instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
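As a side note, drop_duplicates keeps the first occurrence by default; keep='last' retains the last one instead (and keep=False drops all duplicated rows). A minimal sketch on toy data, not the OP's frame:

```python
import pandas as pd

# Two rows with the same (date, cid) pair but different values
df = pd.DataFrame({"date": [2005, 2005], "cid": [1, 1], "val": [10, 11]})

# Default keep='first' retains the earlier row; keep='last' the later one
first = df.drop_duplicates(["date", "cid"])
last = df.drop_duplicates(["date", "cid"], keep="last")
print(first["val"].tolist(), last["val"].tolist())  # [10] [11]
```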
2 of 2
5

1. groupby.head(1)

The relevant groupby method to drop duplicates in each group is groupby.head(1). Passing 1 is important: head() defaults to 5 rows per group, and 1 selects only the first row of each date-cid pair.

df1 = df.groupby(['date', 'cid']).head(1)
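A quick sketch on toy data (not the OP's frame) showing that head(1) keeps the first row of each group while preserving the original row order and index:

```python
import pandas as pd

df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006],
    "cid":  [1, 1, 2, 1],
    "val":  [10, 11, 12, 13],
})

# One row survives per (date, cid) pair, in the original row order
df1 = df.groupby(["date", "cid"]).head(1)
print(df1["val"].tolist())  # [10, 12, 13]
```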

2. duplicated() is more flexible

Another method is to use duplicated() to create a boolean mask and filter.

df3 = df[~df.duplicated(['date', 'cid'])]

An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:

df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]

3. groupby.sample(1)

Another method to select a unique row from each group is to use groupby.sample(). Unlike the previous methods, it selects a row from each group at random (whereas the others keep only the first row from each group).

df4 = df.groupby(['date', 'cid']).sample(n=1)
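Because sample() draws randomly, repeated runs can keep different rows; pass random_state if you need a reproducible pick. A toy sketch (not the OP's data):

```python
import pandas as pd

df = pd.DataFrame({"date": [2005] * 3, "cid": [1] * 3, "val": [10, 11, 12]})

# Exactly one row survives per (date, cid) group; random_state fixes which one
df4 = df.groupby(["date", "cid"]).sample(n=1, random_state=0)
print(len(df4))  # 1
```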

You can verify that df1, df2 (ayhan's output) and df3 all produce exactly the same output, and that df4 produces an output where size and nunique of cid match for each date (as required in the OP). In short, the following returns True.

w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z)   # True

and w, x, y, z all look like the following:

       size  nunique
date        
2005      7        3
2006    237       10
2007   3610      227
2008   1318       52
2009   2664      142
2010    997       57
2011   6390      219
2012   2904       99
2013   7875      238
2014   3979      146
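The equivalence of methods 1 and 2 with drop_duplicates can also be checked end to end on a small frame; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid":  [1, 1, 2, 1, 1],
})

df1 = df.groupby(["date", "cid"]).head(1)   # method 1
df2 = df.drop_duplicates(["date", "cid"])   # ayhan's answer
df3 = df[~df.duplicated(["date", "cid"])]   # method 2

# All three keep the first row of each (date, cid) pair
print(df1.equals(df2) and df2.equals(df3))  # True
```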
Top answer
1 of 3
2

You should use the method drop_duplicates from pandas.

The following should solve your problem.

Your code:

import pandas as pd

# renamed from "id" to avoid shadowing the built-in id()
ids = [2000, 2001, 2001, 3000, 2000, 3000, 3300, 3300, 3300, 3300]
jtitle = ['job1', 'job2', 'job1', 'job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6']
date = ['01/01/2021', '17/02/2018', '17/02/2021', '01/01/2021', '25/03/2011',
        '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021']

data = pd.DataFrame(data=zip(ids, jtitle, date), columns=['id', 'jtitle', 'date'])
# convert to datetime objects
data.date = pd.to_datetime(data.date, dayfirst=True)


Solution:

# keep each employee's latest and earliest record
latest = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='first').copy()
prev_date = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='last').copy()

# calculate the difference in days, aligning on id; positional subtraction
# with .values can pair the wrong rows if the two frames differ in row order
latest['days'] = latest['date'] - latest['id'].map(prev_date.set_index('id')['date'])
print(latest)

Output:

  id jtitle       date      days
3300   job5 2022-01-22  134 days
2001   job1 2021-02-17 1096 days
2000   job1 2021-01-01 3570 days
3000   job3 2021-01-01 7356 days
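The same per-id span can also be computed in one pass with a single groupby over the earliest and latest dates; a sketch reusing the answer's sample data (with the id list renamed to avoid shadowing the built-in):

```python
import pandas as pd

ids = [2000, 2001, 2001, 3000, 2000, 3000, 3300, 3300, 3300, 3300]
jtitle = ['job1', 'job2', 'job1', 'job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6']
date = ['01/01/2021', '17/02/2018', '17/02/2021', '01/01/2021', '25/03/2011',
        '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021']

data = pd.DataFrame(zip(ids, jtitle, date), columns=['id', 'jtitle', 'date'])
data['date'] = pd.to_datetime(data['date'], dayfirst=True)

# Earliest and latest date per id, and the span between them
span = data.groupby('id')['date'].agg(['min', 'max'])
span['days'] = span['max'] - span['min']
print(span['days'])
```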

2 of 3
2

Alternative solution with diff and sum.

data['days'] = data.sort_values('date').groupby('id').date.diff()
data = data.groupby(['id', 'jtitle']).agg({'days': 'sum', 'date': 'first'}).reset_index()

# to filter to only more than 0 days
data[data.days.dt.days > 0]

Result

     id jtitle      days       date
0  2000   job1 3570 days 2021-01-01
1  2001   job1 1096 days 2021-02-17
2  3000   job3 7356 days 2021-01-01
3  3300   job5  134 days 2022-01-22
Top answer
1 of 7
30

Try doing this:

week_grouped = df.groupby('week')
week_grouped.sum().reset_index().to_csv('week_grouped.csv')

That'll write the entire dataframe to the file. If you only want those two columns then,

week_grouped = df.groupby('week')
week_grouped.sum().reset_index()[['week', 'count']].to_csv('week_grouped.csv')

Here's a line by line explanation of the original code:

# This creates a "groupby" object (not a dataframe object) 
# and you store it in the week_grouped variable.
week_grouped = df.groupby('week')

# This instructs pandas to sum up all the numeric type columns in each 
# group. This returns a dataframe where each row is the sum of the 
# group's numeric columns. You're not storing this dataframe in your 
# example.
week_grouped.sum() 

# Here you're calling the to_csv method on a groupby object... but
# that object type doesn't have that method. Dataframes have that method. 
# So we should store the previous line's result (a dataframe) into a variable 
# and then call its to_csv method.
week_grouped.to_csv('week_grouped.csv')

# Like this:
summed_weeks = week_grouped.sum()
summed_weeks.to_csv('...')

# Or with less typing simply
week_grouped.sum().to_csv('...')
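A self-contained sketch on toy data (the OP's real frame has more columns): calling to_csv with no path returns the CSV text, which makes the fix easy to check without touching the filesystem.

```python
import pandas as pd

df = pd.DataFrame({"week": [1, 1, 2], "count": [10, 20, 30]})

# sum() on the groupby returns a DataFrame, which does have to_csv
csv_text = df.groupby("week").sum().reset_index().to_csv(index=False)
print(csv_text)
```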
2 of 7
9

Iterating over a groupby yields key, value pairs, where the key is the identifier of the group and the value is the group itself, i.e. the subset of the original df that matched the key.

In your example, week_grouped = df.groupby('week') is a set of groups (a pandas.core.groupby.DataFrameGroupBy object), which you can explore in detail as follows:

for k, gr in week_grouped:
    # do your stuff instead of print
    print(k)
    print(type(gr)) # This will output <class 'pandas.core.frame.DataFrame'>
    print(gr)
    # You can save each 'gr' in a csv as follows
    gr.to_csv('{}.csv'.format(k))

Alternatively, you can compute an aggregation function on your grouped object:

result = week_grouped.sum()
# This will be already one row per key and its aggregation result
result.to_csv('result.csv') 

In your example you need to assign the function's result to a variable: these pandas methods return a new object rather than modifying the original in place.

some_variable = week_grouped.sum() 
some_variable.to_csv('week_grouped.csv') # This will work

Either way, result.csv and week_grouped.csv are meant to contain the same data.
