df = df.drop_duplicates(subset=None, keep='first', inplace=False)
raises:
AttributeError: 'NoneType' object has no attribute 'drop_duplicates'
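That traceback means df is already None before drop_duplicates is called. The usual cause (an assumption here, since the preceding code isn't shown) is assigning the None return value of an earlier inplace=True call back to df. A minimal sketch that reproduces the error:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2]})
# inplace=True mutates the frame and returns None, so this rebinds df to None
df = df.drop_duplicates(inplace=True)
# the next call then fails with the AttributeError above
df = df.drop_duplicates(subset=None, keep='first', inplace=False)
The fix is to either keep the assignment and drop inplace=True, or keep inplace=True and drop the assignment.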
You don't need groupby to drop duplicates based on a few columns; you can specify a subset instead:
df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]:
date
2005 3
2006 10
2007 227
2008 52
2009 142
2010 57
2011 219
2012 99
2013 238
2014 146
dtype: int64
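The OP's frame isn't shown, so as a stand-in for running the snippets in this answer, here is a minimal hypothetical df with the same columns (its counts will not match the outputs printed here):
import pandas as pd
df = pd.DataFrame({
    'date':  [2005, 2005, 2005, 2006],
    'cid':   [1, 1, 2, 1],
    'state': ['NV', 'NV', 'CA', 'NV'],
})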
1. groupby.head(1)
The relevant groupby method for dropping duplicates in each group is groupby.head(1). Passing 1 is important: head defaults to the first 5 rows per group, whereas n=1 keeps only the first row of each date-cid pair.
df1 = df.groupby(['date', 'cid']).head(1)
2. duplicated() is more flexible
Another method is to use duplicated() to create a boolean mask and filter.
df3 = df[~df.duplicated(['date', 'cid'])]
An advantage of this method over drop_duplicates() is that it can be combined with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:
df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
3. groupby.sample(1)
Another method to select a unique row from each group is groupby.sample(). Unlike the previous methods, it selects a row from each group at random (whereas the others keep only the first row of each group).
df4 = df.groupby(['date', 'cid']).sample(n=1)
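If you need the random choice to be repeatable, groupby.sample accepts a random_state seed:
df4 = df.groupby(['date', 'cid']).sample(n=1, random_state=0)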
You can verify that df1, df2 (ayhan's output) and df3 all produce the same output, and that df4 produces an output where the size and nunique of cid match for each date (as required in the OP). In short, the following returns True:
w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z) # True
and w, x, y, z all look like the following:
size nunique
date
2005 7 3
2006 237 10
2007 3610 227
2008 1318 52
2009 2664 142
2010 997 57
2011 6390 219
2012 2904 99
2013 7875 238
2014 3979 146
@QuangHoang provided the simplest version in the comments:
df.drop_duplicates(['ticker', 'year'])
Alternatively, you can use .groupby twice, inside two .applys:
df.groupby("ticker", group_keys=False).apply(lambda x:
x.groupby("year", group_keys=False).apply(lambda x: x.drop_duplicates(['year']))
)
Alternatively, you can use the .duplicated method:
df.groupby('ticker', group_keys=False).apply(
    lambda x: x[~x['year'].duplicated(keep='first')]
)
You can sort the values first and then use groupby.tail:
df.sort_values('return').groupby(['ticker','year']).tail(1)
ticker year return
0 aapl 1999 1
1 aapl 2000 3
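The question's frame isn't reproduced in this answer; one hypothetical df consistent with the output above is:
import pandas as pd
df = pd.DataFrame({
    'ticker': ['aapl', 'aapl', 'aapl'],
    'year':   [1999, 2000, 2000],
    'return': [1, 3, 2],
})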
DataFrames do not have that method; columns in DataFrames do:
df['A'].unique()
Or, to get the names with the number of observations (using the DataFrame given by closedloop):
df.groupby('person').person.count()
Out[80]:
person
0 2
1 3
Name: person, dtype: int64
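Equivalently, Series.value_counts gives the same counts in a single call (sorted by count rather than by label):
df['person'].value_counts()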
Rather than removing duplicates during the pivot-table step, use df.drop_duplicates() beforehand to selectively drop duplicates.
For example, if you are pivoting with index='c0' and columns='c1', this one extra step yields the correct counts.
In this example the 5th row is a duplicate of the 4th (ignoring the non-pivoted c_other column):
import pandas as pd
data = {'c0': [0, 1, 0, 1, 1], 'c1': [0, 0, 1, 1, 1], 'person': [0, 0, 1, 1, 1], 'c_other': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df2 = df.drop_duplicates(subset=['c0', 'c1', 'person'])
pd.pivot_table(df2, index='c0', columns='c1', values='person', aggfunc='count')
This correctly outputs
c1 0 1
c0
0 1 1
1 1 1
You should use the pandas method drop_duplicates.
The following should solve your problem.
Your code:
import pandas as pd
id = [2000, 2001, 2001, 3000, 2000, 3000, 3300, 3300, 3300, 3300]
jtitle = ['job1', 'job2', 'job1', 'job3', 'job3', 'job2', 'job5', 'job5', 'job5', 'job6']
date = ['01/01/2021', '17/02/2018', '17/02/2021', '01/01/2021', '25/03/2011', '11/11/2000', '22/01/2022', '15/12/2021', '11/11/2021', '10/09/2021']
data = pd.DataFrame(data=zip(id, jtitle, date), columns=['id', 'jtitle', 'date'])
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
Solution:
# subset employees by id, sort by date, and drop duplicates
latest = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='first').copy()
prev_date = data.sort_values('date', ascending=False).drop_duplicates(subset=['id'], keep='last').copy()
# calculate the difference in days; subtracting .values is positional,
# so it assumes both frames list the ids in the same row order
latest['days'] = latest['date'].values - prev_date['date'].values
print(latest)
Output:
id jtitle date days
3300 job5 2022-01-22 134 days
2001 job1 2021-02-17 1096 days
2000 job1 2021-01-01 3570 days
3000 job3 2021-01-01 7356 days
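The positional subtraction works here because both frames happen to list the ids in the same order. If you want the alignment to be explicit, a groupby sketch over the same data computes the per-id span directly:
# earliest and latest date per id, aligned by construction
span = data.groupby('id')['date'].agg(earliest='min', latest='max').reset_index()
span['days'] = span['latest'] - span['earliest']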
Alternative solution with diff and sum.
data['days'] = data.sort_values('date').groupby('id').date.diff()
data = data.groupby(['id', 'jtitle']).agg({'days': 'sum', 'date': 'first'}).reset_index()
# keep only rows with more than 0 days
data[data.days.dt.days > 0]
Result
id jtitle days date
0 2000 job1 3570 days 2021-01-01
1 2001 job1 1096 days 2021-02-17
2 3000 job3 7356 days 2021-01-01
3 3300 job5 134 days 2022-01-22
Try doing this:
week_grouped = df.groupby('week')
week_grouped.sum().reset_index().to_csv('week_grouped.csv')
That'll write the entire dataframe to the file. If you only want those two columns, then:
week_grouped = df.groupby('week')
week_grouped.sum().reset_index()[['week', 'count']].to_csv('week_grouped.csv')
Here's a line by line explanation of the original code:
# This creates a "groupby" object (not a dataframe object)
# and you store it in the week_grouped variable.
week_grouped = df.groupby('week')
# This instructs pandas to sum up all the numeric type columns in each
# group. This returns a dataframe where each row is the sum of the
# group's numeric columns. You're not storing this dataframe in your
# example.
week_grouped.sum()
# Here you're calling the to_csv method on a groupby object... but
# that object type doesn't have that method. Dataframes have that method.
# So we should store the previous line's result (a dataframe) into a variable
# and then call its to_csv method.
week_grouped.to_csv('week_grouped.csv')
# Like this:
summed_weeks = week_grouped.sum()
summed_weeks.to_csv('...')
# Or with less typing simply
week_grouped.sum().to_csv('...')
Iterating over a groupby result yields key, value pairs, where the key is the identifier of the group and the value is the group itself, i.e. the subset of the original df that matched the key.
In your example, week_grouped = df.groupby('week') is a set of groups (a pandas.core.groupby.DataFrameGroupBy object), which you can explore in detail as follows:
for k, gr in week_grouped:
    # do your stuff instead of print
    print(k)
    print(type(gr))  # This will output <class 'pandas.core.frame.DataFrame'>
    print(gr)
    # You can save each 'gr' in a csv as follows
    gr.to_csv('{}.csv'.format(k))
Alternatively, you can compute an aggregation function on the grouped object:
result = week_grouped.sum()
# This will be already one row per key and its aggregation result
result.to_csv('result.csv')
In your example you need to assign the result to a variable: the aggregation returns a new DataFrame, and the groupby object itself has no to_csv method.
some_variable = week_grouped.sum()
some_variable.to_csv('week_grouped.csv') # This will work
Note that result.csv and week_grouped.csv end up with the same content.