You don't need groupby to drop duplicates based on a few columns; you can pass a subset of columns to drop_duplicates instead:
df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]:
date
2005 3
2006 10
2007 227
2008 52
2009 142
2010 57
2011 219
2012 99
2013 238
2014 146
dtype: int64
Answer from user2285236 on Stack Overflow
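As a minimal sketch of the subset approach, here is a self-contained version on hypothetical toy data (the `value` column and the numbers are made up for illustration; only `date` and `cid` come from the question). It also shows the `keep` parameter, which controls which row of each duplicate group survives.

```python
import pandas as pd

# Hypothetical toy data: the same cid can appear several times on a date.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid": [1, 1, 2, 3, 3],
    "value": [10, 11, 12, 13, 14],
})

# Keep the first row of each (date, cid) pair (keep="first" is the default).
first = df.drop_duplicates(subset=["date", "cid"])

# keep="last" keeps the last row of each pair instead.
last = df.drop_duplicates(subset=["date", "cid"], keep="last")

# After deduplication, the row count per date equals the distinct cid count.
counts = first.groupby("date")["cid"].agg(["size", "nunique"])
print(counts)
```

If `size` and `nunique` agree for every date, each cid appears at most once per date.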
1. groupby.head(1)
The relevant groupby method to drop duplicates in each group is groupby.head(1). Note that 1 must be passed explicitly to select only the first row of each date-cid pair, because head() defaults to the first 5 rows of each group.
df1 = df.groupby(['date', 'cid']).head(1)
2. duplicated() is more flexible
Another method is to use duplicated() to create a boolean mask and filter.
df3 = df[~df.duplicated(['date', 'cid'])]
An advantage of this method over drop_duplicates() is that it can be combined with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:
df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
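To make the mask combination concrete, here is a small sketch on hypothetical data (the rows are invented; the `state` column follows the Nevada example above).

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2005, 2006],
    "cid": [1, 1, 2, 3, 1],
    "state": ["NV", "NV", "CA", "NV", "NV"],
})

# True for every repeat of a (date, cid) pair after its first occurrence.
is_repeat = df.duplicated(["date", "cid"])

# Combine the negated mask with another condition in one boolean filter.
df_nv = df[df["state"].eq("NV") & ~is_repeat]
print(df_nv)
```

One thing to keep in mind: the duplicate mask is computed on the whole frame, before the state condition is applied. If the first row of a pair fails the other condition while a later row passes it, that pair is dropped entirely, so filtering first and deduplicating afterwards may be the safer order in that situation.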
3. groupby.sample(1)
Another method to select a unique row from each group is to use groupby.sample(). Unlike the previous methods, it selects a row from each group at random (whereas the others always keep the first row of each group).
df4 = df.groupby(['date', 'cid']).sample(n=1)
You can verify that df1, df2 (ayhan's output) and df3 all produce the very same output and df4 produces an output where size and nunique of cid match for each date (as required in the OP). In short, the following returns True.
w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z) # True
and w, x, y, z all look like the following:
      size  nunique
date
2005     7        3
2006   237       10
2007  3610      227
2008  1318       52
2009  2664      142
2010   997       57
2011  6390      219
2012  2904       99
2013  7875      238
2014  3979      146
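For anyone who wants to try the comparison without the original data, here is a sketch that runs all four methods on a hypothetical toy frame. The data, the `summarize` helper and the `random_state` value are my own assumptions, and groupby.sample() needs pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy data with repeated (date, cid) pairs.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006, 2006],
    "cid": [1, 1, 2, 3, 3, 3],
    "value": range(6),
})
keys = ["date", "cid"]

df1 = df.groupby(keys).head(1)                       # first row per group
df2 = df.drop_duplicates(keys)                       # subset-based dedup
df3 = df[~df.duplicated(keys)]                       # boolean mask
df4 = df.groupby(keys).sample(n=1, random_state=0)   # random row per group


def summarize(d):
    """Rows per date and distinct cids per date (hypothetical helper)."""
    return d.groupby("date")["cid"].agg(["size", "nunique"])


w, x, y, z = (summarize(d) for d in (df1, df2, df3, df4))
print(w.equals(x) and w.equals(y) and w.equals(z))
print((w["size"] == w["nunique"]).all())
```

The four frames can differ in which row they keep (df4 picks at random), but the per-date summary is the same for all of them on this toy data.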
Running
df = df.drop_duplicates(subset=None, keep='first', inplace=False)
shows:
AttributeError: 'NoneType' object has no attribute 'drop_duplicates'
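This error means the name df already refers to None before drop_duplicates is called. The post does not show the earlier code, so the cause here is a guess, but a common one is assigning the result of a call made with inplace=True, which returns None. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})

# Methods called with inplace=True modify the frame and return None,
# so assigning the result rebinds the name to None.
broken = df.copy()
broken = broken.drop_duplicates(inplace=True)
print(broken is None)  # True

# Either use the return value (the default is inplace=False) ...
deduped = df.drop_duplicates()

# ... or modify in place without reassigning.
in_place = df.copy()
in_place.drop_duplicates(inplace=True)
print(len(deduped), len(in_place))
```

Checking print(type(df)) just before the failing line should confirm whether this is what happened.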
DataFrames do not have a unique() method; columns in DataFrames (Series) do:
df['A'].unique()
Or, to get the names with the number of observations (using the DataFrame given by closedloop):
>>> df.groupby('person').person.count()
Out[80]:
person
0 2
1 3
Name: person, dtype: int64
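A short sketch of the difference, on a hypothetical frame (the data is made up, chosen so the person counts match the output above):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "person": [0, 0, 1, 1, 1],
    "A": ["x", "y", "x", "y", "y"],
})

# The DataFrame itself has no unique() attribute.
print(hasattr(df, "unique"))

# A single column (a Series) does: distinct values in order of appearance.
uniq = df["A"].unique()

# At the DataFrame level, nunique() gives the distinct count per column.
n_per_col = df.nunique()

# value_counts() is another way to count observations per person.
counts = df["person"].value_counts().sort_index()
print(uniq, counts.to_dict())
```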
Rather than removing duplicates during the pivot table step, use df.drop_duplicates() beforehand to drop only the rows that are duplicated on the columns you care about.
For example, if you are pivoting with index='c0' and columns='c1', deduplicating on those columns plus the value column first yields the correct counts.
In this example the 5th row is a duplicate of the 4th (ignoring the non-pivoted c_other column):
import pandas as pd
data = {'c0':[0,1,0,1,1], 'c1':[0,0,1,1,1], 'person':[0,0,1,1,1], 'c_other':[1,2,3,4,5]}
df = pd.DataFrame(data)
df2 = df.drop_duplicates(subset=['c0','c1','person'])
pd.pivot_table(df2, index='c0',columns='c1',values='person', aggfunc='count')
This correctly outputs
c1  0  1
c0
0   1  1
1   1  1
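To see what the deduplication step changes, this sketch pivots the same data with and without it (the `raw` and `dedup` names are mine):

```python
import pandas as pd

data = {'c0': [0, 1, 0, 1, 1], 'c1': [0, 0, 1, 1, 1],
        'person': [0, 0, 1, 1, 1], 'c_other': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Without deduplication the repeated (c0=1, c1=1, person=1) row is counted twice.
raw = pd.pivot_table(df, index='c0', columns='c1',
                     values='person', aggfunc='count')

# With deduplication on the pivoted columns each combination is counted once.
dedup = pd.pivot_table(df.drop_duplicates(subset=['c0', 'c1', 'person']),
                       index='c0', columns='c1',
                       values='person', aggfunc='count')

print(raw.loc[1, 1], dedup.loc[1, 1])
```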
Hello everyone!
I am a newbie at python and I looked up some problems associated with the Data Expo 2009: Airline on time data from the Harvard Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7).
I am currently working on the following question:
- When is the best time of day, day of the week, and time of year to fly to minimize delays?
All libraries are imported and the data is cleaned up (empty columns and duplicate rows are dropped).
What I intend to do is plot a bar chart with "Month" on the x-axis and "ArrDelay" (arrival delay) on the y-axis.
My code looks like this (I'm using Jupyter Notebook):
import pandas as pd
dataair = pd.read_csv("/Users/issakovakamilla/Desktop/2000.csv.bz2")
dataair.dropna(how='all', axis=1, inplace=True)
dataair
import matplotlib.pyplot as plt
df = pd.DataFrame(dataair)
X = list(df.iloc[:, 0])
Y = list(df.iloc[:, 1])
plt.bar(X, Y, color='g')
plt.title("stats")
plt.xlabel("Month")
plt.ylabel("ArrDelay")
plt.show()
Somehow I don't get a plot - it's been executing for 10 minutes now (I get * next to the input cell). Could anyone help me with this?
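One possible explanation, offered as a guess since the data is not shown: plt.bar is being asked to draw one bar per row of a very large file, and iloc[:, 0] and iloc[:, 1] select columns by position, which may not be Month and ArrDelay. A common approach is to aggregate to one value per month first and plot that. A minimal sketch on hypothetical stand-in data (the `flights` frame and the `plot_monthly_delay` helper are my own names; only the column names come from the post):

```python
import pandas as pd

# Hypothetical stand-in for the airline data; only the two columns
# named in the post are used.
flights = pd.DataFrame({
    "Month": [1, 1, 2, 2, 2, 3],
    "ArrDelay": [10.0, 20.0, -5.0, 5.0, None, 30.0],
})

# One value per month: mean arrival delay (missing delays are skipped).
monthly = flights.groupby("Month")["ArrDelay"].mean()
print(monthly)


def plot_monthly_delay(monthly):
    """Draw one bar per month; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt
    plt.bar(monthly.index, monthly.values, color="g")
    plt.title("Mean arrival delay by month")
    plt.xlabel("Month")
    plt.ylabel("Mean ArrDelay")
    plt.show()
```

Calling plot_monthly_delay(monthly) in the notebook then draws at most 12 bars, no matter how many rows the file has. The same pattern should work for day of week or hour of day by grouping on the corresponding column.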