Here 'pct' is a string, but agg needs the variable pct itself (the lambda function) — remove the quotes:
aggs = {'B':pct}
print(df.groupby('A').agg(aggs))
B
A
1 0.333333
4 0.333333
7 0.333333
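For instance, with a hypothetical definition of pct (the original answer does not show it; any lambda computing a per-group fraction fits the output above), the fix looks like this:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [10, 20, 30]})

# Hypothetical pct: each group's share of all rows in df
pct = lambda x: x.count() / len(df)

aggs = {'B': pct}  # pass the function object, not the string 'pct'
print(df.groupby('A').agg(aggs))
```

With one row per group out of three rows total, each group reports 0.333333, matching the output above.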
Answer from jezrael on Stack Overflow
python - How to use .mode with groupby - Stack Overflow
The string 'mode' is not recognized by df.groupby().agg(), but passing the function pd.Series.mode works.
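A minimal sketch of that workaround: hand agg the function object instead of the string name:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2], 'B': [3, 3, 5, 6, 6]})

# agg('mode') raises an error, but the function pd.Series.mode works
print(df.groupby('A')['B'].agg(pd.Series.mode))
```

Note that pd.Series.mode returns all modes when there is a tie, so a group with two equally common values yields an array rather than a scalar.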
Using isin() on grouped data
Hi,
I want to filter based on whether a value is in another column. However, this data needs to be grouped before the isin filter is applied. When I do this I get the error
'SeriesGroupBy' object has no attribute 'isin'
Example explaining what I'm trying to do:
import pandas as pd
dict = {'AttributeName': {0: 'John', 1: 'John', 2: 'John', 3: 'John', 4: 'Sally', 5: 'Sally'}, 'Lineage Step': {0: 1, 1: 2, 2: 3, 3: 4, 4:1, 5:2}, 'From Country': {0: 'Spain', 1: 'Scotland', 2: 'England', 3: 'England', 4: 'Scotland', 5:'England'}, 'From Town': {0: 'Madrid', 1: 'Edinburgh', 2: 'London', 3: 'London', 4: 'Edinburgh', 5: 'Manchester'}, 'FromStreet': {0: 'Spanish St', 1: 'Main St', 2: 'Lower St', 3: 'Middle St', 4: 'London St', 5: 'Scotland St'}, 'ToCountry': {0: 'Scotland', 1: 'England', 2: 'England', 3: 'England', 4: 'England', 5: 'England'}, 'ToTown': {0: 'Edinburgh', 1: 'London', 2: 'London', 3: 'London', 4: 'Liverpool', 5: 'London'}, 'ToStreet': {0: 'Lower St', 1: 'Middle St', 2: 'Upper St', 3: 'Upper St', 4: 'new St', 5: 'Old St'}}
sample_data = pd.DataFrame.from_dict(dict)
#example data set. I want to find every unique 'From Country' for both John and Sally. So for John we would just have the first row, where he moves from Spain to Scotland. The second row would be filtered out because Scotland appears in his 'ToCountry' column. Sally would just have the row where 'From Country' is Scotland (Edinburgh). I have tried to do it like this:
sample_grouped = sample_data.groupby('AttributeName')
sample_grouped[~sample_grouped['From Country'].isin(sample_grouped['ToCountry'])]
but I get the error 'SeriesGroupBy' object has no attribute 'isin'
Does anyone know how to use the isin (or comparable) function on grouped by data?
Thanks
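One way to do this (a sketch, not the only approach): run the isin filter inside a groupby-apply, so it operates on each person's plain sub-DataFrame rather than on a SeriesGroupBy. A trimmed-down version of the data above:

```python
import pandas as pd

sample_data = pd.DataFrame({
    'AttributeName': ['John', 'John', 'John', 'John', 'Sally', 'Sally'],
    'From Country': ['Spain', 'Scotland', 'England', 'England', 'Scotland', 'England'],
    'ToCountry': ['Scotland', 'England', 'England', 'England', 'England', 'England'],
})

# Within each group, keep rows whose 'From Country' never appears in that
# person's 'ToCountry' column
filtered = sample_data.groupby('AttributeName', group_keys=False).apply(
    lambda g: g[~g['From Country'].isin(g['ToCountry'])]
)
print(filtered)
```

This keeps John's Spain row and Sally's Scotland row, matching the expected result described above.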
Hello everyone!
I am a newbie at python and I looked up some problems associated with the Data Expo 2009: Airline on time data from the Harvard Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7).
I am currently working on the following question:
- When is the best time of day, day of the week, and time of year to fly to minimize delays?
All libraries are imported and the data is cleaned up (empty columns and duplicate rows are dropped).
What I was intending to do is to plot a bar chart with "Months" on the x-axis and "ArrDelay" (arrival delays) on the y-axis.
My code looks the following way (I'm using jupyter notebook):
import pandas as pd
dataair = pd.read_csv("/Users/issakovakamilla/Desktop/2000.csv.bz2")
dataair.dropna(how='all', axis=1, inplace=True)
dataair
import matplotlib.pyplot as plt
df = pd.DataFrame(dataair)
X = list(df.iloc[:, 0])
Y = list(df.iloc[:, 1])
plt.bar(X, Y, color='g')
plt.title("stats")
plt.xlabel("Month")
plt.ylabel("ArrDelay")
plt.show()
Somehow I don't get a plot — it's been executing for 10 minutes now (I see a * next to the input cell). Could anyone help me with this?
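Not a full answer, but a likely cause: plt.bar is being handed one bar per flight (millions of rows), which is why it hangs. Aggregating first — e.g. mean arrival delay per month — plots instantly. A sketch with stand-in data, assuming the real CSV has 'Month' and 'ArrDelay' columns as in the Data Expo files:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the real dataset; the actual file has these column names
df = pd.DataFrame({'Month': [1, 1, 2, 2, 3], 'ArrDelay': [10, 20, 5, 15, 0]})

# Aggregate first: one bar per month instead of one bar per flight
monthly = df.groupby('Month')['ArrDelay'].mean()

plt.bar(monthly.index, monthly.values, color='g')
plt.title("stats")
plt.xlabel("Month")
plt.ylabel("ArrDelay")
plt.show()
```

The same pattern (groupby then mean) works for day of week and hour of day.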
A quick clarification rather than an answer: the meta parameter is used in the .agg() method, to specify the column data types you expect, best expressed as a zero-length pandas dataframe. Dask will supply dummy data to your function otherwise, to try to guess those types, but this doesn't always work.
The issue you're running into is that the separate stages of the aggregation can't be the same function applied recursively, as in the custom_sum example you're looking at.
I've modified code from this answer, keeping the comments from @user8570642 because they are very helpful. Note that this method works for a list of groupby keys: https://stackoverflow.com/a/46082075/3968619
import pandas as pd
import dask.dataframe as dd

def chunk(s):
    # s is a grouped Series. value_counts creates a multi-index Series
    # like (group, value): count. (For the comments, assume a single
    # grouping column; the implementation handles multiple group columns.)
    return s.value_counts()

def agg(s):
    # s is a grouped multi-index Series. In .apply the full sub-series is
    # passed, multi-index and all. Group on the value level and sum the
    # counts. The result of the lambda is a Series, so the result of the
    # apply is again a multi-index Series like (group, value): count.
    return s.apply(lambda s: s.groupby(level=-1).sum())
    # faster version using pandas internals:
    # s = s._selected_obj
    # return s.groupby(level=list(range(s.index.nlevels))).sum()

def finalize(s):
    # s is a multi-index Series of the form (group, value): count. First
    # manually group on the group part of the index; the lambda receives a
    # sub-series with a multi-index. Next, drop the group part from the
    # index. Finally, take the index with the maximum count, i.e., the mode.
    level = list(range(s.index.nlevels - 1))
    return (
        s.groupby(level=level)
        .apply(lambda s: s.reset_index(level=level, drop=True).idxmax())
    )

max_occurence = dd.Aggregation('mode', chunk, agg, finalize)
chunk counts the values for the groupby object in each partition. agg takes the results from chunk, groups by the original groupby key, and sums the value counts, so that we have the value counts for every group. finalize takes the multi-index series produced by agg and returns the most frequently occurring value of B for each group from Z.
Here's a test case:
df = dd.from_pandas(
    pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3] * 10,
                  "B": [5, 5, 5, 5, 1, 1, 1] * 10,
                  "Z": ['mike', 'amy', 'amy', 'amy', 'chris', 'chris', 'sandra'] * 10}),
    npartitions=10)
res = df.groupby(['Z']).agg({'B': max_occurence}).compute()
print(res)