Update 2022-03
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Thanks to this comment by Paul Rougieux for surfacing it.
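As a quick illustration of why `transform` works here (a toy frame invented for this sketch, not the data from the answer below): `transform('sum')` broadcasts each group's total back to the original rows, so the division lines up index-for-index with the original column.

```python
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'CA', 'WA', 'WA'],
                   'sales': [100, 300, 50, 150]})
# transform('sum') returns a Series aligned with df, holding each row's group total
df['pct'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
print(df['pct'].tolist())  # [25.0, 75.0, 25.0, 75.0]
```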
Original Answer (2014)
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
Returns:
                     sales
state office_id
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508
Answer from exp1orer on Stack Overflow
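As a side note, the same percentages can be computed without `apply` by broadcasting each state's total with `transform` on the outer MultiIndex level (a sketch under the same setup; this variant is not part of the original answer):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# transform('sum') broadcasts each state's total to every row of that state,
# so the division keeps the (state, office_id) MultiIndex intact
state_pcts = 100 * state_office / state_office.groupby(level=0).transform('sum')
# sanity check: each state's percentages add up to 100
print(state_pcts.groupby(level=0).sum())
```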
(This solution is inspired from this article: Understanding the Transform Function in Pandas)
I find the following solution to be the simplest (and probably the fastest), using transformation:
Transformation: While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.
So using transformation, the solution is a one-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
    state  office_id   sales          %
0      AZ          2  195197   9.844309
1      AZ          4  877890  44.274352
2      AZ          6  909754  45.881339
3      CA          1  614752  50.415708
4      CA          3  395340  32.421767
5      CA          5  209274  17.162525
6      CO          1  549430  42.659629
7      CO          3  457514  35.522956
8      CO          5  280995  21.817415
9      WA          2  828238  35.696929
10     WA          4  719366  31.004563
11     WA          6  772590  33.298509
python - Groupby count, then sum and get the percentage - Code Review Stack Exchange
Pandas group by column find percentage of count in each group
Hello,
Using pandas, I am trying to calculate a percentage by row for each subgroup of COL1.
Below is an example of what I want to get:

COL1  COL2    AGG1
A     Test1    30%
      Test 2   70%
B     Test 5   10%
      Test 7   90%

For now, I can get a groupby with count() for each row/subrow, and a subtotal with sidetable, or percentages, but computed over the whole dataframe rather than as 100% within each group:

COL1  COL2      AGG1
A     Test1     13.5
      Test 2    31.5
      Subtotal  45

Is it possible to do this in pure pandas, or do I need to parse/transform the dataframe in Python?
Any tips to help me reach that goal?
Thanks !
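One possible pure-pandas sketch: aggregate the counts, then normalize within each COL1 group with `transform`. The toy data below is invented to reproduce the 30%/70% and 10%/90% example (spaces dropped from the test names, and `n` is a hypothetical raw-count column):

```python
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'B', 'B'],
                   'COL2': ['Test1', 'Test2', 'Test5', 'Test7'],
                   'n':    [3, 7, 1, 9]})
out = df.groupby(['COL1', 'COL2']).agg(AGG1=('n', 'sum'))
# normalize within each COL1 group so every group totals 100%
out['AGG1'] = 100 * out['AGG1'] / out.groupby(level=0)['AGG1'].transform('sum')
print(out)
```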
I think your code is already nearly optimal and Pythonic, but there are a few small things to improve:

- `cluster_count.sum()` returns a Series object, so if you are working with the result outside pandas it is better to specify the column: `cluster_count.char.sum()`. This way you get an ordinary Python integer.
- pandas can operate on columns directly, so instead of using `apply` you can write the arithmetic on the column itself: `cluster_count.char = cluster_count.char * 100 / cluster_sum` (note that this line of code works in place).
Here is the final code:
df = pd.DataFrame({'char': ['a', 'b', 'c', 'd', 'e'], 'cluster': [1, 1, 2, 2, 2]})
cluster_count = df.groupby('cluster').count()
cluster_sum = sum(cluster_count.char)
cluster_count.char = cluster_count.char * 100 / cluster_sum
Edit 1: You can do the magic even without cluster_sum variable, just in one line of code:
cluster_count.char = cluster_count.char * 100 / cluster_count.char.sum()
But I am not sure about its performance (it might recalculate the sum for each group).
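For what it's worth, in the one-line version the right-hand side's `.sum()` is evaluated once before the division, not once per group; a quick sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'char': ['a', 'b', 'c', 'd', 'e'],
                   'cluster': [1, 1, 2, 2, 2]})
cluster_count = df.groupby('cluster').count()
# .sum() runs once here; the division is then a single vectorized operation
cluster_count.char = cluster_count.char * 100 / cluster_count.char.sum()
print(cluster_count.char.tolist())  # [40.0, 60.0]
```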
Just to add in my 2 cents here:
You can approach this with series.value_counts() which has a normalize parameter.
From the docs:
normalize : boolean, default False If True then the object returned will contain the relative frequencies of the unique values.
Using this we can do:
s = df.cluster.value_counts(normalize=True, sort=False).mul(100)  # mul(100) == * 100
s.index.name, s.name = 'cluster', 'percentage_'  # set the index and series names
print(s.to_frame())  # series.to_frame() returns a dataframe
percentage_
cluster
1 40.0
2 60.0
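The same `normalize` flag also works per group via `groupby(...).value_counts`, which yields within-group percentages directly (a small sketch with made-up data, not from the answer above):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['x', 'x', 'x', 'y', 'y'],
                   'val': ['a', 'a', 'b', 'a', 'b']})
# normalize=True makes the frequencies relative within each grp group
pct = df.groupby('grp')['val'].value_counts(normalize=True).mul(100)
print(pct)
```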
How do I calculate the percentage of each subgroup?
| Sex | Survived | Total |
|---|---|---|
| Female | 1 | 233 |
| Female | 0 | 81 |
| Male | 0 | 468 |
| Male | 1 | 109 |
I want to get the percentage of each sub group, like below:
| Sex | Survived | Total | Percentage |
|---|---|---|---|
| Female | 1 | 233 | 74.20% |
| Female | 0 | 81 | 25.80% |
| Male | 0 | 468 | 81.11% |
| Male | 1 | 109 | 18.89% |
I tried the following, but it didn't work:
train_df.groupby('Sex')['Survived'].transform('sum')
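The attempt above broadcasts each sex's survivor sum back to the rows but never divides the group counts by the group totals. One way to finish it (a sketch built on toy data reconstructed from the counts in the table above; `train_df` and its columns are assumed from the question):

```python
import pandas as pd

# toy data matching the counts in the question's table
train_df = pd.DataFrame({'Sex': ['Female'] * 314 + ['Male'] * 577,
                         'Survived': [1] * 233 + [0] * 81 + [0] * 468 + [1] * 109})
counts = train_df.groupby(['Sex', 'Survived']).size()
# divide each (Sex, Survived) count by that sex's total
pcts = 100 * counts / counts.groupby(level=0).transform('sum')
print(pcts.round(2))
```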