Use value_counts with normalize=True:
df['gender'].value_counts(normalize=True) * 100
Each result is a fraction in the range (0, 1]; multiplying by 100 converts it to a percentage.
Answer from coldspeed95 on Stack Overflow
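A minimal sketch of the normalize=True approach on a small made-up frame. The four-row gender column and the names fractions and pct are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

# Illustrative sample data (assumption): 1 x 'M', 3 x 'F'
df = pd.DataFrame({'gender': ['M', 'F', 'F', 'F']})

# normalize=True returns each value's share of the non-null rows
fractions = df['gender'].value_counts(normalize=True)

# Scale the fractions to percentages and round for display
pct = (fractions * 100).round(1)
print(pct)
```

Rounding after scaling keeps the display tidy without changing the underlying fractions.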
If you only need the share of M and F values in the gender column, you can combine value_counts() and count() as follows:
import pandas as pd
df = pd.DataFrame({'gender': ['M', 'M', 'F', 'F', 'F']})
# Percentage calculation
(df['gender'].value_counts() / df['gender'].count()) * 100
Result:
F    60.0
M    40.0
Name: gender, dtype: float64
Or, using groupby:
(df.groupby('gender').size()/df['gender'].count())*100
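One detail worth checking with these two variants is the denominator when the column has missing values. The sketch below uses made-up data with one NaN (an assumption for illustration) and compares count(), which skips missing values, against len(df), which does not:

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): 2 x 'M', 3 x 'F', 1 missing value
df = pd.DataFrame({'gender': ['M', 'M', 'F', 'F', 'F', np.nan]})

# count() ignores NaN, so the percentages are relative to the 5 known values
by_counts = df['gender'].value_counts() / df['gender'].count() * 100
by_groupby = df.groupby('gender').size() / df['gender'].count() * 100

# len(df) includes the NaN row, so the percentages are relative to all 6 rows
by_len = df['gender'].value_counts() / len(df) * 100

print(by_counts)
print(by_len)
```

Which denominator is appropriate depends on whether rows with an unknown value should count toward the total.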
Update 2022-03
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Thanks to this comment by Paul Rougieux for surfacing it.
Original Answer (2014)
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way: group state_office by state and divide the sales column by its per-state sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
# group_keys=False keeps the original (state, office_id) index on newer pandas
state_pcts = state_office.groupby(level=0, group_keys=False).apply(
    lambda x: 100 * x / x.sum())
Returns:
                     sales
state office_id
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508
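To see why the aggregation step comes first, here is a small deterministic sketch in which the same (state, office_id) pair appears more than once. The data and the column name pct_of_state are made up for illustration, and it uses transform('sum') on the aggregated frame instead of apply, which is a swapped-in technique rather than the answer's exact code:

```python
import pandas as pd

# Illustrative data (assumption): CA office 1 appears twice
sales = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'WA', 'WA'],
    'office_id': [1, 1, 2, 1, 2],
    'sales': [100, 300, 600, 250, 750],
})

# Step 1: one row per (state, office_id) with summed sales
state_office = sales.groupby(['state', 'office_id']).agg({'sales': 'sum'})

# Step 2: broadcast each state's total back onto its office rows
state_totals = state_office.groupby(level='state')['sales'].transform('sum')

# Step 3: each office's share of its state's sales, in percent
state_office['pct_of_state'] = 100 * state_office['sales'] / state_totals
print(state_office)
```

Within each state the pct_of_state values add up to 100, which is a quick sanity check for any variant of this calculation.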
(This solution is inspired by this article: https://pbpython.com/pandas_transform.html)
I find the following solution the simplest (and probably the fastest), using transformation:
Transformation: While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.
So, using transformation, the solution is a one-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
   state  office_id   sales          %
0     AZ          2  195197   9.844309
1     AZ          4  877890  44.274352
2     AZ          6  909754  45.881339
3     CA          1  614752  50.415708
4     CA          3  395340  32.421767
5     CA          5  209274  17.162525
6     CO          1  549430  42.659629
7     CO          3  457514  35.522956
8     CO          5  280995  21.817415
9     WA          2  828238  35.696929
10    WA          4  719366  31.004563
11    WA          6  772590  33.298509
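The difference between aggregation and transformation described above can be seen on a tiny frame. This is a sketch with made-up numbers (the data and the names aggregated and transformed are assumptions for illustration):

```python
import pandas as pd

# Illustrative data (assumption): two states, two rows each
df = pd.DataFrame({
    'state': ['CA', 'WA', 'CA', 'WA'],
    'sales': [100, 250, 300, 750],
})

# Aggregation: one value per group (2 rows)
aggregated = df.groupby('state')['sales'].sum()

# Transformation: the group total repeated for every input row (4 rows),
# aligned with df's index so it can be used directly in arithmetic
transformed = df.groupby('state')['sales'].transform('sum')

df['%'] = 100 * df['sales'] / transformed
print(df)
```

Because the transformed totals share the original index, no merge or second groupby is needed before dividing.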
You can get the percentages of each column using a lambda function as follows:
>>> df.iloc[:, 3:].apply(lambda x: x / x.sum())
       y191      y192      y193      y194      y195
0  0.527231  0.508411  0.490517  0.500544  0.480236
1  0.013305  0.014088  0.013463  0.013631  0.013713
2  0.316116  0.324405  0.341373  0.319164  0.323259
3  0.002006  0.002263  0.002678  0.003206  0.002872
4  0.141342  0.150833  0.151969  0.163455  0.179920
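A self-contained sketch of the per-column share calculation. The year columns are reconstructed from the table printed elsewhere in this thread, and the names years, col_share and via_apply are assumptions. It also shows that plain division by the column sums gives the same result as the lambda:

```python
import pandas as pd

# Year columns reconstructed from the output printed in this thread
years = pd.DataFrame({
    'y191': [47052179, 1187385, 28211467, 179000, 12613922],
    'y192': [43361966, 1201557, 27668273, 193000, 12864425],
    'y193': [42736682, 1172941, 29742374, 233338, 13240395],
    'y194': [43196916, 1176366, 27543836, 276639, 14106139],
    'y195': [41751928, 1192173, 28104317, 249688, 15642337],
})

# The lambda form from the answer: each column divided by its own sum
via_apply = years.apply(lambda x: x / x.sum())

# Vectorized form: years.sum() holds one total per column and is
# aligned on the column labels during the division
col_share = years / years.sum()

print(col_share.round(6))
```

Each column of the result sums to 1, so multiplying by 100 gives percentages of the column total.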
Your example does not have any duplicate values for val_code, so I'm unsure how you want your data to appear (i.e. show percent of total in column vs. percent of total for each val_code group).
Get the total for all the columns of interest and then add the percentage column:
In [35]:
total = np.sum(df.loc[:, 'y191':'y195'].values)
df['percent'] = df.loc[:, 'y191':'y195'].sum(axis=1) / total * 100
df
Out[35]:
               country_name  country_code  val_code      y191      y192  \
0  United States of America           231         1  47052179  43361966
1  United States of America           231         1   1187385   1201557
2  United States of America           231         1  28211467  27668273
3  United States of America           231         1    179000    193000
4  United States of America           231         1  12613922  12864425

       y193      y194      y195    percent
0  42736682  43196916  41751928  50.149471
1   1172941   1176366   1192173   1.363631
2  29742374  27543836  28104317  32.483447
3    233338    276639    249688   0.260213
4  13240395  14106139  15642337  15.743237
So np.sum will sum all the values:
In [32]:
total = np.sum(df.loc[:, 'y191':'y195'].values)
total
Out[32]:
434899243
We then call .sum(axis=1)/total * 100 on the cols of interest to sum row-wise, divide by the total and multiply by 100 to get a percentage.
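The steps above can be put together as a runnable sketch. The frame is rebuilt from the printed output, the name year_cols is an assumption, and label-based slicing with .loc is used to select the year columns:

```python
import pandas as pd

# Frame rebuilt from the output printed above
df = pd.DataFrame({
    'country_name': ['United States of America'] * 5,
    'country_code': [231] * 5,
    'val_code': [1] * 5,
    'y191': [47052179, 1187385, 28211467, 179000, 12613922],
    'y192': [43361966, 1201557, 27668273, 193000, 12864425],
    'y193': [42736682, 1172941, 29742374, 233338, 13240395],
    'y194': [43196916, 1176366, 27543836, 276639, 14106139],
    'y195': [41751928, 1192173, 28104317, 249688, 15642337],
})

# Label slices with .loc include both endpoints
year_cols = df.loc[:, 'y191':'y195']

# Grand total over every cell in the year columns
total = year_cols.to_numpy().sum()

# Row-wise sum as a percentage of the grand total
df['percent'] = year_cols.sum(axis=1) / total * 100
print(df[['val_code', 'percent']])
```

Naming the last column explicitly in the slice keeps the calculation stable if the cell is re-run after the percent column has been added.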