There is an even shorter one :)
print(df.groupby('name').describe().unstack(1))
Answer from Andrey Vykhodtsev on Stack Overflow
Nothing beats a one-liner:
In [145]:
print(df.groupby('name').describe().reset_index().pivot(index='name', values='score', columns='level_1'))
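The one-liners above appear to rely on an older pandas layout in which describe() on a groupby stacked the statistics into the row index (hence the unstack and the level_1 column). As a minimal sketch, assuming a recent pandas release where describe() already returns one row per group, selecting the column is enough. The sample frame here is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

# Small sample frame with the same name/score values used on this page.
df = pd.DataFrame({
    "name": ["A"] * 5 + ["B"] * 4,
    "score": [1, 2, 3, 4, 5, 2, 4, 6, 8],
})

# Assumption: a recent pandas release, where describe() on a grouped
# column returns one row per group and one column per statistic.
summary = df.groupby("name")["score"].describe()
print(summary)
```

If this is run on an old pandas version, the result may come back in the stacked layout instead, in which case the unstack variant still applies.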
Define some data
In[1]:
import pandas as pd
import io
data = """
name score
A 1
A 2
A 3
A 4
A 5
B 2
B 4
B 6
B 8
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
print(df)
Out[1]:
  name  score
0    A      1
1    A      2
2    A      3
3    A      4
4    A      5
5    B      2
6    B      4
7    B      6
8    B      8
Solution
A nice approach to this problem uses a generator expression (see footnote) to allow pd.DataFrame() to iterate over the results of groupby, and construct the summary stats dataframe on the fly:
In[2]:
df2 = pd.DataFrame(group.describe().rename(columns={'score': name}).squeeze()
                   for name, group in df.groupby('name'))
print(df2)
Out[2]:
   count  mean       std  min  25%  50%  75%  max
A      5     3  1.581139    1  2.0    3  4.0    5
B      4     5  2.581989    2  3.5    5  6.5    8
Here squeeze removes the length-one column dimension, converting each group's one-column summary DataFrame into a Series.
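To see the squeeze step in isolation, here is a small sketch on a hypothetical one-column frame (the name one_col is only for illustration):

```python
import pandas as pd

# Hypothetical one-column DataFrame, standing in for group.describe().
one_col = pd.DataFrame({"score": [1, 2, 3]})

# squeeze() drops the length-one column axis and returns a Series
# whose name is the former column label.
as_series = one_col.squeeze()
print(type(as_series).__name__)
print(as_series.name)
```

Because the Series keeps the column label as its name, renaming the column to the group name first is what gives each row of the final table its label.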
Footnote: A generator expression has the form (my_function(a) for a in iterator) or, if the iterator yields two-element tuples as groupby does, (my_function(a, b) for a, b in iterator).
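Both forms from the footnote can be tried in plain Python. This is only an illustration; the names squares, pairs and totals are made up for the example.

```python
# Single-variable form: one value per item of the iterator.
squares = list(n * n for n in range(4))
print(squares)  # [0, 1, 4, 9]

# Two-element tuples are unpacked in the for clause, the same way
# (name, group) pairs coming from groupby are.
pairs = [("A", [1, 2, 3]), ("B", [2, 4])]
totals = dict((name, sum(values)) for name, values in pairs)
print(totals)  # {'A': 6, 'B': 6}
```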
You can use groupby.describe:
df.groupby('gender').describe()
Out:
                     age  postTestScore  preTestScore
gender
female count    3.000000       3.000000      3.000000
       mean    53.666667      73.666667     19.333333
       std     18.556221      18.770544     14.571662
       min     36.000000      57.000000      3.000000
       25%     44.000000      63.500000     13.500000
       50%     52.000000      70.000000     24.000000
       75%     62.500000      82.000000     27.500000
       max     73.000000      94.000000     31.000000
male   count    2.000000       2.000000      2.000000
       mean    33.000000      43.500000      3.000000
       std     12.727922      26.162951      1.414214
       min     24.000000      25.000000      2.000000
       25%     28.500000      34.250000      2.500000
       50%     33.000000      43.500000      3.000000
       75%     37.500000      52.750000      3.500000
       max     42.000000      62.000000      4.000000
If you want two separate outputs, you could do the following:
df[df.gender == 'male'].describe()
df[df.gender == 'female'].describe()
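If there are more than two groups, writing one filter per value gets repetitive. One possible sketch is to loop over the groupby and keep a separate describe() table per group. The data below is hypothetical, since the original frame is not shown, and the name summaries is only for illustration.

```python
import pandas as pd

# Hypothetical data; the original frame is not shown in the text.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "age": [36, 24, 52, 42, 73],
    "preTestScore": [3, 2, 24, 4, 31],
    "postTestScore": [57, 25, 70, 62, 94],
})

# One describe() table per gender, keyed by the group label.
# describe() summarizes only the numeric columns by default.
summaries = {
    gender: group.describe()
    for gender, group in df.groupby("gender")
}

for gender, summary in summaries.items():
    print(gender)
    print(summary)
```

Each value in summaries is an ordinary DataFrame, so summaries["male"] can be used wherever df[df.gender == 'male'].describe() would be.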