To find the percentile of a value relative to an array (or in your case a dataframe column), use the scipy function stats.percentileofscore().
For example, if we have a value x (the other numerical value not in the dataframe), and a reference array, arr (the column from the dataframe), we can find the percentile of x by:
from scipy import stats
percentile = stats.percentileofscore(arr, x)
Note that there is a third parameter to the stats.percentileofscore() function that has a significant impact on the resulting value of the percentile, viz. kind. You can choose from rank, weak, strict, and mean. See the docs for more information.
For an example of the difference:
>>> df
a
0 1
1 2
2 3
3 4
4 5
>>> stats.percentileofscore(df['a'], 4, kind='rank')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='weak')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='strict')
60.0
>>> stats.percentileofscore(df['a'], 4, kind='mean')
70.0
As a final note, a value that is greater than 80% of the other values in the column is in the 80th percentile, not the 20th (see the example above for how the kind parameter affects the exact score). See this Wikipedia article for more information.
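To make the kind options more concrete, here is a minimal sketch that recomputes three of them by hand with NumPy. The helper name percentile_by_hand is made up for illustration, and the formulas are my reading of the definitions (strict counts values below the score, weak counts values at or below it, mean averages the two), not scipy's actual implementation:

```python
import numpy as np

def percentile_by_hand(arr, score):
    """Hypothetical helper: percentile of `score` in `arr` for three kinds."""
    a = np.asarray(arr)
    n = a.size
    below = int(np.count_nonzero(a < score))         # strictly smaller values
    at_or_below = int(np.count_nonzero(a <= score))  # smaller or equal values
    return {
        "strict": 100.0 * below / n,
        "weak": 100.0 * at_or_below / n,
        "mean": 50.0 * (below + at_or_below) / n,
    }

result = percentile_by_hand([1, 2, 3, 4, 5], 4)
print(result)
```

For the five-row column above this should reproduce the 60.0, 80.0 and 70.0 shown for strict, weak and mean.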
Probably very late but still
df['column_name'].describe()
will give you the usual 25th, 50th and 75th percentiles along with some additional statistics. If you want specific percentiles, pass them explicitly:
df['column_name'].describe(percentiles=[0.1, 0.2, 0.3, 0.5])
This gives you the 10th, 20th, 30th and 50th percentiles. You can pass as many values as you want.
The resulting object can be accessed like a dict:
desc = df['column_name'].describe(percentiles=[0.1, 0.2, 0.3, 0.5])
print(desc)
print(desc['10%'])
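As a quick sanity check, the sketch below (on made-up data, the integers 1 to 100) compares a value read from describe() with the one returned by quantile(). It assumes the default linear interpolation on both sides:

```python
import pandas as pd

# Made-up data for illustration: the integers 1..100
s = pd.Series(range(1, 101), name="column_name")

desc = s.describe(percentiles=[0.1, 0.2, 0.3, 0.5])

# Each requested percentile shows up under a label such as '10%'
p10_from_describe = desc["10%"]
p10_from_quantile = s.quantile(0.1)
print(p10_from_describe, p10_from_quantile)
```

Both numbers should agree, so describe() is mostly a convenience when you want several summary statistics at once.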
- You can use the pandas.DataFrame.quantile() function.
- If you look at the API for quantile(), you will see it takes an argument for how to do interpolation when the requested quantile falls between two positions in your data: 'linear', 'lower', 'higher', 'midpoint', or 'nearest'.
- By default, it performs linear interpolation.
- These interpolation methods are discussed in the Wikipedia article for percentile.
import pandas as pd
import numpy as np
# sample data
np.random.seed(2023) # for reproducibility
data = {'Category': np.random.choice(['hot', 'cold'], size=(10,)),
'field_A': np.random.randint(0, 100, size=(10,)),
'field_B': np.random.randint(0, 100, size=(10,))}
df = pd.DataFrame(data)
df.field_A.mean() # Same as df['field_A'].mean()
# 51.1
df.field_A.median()
# 50.0
# You can call `quantile(i)` to get the i'th quantile,
# where `i` should be a fractional number.
df.field_A.quantile(0.1) # 10th percentile
# 15.6
df.field_A.quantile(0.5) # same as median
# 50.0
df.field_A.quantile(0.9) # 90th percentile
# 88.8
df.groupby('Category').field_A.quantile(0.1)
#Category
#cold 28.8
#hot 8.6
#Name: field_A, dtype: float64
df
Category field_A field_B
0 cold 96 58
1 cold 22 28
2 hot 17 81
3 cold 53 71
4 cold 47 63
5 hot 77 48
6 cold 39 32
7 hot 69 29
8 hot 88 49
9 hot 3 49
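To see what the interpolation argument changes, here is a small sketch on a made-up four-element series, where the requested quantile deliberately falls between two data points:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])  # made-up data for illustration

# q=0.4 lands at position 0.4 * (4 - 1) = 1.2, between the values 2 and 3
methods = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {m: float(s.quantile(0.4, interpolation=m)) for m in methods}

for method, value in results.items():
    print(f"{method:>8}: {value}")
```

I would expect roughly 2.2 for linear, 2 for lower, 3 for higher, 2.5 for midpoint and 2 for nearest.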
Assume a series s:
s = pd.Series(np.arange(100))
Get the quantiles for [.1, .2, .3, .4, .5, .6, .7, .8, .9]:
s.quantile(np.linspace(.1, 1, 9, endpoint=False))
0.1 9.9
0.2 19.8
0.3 29.7
0.4 39.6
0.5 49.5
0.6 59.4
0.7 69.3
0.8 79.2
0.9 89.1
dtype: float64
Or, with interpolation='lower':
s.quantile(np.linspace(.1, 1, 9, endpoint=False), interpolation='lower')
0.1 9
0.2 19
0.3 29
0.4 39
0.5 49
0.6 59
0.7 69
0.8 79
0.9 89
dtype: int32
TL;DR
Use:
sz = temp['INCOME'].size-1
temp['PCNT_LIN'] = temp['INCOME'].rank(method='max').apply(lambda x: 100.0*(x-1)/sz)
INCOME PCNT_LIN
0 78 44.444444
1 38 11.111111
2 42 22.222222
3 48 33.333333
4 31 0.000000
5 89 55.555556
6 94 66.666667
7 102 77.777778
8 122 100.000000
9 122 100.000000
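The same column can probably be computed without apply(), since arithmetic on a Series is already vectorized. This is a sketch of that variant, not a change to the approach; the data is the INCOME column shown above:

```python
import pandas as pd

temp = pd.DataFrame({"INCOME": [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})

# Rank each income; ties get the highest rank of their group
rank = temp["INCOME"].rank(method="max")

# Rescale ranks 1..n to a 0..100 percentile
temp["PCNT_LIN"] = 100.0 * (rank - 1) / (len(temp) - 1)
print(temp)
```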
Answer
It is actually very simple once you understand the mechanics. When you are looking for the percentile of a score, you already have the scores in each row. The only step left is understanding that you need the percentage of numbers that are less than or equal to the selected value. This is exactly what kind='weak' of scipy.stats.percentileofscore() and method='max' (with pct=True) of Series.rank() give you. To invert it, run Series.quantile() with interpolation='lower'.
So the behavior of scipy.stats.percentileofscore(), Series.rank() and Series.quantile() is consistent, as shown below:
In[]:
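import pandas as pd
import scipy.stats
temp = pd.DataFrame([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], columns=['INCOME'])
# Share of values less than or equal to each income (0..1)
temp['PCNT_RANK'] = temp['INCOME'].rank(method='max', pct=True)
# The same quantity from scipy, on a 0..100 scale
temp['POF'] = temp['INCOME'].apply(lambda x: scipy.stats.percentileofscore(temp['INCOME'], x, kind='weak'))
# Inverse lookup: the income found at each PCNT_RANK
temp['QUANTILE_VALUE'] = temp['PCNT_RANK'].apply(lambda x: temp['INCOME'].quantile(x, interpolation='lower'))
temp['RANK'] = temp['INCOME'].rank(method='max')
sz = temp['RANK'].size - 1
# Linearly interpolated percentile (0..1)
temp['PCNT_LIN'] = temp['RANK'].apply(lambda x: (x - 1) / sz)
# Check: quantile() with the default linear interpolation recovers the income
temp['CHK'] = temp['PCNT_LIN'].apply(lambda x: temp['INCOME'].quantile(x))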
temp
Out[]:
INCOME PCNT_RANK POF QUANTILE_VALUE RANK PCNT_LIN CHK
0 78 0.5 50.0 78 5.0 0.444444 78.0
1 38 0.2 20.0 38 2.0 0.111111 38.0
2 42 0.3 30.0 42 3.0 0.222222 42.0
3 48 0.4 40.0 48 4.0 0.333333 48.0
4 31 0.1 10.0 31 1.0 0.000000 31.0
5 89 0.6 60.0 89 6.0 0.555556 89.0
6 94 0.7 70.0 94 7.0 0.666667 94.0
7 102 0.8 80.0 102 8.0 0.777778 102.0
8 122 1.0 100.0 122 10.0 1.000000 122.0
9 122 1.0 100.0 122 10.0 1.000000 122.0
The PCNT_RANK column now holds the fraction of values that are less than or equal to the value in the INCOME column. If you want the "interpolated" fraction instead, it is in the PCNT_LIN column. And because the calculation uses Series.rank(), it is pretty fast and will crunch 255M numbers in seconds.
Here I will explain how you get the value from using quantile() with linear interpolation:
temp['INCOME'].quantile(0.11)
37.93
Our data temp['INCOME'] has only ten values. According to the formula from the Wikipedia article you linked, the rank of the 11th percentile is
rank = 11*(10-1)/100 + 1 = 1.99
The integer part of the rank is 1, which corresponds to the value 31, and the value with rank 2 (i.e. the next bin) is 38. The fraction is the fractional part of the rank, 0.99. This leads to the result:
31 + (38-31)*(0.99) = 37.93
For the values themselves, the fractional part has to be zero, so it is easy to invert the calculation and get the percentile:
p = (rank - 1)*100/(10 - 1)
I hope I made it more clear.
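The arithmetic above can be written out as a small pure-Python sketch. The function names are hypothetical and the code simply follows the rank formula described here; it is not how pandas implements quantile() internally:

```python
import math

def quantile_linear(values, p):
    """Hypothetical helper: quantile at fraction p (0..1), linear interpolation."""
    data = sorted(values)
    n = len(data)
    rank = p * (n - 1) + 1        # 1-based fractional rank
    lower = math.floor(rank)      # integer part of the rank
    frac = rank - lower           # fractional part of the rank
    if lower >= n:                # p == 1: no next bin to interpolate with
        return float(data[-1])
    return data[lower - 1] + (data[lower] - data[lower - 1]) * frac

def percentile_of_rank(rank, n):
    """Inverse calculation: percentile (0..100) of an integer 1-based rank."""
    return (rank - 1) * 100 / (n - 1)

incomes = [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]
q11 = quantile_linear(incomes, 0.11)
p_rank2 = percentile_of_rank(2, len(incomes))
print(q11, p_rank2)
```

With the ten incomes, quantile_linear(incomes, 0.11) should come out at about 37.93, and rank 2 (the value 38) maps back to about the 11.1th percentile.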
This seems to work:
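# sorted reference values
A = np.sort(temp['INCOME'].values)
# `sample` is the value (or array of values) whose percentile you want, as a 0..1 fraction
np.interp(sample, A, np.linspace(0, 1, len(A)))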
For example:
>>> temp.INCOME.quantile(np.interp([37.5, 38, 122, 121], A, np.linspace(0, 1, len(A))))
0.103175 37.5
0.111111 38.0
1.000000 122.0
0.883333 121.0
Name: INCOME, dtype: float64
Please note that this strategy only makes sense if you want to query a large enough number of values. Otherwise the sorting is too expensive.
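Wrapped into a function, the idea might look like the sketch below. The name percentile_rank_interp is made up for illustration, and the data is the INCOME column from above:

```python
import numpy as np

def percentile_rank_interp(values, sample):
    """Hypothetical helper: interpolated percentile rank (0..1) of `sample`."""
    a = np.sort(np.asarray(values, dtype=float))   # sort once, reuse for all queries
    return np.interp(sample, a, np.linspace(0, 1, len(a)))

incomes = [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]
ranks = percentile_rank_interp(incomes, [37.5, 38, 121])
print(ranks)
```

Values outside the observed range are clamped by np.interp to 0 or 1, which may or may not be what you want.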
I have a dataset, df, and I would like to show the 60th, 70th, and 90th percentile values of one column for each group in another column.
DATA
type value
Hello 1
Hello 2
Hello 3
Hello 5
Hello 5
Hello 6
Hello 8
Hello 8
Hello 3
OK 1
OK 1
OK 2
OK 2
DESIRED
type 0.6 0.7 0.9
Hello 5 5.6 8
OK 1.8 2 2
DOING
My approach is to utilize the percentile function in numpy:
import numpy as np
print(np.percentile(df['value'], 60))
print(np.percentile(df['value'], 70))
print(np.percentile(df['value'], 90))
This works; however, the output shows these values individually and does not keep the other columns in the dataset.
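One possible way to get the desired layout is to group by type, ask quantile() for all three fractions at once, and unstack the result. This is only a sketch: it rebuilds the sample data from the question and relies on the default linear interpolation:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "type": ["Hello"] * 9 + ["OK"] * 4,
    "value": [1, 2, 3, 5, 5, 6, 8, 8, 3, 1, 1, 2, 2],
})

# One row per type, one column per requested quantile
result = df.groupby("type")["value"].quantile([0.6, 0.7, 0.9]).unstack()
print(result)
```

The columns of result are labelled with the quantile fractions (0.6, 0.7, 0.9), and the type values become the index; call reset_index() if type should be a regular column.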