It describes the distribution of your data: the 50% value describes "the middle" of the data, also known as the median. The 25% and 75% values are the boundaries of the lower and upper quarters of the data. Together they give you an idea of how skewed your data is. Note that the mean is higher than the median, which suggests your data is right-skewed.
Try:
import pandas as pd
x = [1, 2, 3, 4, 5]
x = pd.DataFrame(x)
x.describe()  # count, mean, std, min, 25%, 50%, 75%, max
Answer from Peter on Stack Exchange
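To see the mean-versus-median point on data that is clearly skewed, here is a small sketch. The values are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is large.
skewed = pd.Series([1, 2, 2, 3, 3, 4, 50])

summary = skewed.describe()
mean = summary["mean"]
median = summary["50%"]  # describe() reports the median in the 50% row

# A mean well above the median hints at a long right tail.
print(f"mean={mean:.2f}, median={median:.2f}, right-skew hint: {mean > median}")
```

This comparison is only a rough indicator; a histogram would show the shape more reliably.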
First, the describe table you posted does not seem to be the description of your array x.
Then, to compute a percentile yourself, you need to sort your array (x) and calculate the position that corresponds to your percentage p (in the .describe method, p is 0.25, 0.5 and 0.75).
In your example:
sorted_x = [0.09, 0.1 , 0.14, 0.23, 0.26, 0.29, 0.29, 0.3 , 0.31, 0.34, 0.61, 0.62, 0.63, 0.71, 0.73, 0.79, 0.91, 0.93, 0.93, 0.95]
The element located at the 25th percentile is found by splitting the sorted list into its lower 25 percent and upper 75 percent; the | shown below marks the 25% position:
sorted_x = [0.09, 0.1 , 0.14, 0.23, 0.26,**|** 0.29, 0.29, 0.3 , 0.31, 0.34, 0.61, 0.62, 0.63, 0.71, 0.73, 0.79, 0.91, 0.93, 0.93, 0.95]
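The position falls three quarters of the way between 0.26 and 0.29, so the value is calculated as $0.26 + (0.29 - 0.26) \times 0.75$, which equals $0.28250000000000003$ (that is, 0.2825 up to floating-point rounding).
In general, the percentile gives you the value located at that percentage of the way through the data (after the array is sorted, of course).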
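The calculation above can be reproduced by hand. The sketch below assumes linear interpolation between the two neighbouring sorted values, which is the default in NumPy and pandas; `linear_percentile` is a hypothetical helper name, not a library function.

```python
import numpy as np

x = [0.09, 0.1, 0.14, 0.23, 0.26, 0.29, 0.29, 0.3, 0.31, 0.34,
     0.61, 0.62, 0.63, 0.71, 0.73, 0.79, 0.91, 0.93, 0.93, 0.95]

def linear_percentile(values, p):
    """Percentile by linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    pos = p * (len(s) - 1)          # fractional index into the sorted list
    lo = int(pos)                   # index just below that position
    hi = min(lo + 1, len(s) - 1)    # index just above it (clamped at the end)
    frac = pos - lo
    return s[lo] + (s[hi] - s[lo]) * frac

manual = linear_percentile(x, 0.25)
print(manual, np.percentile(x, 25))  # both should be about 0.2825
```

With 20 values, the 25% position is index 0.25 * 19 = 4.75, so the result lies three quarters of the way from 0.26 to 0.29.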
Pandas' describe function internally uses the quantile function. The interpolation parameter of the quantile function determines how the quantile is estimated. The output below shows how you can get 3.75 or 3.5 as the 0.75 quantile depending on the interpolation used; linear is the default setting. For details, take a look at the source code of pandas' quantile function.
import pandas as pd
test = pd.Series([1, 2, 3, 4, 5, 1, 1, 1, 1, 9])
# Linear interpolation (the default)
quantile_linear = test.quantile(0.75, interpolation='linear')
print(f'quantile based on linear interpolation: {quantile_linear}')
# Output: quantile based on linear interpolation: 3.75
# Midpoint of the two neighbouring values
quantile_midpoint = test.quantile(0.75, interpolation='midpoint')
print(f'quantile based on midpoint interpolation: {quantile_midpoint}')
# Output: quantile based on midpoint interpolation: 3.5
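For comparison, here is a sketch that loops over the interpolation options accepted by `Series.quantile` on the same series. The comment about sorted positions is my own working, not pandas output.

```python
import pandas as pd

test = pd.Series([1, 2, 3, 4, 5, 1, 1, 1, 1, 9])

# Sorted: [1, 1, 1, 1, 1, 2, 3, 4, 5, 9]. The 0.75 position is index
# 0.75 * 9 = 6.75, which falls between the values 3 (index 6) and 4 (index 7).
results = {
    method: float(test.quantile(0.75, interpolation=method))
    for method in ["linear", "lower", "higher", "midpoint", "nearest"]
}
for method, value in results.items():
    print(f"{method:>8}: {value}")
```

Only `linear` and `midpoint` can return a value that is not in the data; the other options always pick one of the two neighbouring values.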
Percentiles indicate the percentage of scores that fall below a particular value. They tell you where a score stands relative to other scores.
For example, a person with a height of 215 cm is at the 91st percentile, which indicates that their height is greater than 91 percent of the other scores.
Percentiles are a great tool when you need to know the position of a value or score with respect to the population or data distribution you're considering. Where does a value fall within a distribution of values? While the concept behind percentiles is straightforward, there are different mathematical methods for calculating them.
In your example, 50% corresponds to the median of the ordered values. Because the number of values is even, the median has to be calculated between the fifth and sixth ordered values, here 1 and 2, as their mean: 1.5.
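Both ideas can be sketched in a few lines. The sample below is an assumption chosen so that its fifth and sixth ordered values are 1 and 2, as in the explanation above; `percent_below` is a hypothetical helper, not a library function.

```python
import numpy as np

# Assumed sample with an even number of values (10), for illustration only.
scores = np.array([1, 2, 3, 4, 5, 1, 1, 1, 1, 9])

def percent_below(values, score):
    """Percentage of values strictly below the given score."""
    return 100.0 * float(np.mean(values < score))

ordered = np.sort(scores)
half = len(ordered) // 2
fifth, sixth = ordered[half - 1], ordered[half]   # the two middle values
median = (fifth + sixth) / 2                      # mean of the middle pair

print(percent_below(scores, 5))                   # share of scores below 5
print(fifth, sixth, median, np.median(scores))
```

Note that "percentage of scores below a value" is only one of several percentile definitions; some count ties as half or use "at or below" instead.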
In the pandas documentation there is information about the computation of quantiles, where a reference to numpy.percentile is made:
Return value at the given quantile, a la numpy.percentile.
Then, checking the numpy.percentile documentation, we can see that the interpolation method is set to linear by default:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j
For your specific case, the 25th percentile results from:
res_25 = 4 + (6-4)*(3/4) = 5.5
For the 75th percentile we then get:
res_75 = 8 + (10-8)*(1/4) = 8.5
If you set the interpolation method to "midpoint", then you will get the results that you thought of.
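The two formulas can be checked against pandas. The question's data is not shown here, so the series `[4, 6, 8, 10]` below is an assumption that happens to reproduce the numbers above.

```python
import pandas as pd

# Assumed data: a guess that is consistent with res_25 and res_75 above.
s = pd.Series([4, 6, 8, 10])

by_hand = {}
for q in (0.25, 0.75):
    pos = q * (len(s) - 1)                 # fractional index: 0.75 and 2.25
    lower_idx = int(pos)
    i, j = s.iloc[lower_idx], s.iloc[lower_idx + 1]   # surrounding values
    fraction = pos - lower_idx
    by_hand[q] = float(i + (j - i) * fraction)
    print(q, by_hand[q],
          s.quantile(q, interpolation="linear"),
          s.quantile(q, interpolation="midpoint"))
```

With "midpoint", the result is simply the average of the two surrounding values, which is why it lands on whole numbers for this data.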
I think it's easier to understand by seeing this calculation as min+(max-min)*percentile. When the sorted values are evenly spaced, as they are here, it gives the same result as this function described in NumPy:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j
res_25 = 4+(10-4)*percentile = 4+(10-4)*25% = 5.5
res_75 = 4+(10-4)*percentile = 4+(10-4)*75% = 8.5
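A quick check suggests the shortcut matches NumPy's linear interpolation only when the sorted values are evenly spaced. Both lists below are assumptions for illustration, and `min_max_shortcut` is a hypothetical helper name.

```python
import numpy as np

def min_max_shortcut(values, p):
    """The shortcut min + (max - min) * p."""
    return min(values) + (max(values) - min(values)) * p

even = [4, 6, 8, 10]     # evenly spaced (assumed to match the example above)
uneven = [4, 6, 8, 20]   # hypothetical: same data with one large value

for data in (even, uneven):
    shortcut = min_max_shortcut(data, 0.25)
    reference = float(np.percentile(data, 25))
    print(data, shortcut, reference, "match" if shortcut == reference else "differ")
```

For the uneven list, the shortcut is pulled up by the maximum, while the interpolated percentile still depends only on the two values around the 25% position.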
In simple words...
You will see the percentiles (25%, 50%, 75%, etc.) with a value next to each of them.
Their purpose is to tell you how your data is distributed.
For example:
s = pd.Series([1, 2, 3, 1])
s.describe() will give
count    4.000000
mean     1.750000
std      0.957427
min      1.000000
25%      1.000000
50%      1.500000
75%      2.250000
max      3.000000
25% means that 25% of your data have the value 1.0 or below. That is, if you were to look at your data manually, 25% of it is less than or equal to 1. You will agree with this if you look at our data [1, 2, 3, 1]: [1], which is 25% of the data, is less than or equal to 1.
50% means that 50% of your data have the value 1.5 or below: [1, 1], which make up 50% of the data, are less than or equal to 1.5.
75% means that 75% of your data have the value 2.25 or below: [1, 2, 1], which make up 75% of the data, are less than or equal to 2.25.
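You can check these statements by counting. The sketch below computes, for each reported percentile, the share of values at or below it; with such a small sample and repeated values, the share can be larger than the nominal percentage.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1])
desc = s.describe()

shares = {}
for label in ["25%", "50%", "75%"]:
    cutoff = desc[label]
    # Fraction of values that are less than or equal to the cutoff.
    shares[label] = float((s <= cutoff).mean())
    print(f"{label}: cutoff={cutoff}, share at or below={shares[label]:.2f}")
```

Here both 1s are at or below the 25% cutoff of 1.0, so the share is one half rather than one quarter; the percentile guarantees at least the stated share, not exactly that share.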
To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.