Completely vectorized numpy solution
Here is the code I use. It is not optimal (I was unable to write an optimal version with numpy), but it is still much faster and more reliable than the accepted solution.
import numpy as np


def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    # Position of each value: the midpoint of its block of cumulative weight.
    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)
Examples:
>>> weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.])
array([ 1. ,  3.2,  9. ])
>>> weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.], sample_weight=[2, 1, 2, 4, 1])
array([ 1. ,  3.2,  9. ])
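As a quick sanity check on the `old_style` flag, the sketch below compares a condensed copy of the function (repeated here only so the snippet runs on its own, and always sorting its input) against `np.quantile` on unweighted data. The quantile levels in `qs` are made up for illustration.

```python
import numpy as np


def weighted_quantile(values, quantiles, sample_weight=None, old_style=False):
    # Condensed copy of the function above; it always sorts its input.
    values = np.asarray(values, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.asarray(sample_weight, dtype=float)
    sorter = np.argsort(values)
    values, sample_weight = values[sorter], sample_weight[sorter]
    positions = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        positions -= positions[0]
        positions /= positions[-1]
    else:
        positions /= np.sum(sample_weight)
    return np.interp(quantiles, positions, values)


data = [1, 2, 9, 3.2, 4]
qs = [0.1, 0.25, 0.5, 0.9]

# With unit weights, old_style=True should reproduce NumPy's default
# (linear interpolation) quantiles.
old = weighted_quantile(data, qs, old_style=True)
reference = np.quantile(data, qs)
matches_numpy = bool(np.allclose(old, reference))
print(matches_numpy)

# Without old_style, each value sits at the midpoint of its weight block,
# so the results generally differ from np.quantile away from the median.
new = weighted_quantile(data, qs)
print(new)
```

The default (non-`old_style`) behaviour clamps to the smallest and largest values for quantiles below the first midpoint or above the last one, which is why `0.0` and `1.` return the extremes in the examples above.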
Answer from Alleo on Stack Overflow
This now seems to be implemented in statsmodels:
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW
wq = DescrStatsW(data=np.array([1, 2, 9, 3.2, 4]), weights=np.array([0.0, 0.5, 1.0, 0.3, 0.5]))
wq.quantile(probs=np.array([0.1, 0.9]), return_pandas=False)
# array([2., 9.])
The DescrStatsW object also implements other weighted statistics, such as the weighted mean. See https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html
import numpy as np
your_data = [1.7, 2.2, 3.9]
your_weights = [2, 1, 5]
xw = np.repeat(your_data, your_weights)
Your xw should then be
[1.7, 1.7, 2.2, 3.9, 3.9, 3.9, 3.9, 3.9]
Unfortunately, numpy doesn't have built-in weighted functions for everything, but you can put things together in this way when the weights are non-negative integers.
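Once the data are expanded, any unweighted numpy statistic can be applied to `xw`. A minimal sketch, assuming the weights are non-negative integers (which `np.repeat` needs as repeat counts); the particular statistics chosen here are only for illustration:

```python
import numpy as np

your_data = [1.7, 2.2, 3.9]
your_weights = [2, 1, 5]          # integer repeat counts

# Expand each value according to its weight.
xw = np.repeat(your_data, your_weights)

weighted_median = np.median(xw)
weighted_q25 = np.quantile(xw, 0.25)   # linear interpolation between neighbours
weighted_mean = np.mean(xw)            # agrees with np.average(..., weights=...)

print(weighted_median, weighted_q25, weighted_mean)
```

The expanded array has as many elements as the sum of the weights, so this approach is only practical when that sum is reasonably small.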
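For simplicity, I'll assume that interpolation isn't needed, and that it suffices to find the individual nearest to the $q^\text{th}$ quantile point, where $0 \leqslant q \leqslant 1.$

Suppose that the population consists of $N$ individuals, sorted in ascending order of the values of some attribute. Suppose that there are $r$ different attribute values, and that $m_i$ individuals have the $i^\text{th}$ value of the attribute, for $i = 1, 2, \ldots, r.$ Then $m_1 + m_2 + \cdots + m_r = N.$

Represent the $k^\text{th}$ individual as the centre of a notional continuous interval $[k - 1, k),$ for $k = 1, 2, \ldots, N.$ Then the entire population occupies the interval $[0, N),$ and the $q^\text{th}$ quantile point is at $qN.$ We simplistically replace this with the nearest individual, taking the higher-numbered one in the ambiguous case when $qN$ is an integer. Thus we take the $q^\text{th}$ quantile to be individual number $\lfloor qN \rfloor + 1,$ for $q \in [0, 1),$ or number $N,$ in the special case $q = 1.$

Define the partial sums $M_i = m_1 + m_2 + \cdots + m_i,$ for $i = 0, 1, \ldots, r.$ These form a strictly increasing sequence $M_0 < M_1 < \cdots < M_r,$ where $M_0 = 0$ and $M_r = N.$ For $k = 1, 2, \ldots, N,$ therefore, there exists a unique positive integer $i = f(k) \leqslant r$ such that $M_{i-1} < k \leqslant M_i.$ That means that the $k^\text{th}$ individual in the population has the $i^\text{th}$ attribute value.

In terms of this function $f,$ if $(s_1, s_2, \ldots, s_r)$ is the list of attribute values sorted into ascending order, then the $q^\text{th}$ quantile value of the attribute is (ignoring the special case $q = 1$):
$$
s_{f(\lfloor qN \rfloor + 1)}.
$$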
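As a worked example, take the attribute values 6, 4, 2 with multiplicities 1, 3, 5 (the data used in `main()` of the module below) and $q = 1/2.$ Here $s_i$ are the values in ascending order, $m_i$ their multiplicities, $M_i$ the partial sums of the $m_i,$ and $f(k)$ the index $i$ with $M_{i-1} < k \leqslant M_i$:

```latex
\begin{align*}
(s_1, s_2, s_3) &= (2, 4, 6), \qquad (m_1, m_2, m_3) = (5, 3, 1), \qquad N = 9, \\
(M_0, M_1, M_2, M_3) &= (0, 5, 8, 9), \\
k &= \lfloor qN \rfloor + 1 = \lfloor 4.5 \rfloor + 1 = 5, \\
M_0 = 0 &< 5 \leqslant 5 = M_1 \implies f(5) = 1, \\
\text{median} &= s_{f(5)} = s_1 = 2.
\end{align*}
```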
Here's a toy Python 3 module that does the job. I haven't tried it on any large arrays. For all I know, the way I've coded it may use tons of resources. (You'll surely need to recode it anyway, for instance to use interpolation.)
"""Compute quantiles: see https://math.stackexchange.com/q/3721765."""
__all__ = ['weighted']

import math, operator, itertools

class weighted(object):
    """
    Structure of repeated attribute values in ascending order.
    """

    def __init__(self, x, w):
        """
        Create sorted data from unsorted attribute values and their "weights".
        """
        self.xs, self.ws = zip(*sorted(zip(x, w), key=operator.itemgetter(0)))
        self.subtotals = list(itertools.accumulate(self.ws))
        self.N = self.subtotals[-1]

    def individual(self, q):
        """
        Identify individual member of population nearest to the q'th quantile.
        """
        return math.floor(q * self.N) + 1 if q < 1 else self.N

    def attribute(self, k):
        """
        Compute attribute index of k'th individual member of the population.
        """
        for i, M in enumerate(self.subtotals):
            if M >= k:
                return i

    def quantile(self, q):
        """
        Compute q'th quantile value of the attribute.
        """
        return self.xs[self.attribute(self.individual(q))]

def main():
    print('median = {}'.format(weighted([6, 4, 2],[1, 3, 5]).quantile(.5)))

if __name__ == '__main__':
    main()
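The linear scan in `attribute` could be replaced by a binary search over the subtotals, which may matter when there are many distinct attribute values. A minimal sketch using the standard-library `bisect` module; the function name `quantile_bisect` is hypothetical and not part of the module above, and the weights are assumed to be positive integers:

```python
import bisect
import itertools
import math


def quantile_bisect(x, w, q):
    """Nearest-individual q'th quantile of values x with positive integer weights w."""
    xs, ws = zip(*sorted(zip(x, w)))
    subtotals = list(itertools.accumulate(ws))   # M_1, ..., M_r
    n = subtotals[-1]
    k = math.floor(q * n) + 1 if q < 1 else n    # individual number, 1..N
    # bisect_left returns the first index i with subtotals[i] >= k,
    # which is the same index the linear scan would find.
    return xs[bisect.bisect_left(subtotals, k)]


print(quantile_bisect([6, 4, 2], [1, 3, 5], 0.5))   # prints 2
```

Because the subtotals are strictly increasing, the binary search and the linear scan always agree; only the lookup cost changes.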
Version 0.2
This is still a toy implementation. In particular, it still might be hugely
inefficient (I haven't given any thought to that question), and it
still hasn't been tested on any large datasets. What is nice about
it is that the new class multilist is obviously capable of
being considerably elaborated. (No doubt I'll tinker with it a lot,
but there isn't likely to be any good reason to post my tinkerings here.)
I'm not sure how to post code in Maths.SE, so the indentation of the code isn't quite consistent.
"""Lists of items with multiplicity, analogous to multisets."""
__all__ = ['individual', 'multilist', 'quantile']

import math, itertools

def individual(q, N):
    """
    Number (1 to N) of individual near q'th quantile of population of size N.
    """
    return math.floor(q*N) + 1 if q < 1 else N

def quantile(x, q):
    """
    Compute the q'th quantile value of the given *sorted* (N.B.!) multilist x.
    """
    return x[individual(q, len(x))]

class multilist(object):
    """
    List of elements with multiplicity: similar to a multiset, whence the name.

    The multiplicity of each element is a positive integer. The purpose of the
    multilist is to behave like a list in which each element occurs many times,
    without actually having to store all of those occurrences.
    """

    def __init__(self, x, w):
        """
        Create multilist from list of values and list of their multiplicities.
        """
        self.items = x
        self.times = w
        self.subtotals = list(itertools.accumulate(self.times))

    def __len__(self):
        """
        Get the number of items in a list with multiplicities.

        The syntax needed to call this function is "len(x)", where x is the
        name of the multilist.
        """
        return self.subtotals[-1]

    def __getitem__(self, k):
        """
        Find the k'th item in a list with multiplicities.

        If the multiplicities are m_1, m_2, ..., m_r (note that Python indices
        are 1 less, running from 0 to r - 1), and subtotals M_0, M_1, ..., M_r,
        where M_i = m_1 + m_2 + ... + m_i (i = 0, 1, ..., r), then we want the
        unique i (but the Python code uses i - 1) such that M_{i-1} < k <= M_i.

        The syntax needed to call this function is "x[k]", where x is the name
        of the multilist, and 1 <= k <= len(x).
        """
        for i, M in enumerate(self.subtotals):
            if M >= k:
                return self.items[i]

    def sorted(self):
        """
        Return a sorted copy of the given multilist.

        Note on the implementation: by default, 2-tuples in Python are compared
        lexicographically, i.e. by the first element, or the second in the case
        of a tie; so there is no need for parameter key=operator.itemgetter(0).
        """
        return multilist(*zip(*sorted(zip(self.items, self.times))))

def main():
    data = multilist([6, 4, 2], [1, 3, 5]).sorted()
    print('median = {}'.format(quantile(data, .5)))

if __name__ == '__main__':
    main()
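For larger inputs, the same nearest-individual rule could be evaluated for many quantiles at once with numpy. This is only a sketch under the same assumption of positive integer multiplicities; `multilist_quantiles` is a hypothetical helper name and is not part of the module above:

```python
import numpy as np


def multilist_quantiles(items, times, qs):
    """Nearest-individual quantiles for several q values at once."""
    items = np.asarray(items)
    times = np.asarray(times)
    order = np.argsort(items, kind="stable")
    items, times = items[order], times[order]
    subtotals = np.cumsum(times)                 # M_1, ..., M_r
    n = subtotals[-1]
    qs = np.asarray(qs, dtype=float)
    # Individual number floor(q*N) + 1, capped at N for the special case q = 1.
    k = np.minimum(np.floor(qs * n).astype(np.int64) + 1, n)
    # side="left" gives the first index i with subtotals[i] >= k.
    return items[np.searchsorted(subtotals, k, side="left")]


result = multilist_quantiles([6, 4, 2], [1, 3, 5], [0.0, 0.5, 0.6, 1.0])
print(result)
```

Capping at `n` plays the role of the `q < 1` test in `individual`, since `floor(q*N) + 1` never exceeds `N` when `q < 1`.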
pip install wquantiles