Hướng dẫn weighted median python numpy

What we can do, if i understood your problem correctly. Is to sum up the observations, dividing by 2 would give us the observation number corresponding to the median. From there we need to figure out what observation this number was.

One trick here, is to calculate the observation sums with np.cumsum. Which gives us a running cumulative sum.

Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previously elements and itself. We have 10 observations here. so the mean would be the 5th observation. (We get 5 by dividing the last element by 2).
Now looking at the cumsum result, we can easily see that that must be the observation between the second and third elements (observation 3 and 6).

So all we need to do, is figure out the index of where the median (5) will fit.
np.searchsorted does exactly what we need. It will find the index to insert an elements into an array, so that it stays sorted.

The code to do it like so:

import numpy as np
#my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10,20,30,40], [100,10,10,10], [1,1,1,100]])

c = np.cumsum(freq_count, axis=1) 
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices] #Correct if the masses are indeed 0, 10, 20,...

#This is just for explanation.
print "median masses is:",  masses
print freq_count
print np.hstack((c, c[:, -1, np.newaxis]/2.0))

Output will be:

median masses is: [10 20 20  0 30]  
[[ 30 191   9   0]  <- The test data
 [ 10  20 300  10]  
 [ 10  20  30  40]  
 [100  10  10  10]  
 [  1   1   1 100]]  
[[  30.   221.   230.   230.   115. ]  <- cumsum results with median added to the end.
 [  10.    30.   330.   340.   170. ]     you can see from this where they fit in.
 [  10.    30.    60.   100.    50. ]  
 [ 100.   110.   120.   130.    65. ]  
 [   1.     2.     3.   103.    51.5]]  

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
calculate a weighted median
@author Jack Peterson ()
"""
from __future__ import division
import numpy as np
def weighted_median(data, weights):
"""
Args:
data (list or numpy.array): data
weights (list or numpy.array): weights
"""
data, weights = np.array(data).squeeze(), np.array(weights).squeeze()
s_data, s_weights = map(np.array, zip(*sorted(zip(data, weights))))
midpoint = 0.5 * sum(s_weights)
if any(weights > midpoint):
w_median = (data[weights == np.max(weights)])[0]
else:
cs_weights = np.cumsum(s_weights)
idx = np.where(cs_weights <= midpoint)[0][-1]
if cs_weights[idx] == midpoint:
w_median = np.mean(s_data[idx:idx+2])
else:
w_median = s_data[idx+1]
return w_median
def test_weighted_median():
data = [
[7, 1, 2, 4, 10],
[7, 1, 2, 4, 10],
[7, 1, 2, 4, 10, 15],
[1, 2, 4, 7, 10, 15],
[0, 10, 20, 30],
[1, 2, 3, 4, 5],
[30, 40, 50, 60, 35],
[2, 0.6, 1.3, 0.3, 0.3, 1.7, 0.7, 1.7, 0.4],
]
weights = [
[1, 1/3, 1/3, 1/3, 1],
[1, 1, 1, 1, 1],
[1, 1/3, 1/3, 1/3, 1, 1],
[1/3, 1/3, 1/3, 1, 1, 1],
[30, 191, 9, 0],
[10, 1, 1, 1, 9],
[1, 3, 5, 4, 2],
[2, 2, 0, 1, 2, 2, 1, 6, 0],
]
answers = [7, 4, 8.5, 8.5, 10, 2.5, 50, 1.7]
for datum, weight, answer in zip(data, weights, answers):
assert(weighted_median(datum, weight) == answer)
if __name__ == "__main__":
test_weighted_median()