Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example)intervals: A 2-rowmatrixwith the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in yourdata.frame, and you may specifysep(and possibly,trim) to have the function automatically create the required matrix for you.sep: The separator character in your "intervals" column in yourdata.frame.trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function:trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
# If "sep" is specified, the function will try to create the
# required "intervals" matrix. "trim" removes any unwanted
# characters before attempting to convert the ranges to numeric.
if (!is.null(sep)) {
if (is.null(trim)) pattern <- ""
else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
else pattern <- trim
intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
}
Midpoints <- rowMeans(intervals)
cf <- cumsum(frequencies)
Midrow <- findInterval(max(cf)/2, cf) + 1
L <- intervals[1, Midrow] # lower class boundary of median class
h <- diff(intervals[, Midrow]) # size of median class
f <- frequencies[Midrow] # frequency of median class
cf2 <- cf[Midrow - 1] # cumulative frequency class before median class
n_2 <- max(cf)/2 # total observations divided by 2
unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, with the sample data that you provided, where I think there was a mistake in one of your ranges (all were separated by dashes except one, which was separated by a comma), since strsplit uses a regular expression by default to split on, you can use the function like this:
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
Answer from A5C1D2H2I1M1N2O1R2T1 on Stack Overflowr - how to calculate the median on grouped dataset? - Stack Overflow
Is there a single formula for Calculating Median of Grouped Data for multiple datasets ?
statistics - Derivation of formulas for median and mode for grouped data - Mathematics Stack Exchange
statistics - Derivation of formula for finding median for grouped data - Mathematics Stack Exchange
Why is the median useful in grouped data?
Why do we use the median for grouped data?
How is the median different from mean in grouped data?
Videos
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example)intervals: A 2-rowmatrixwith the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in yourdata.frame, and you may specifysep(and possibly,trim) to have the function automatically create the required matrix for you.sep: The separator character in your "intervals" column in yourdata.frame.trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function:trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
# If "sep" is specified, the function will try to create the
# required "intervals" matrix. "trim" removes any unwanted
# characters before attempting to convert the ranges to numeric.
if (!is.null(sep)) {
if (is.null(trim)) pattern <- ""
else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
else pattern <- trim
intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
}
Midpoints <- rowMeans(intervals)
cf <- cumsum(frequencies)
Midrow <- findInterval(max(cf)/2, cf) + 1
L <- intervals[1, Midrow] # lower class boundary of median class
h <- diff(intervals[, Midrow]) # size of median class
f <- frequencies[Midrow] # frequency of median class
cf2 <- cf[Midrow - 1] # cumulative frequency class before median class
n_2 <- max(cf)/2 # total observations divided by 2
unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, with the sample data that you provided, where I think there was a mistake in one of your ranges (all were separated by dashes except one, which was separated by a comma), since strsplit uses a regular expression by default to split on, you can use the function like this:
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
I've written it like this to clearly explain how it's being worked out. A more compact version is appended.
library(data.table)
#constructing the dataset with the salary range split into low and high
salarydata <- data.table(
salaries_low = 100*c(15:24),
salaries_high = 100*c(16:25),
numbers = c(110,180,320,460,850,250,130,70,20,10)
)
#calculating cumulative number of observations
salarydata <- salarydata[,cumnumbers := cumsum(numbers)]
salarydata
# salaries_low salaries_high numbers cumnumbers
# 1: 1500 1600 110 110
# 2: 1600 1700 180 290
# 3: 1700 1800 320 610
# 4: 1800 1900 460 1070
# 5: 1900 2000 850 1920
# 6: 2000 2100 250 2170
# 7: 2100 2200 130 2300
# 8: 2200 2300 70 2370
# 9: 2300 2400 20 2390
# 10: 2400 2500 10 2400
#identifying median group
mediangroup <- salarydata[
(cumnumbers - numbers) <= (max(cumnumbers)/2) &
cumnumbers >= (max(cumnumbers)/2)]
mediangroup
# salaries_low salaries_high numbers cumnumbers
# 1: 1900 2000 850 1920
#creating the variables needed to calculate median
mediangroup[,l := salaries_low]
mediangroup[,h := salaries_high - salaries_low]
mediangroup[,f := numbers]
mediangroup[,c := cumnumbers- numbers]
n = salarydata[,sum(numbers)]
#calculating median
median <- mediangroup[,l + ((h/f)*((n/2)-c))]
median
# [1] 1915.294
The compact version -
EDIT: Changed to a function at @AnandaMahto's suggestion. Also, using more general variable names.
library(data.table)
#Creating function
CalculateMedian <- function(
LowerBound,
UpperBound,
Obs
)
{
#calculating cumulative number of observations and n
dataset <- data.table(UpperBound, LowerBound, Obs)
dataset <- dataset[,cumObs := cumsum(Obs)]
n = dataset[,max(cumObs)]
#identifying mediangroup and dynamically calculating l,h,f,c. We already have n.
median <- dataset[
(cumObs - Obs) <= (max(cumObs)/2) &
cumObs >= (max(cumObs)/2),
LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs- Obs))
]
return(median)
}
# Using function
CalculateMedian(
LowerBound = 100*c(15:24),
UpperBound = 100*c(16:25),
Obs = c(110,180,320,460,850,250,130,70,20,10)
)
# [1] 1915.294
Each row is a separate dataset (up to 150 rows in a spreadsheet). The columns give the frequency in each group. I can manually find the median class and calculate the median for each row (albeit with some difficulty). But would like to make it a more automatic procedure.
I hope the screen shot below helps.
This formula is the result of a linear interpolation, which identifies the median under the assumption that data are uniformly distributed within the median class.
To derive the formula, we can note that since $N/2$ is the number of observations below the median, then $N/2 - F_{m-1}$ is the number of observations that are within the median class and that are below the median ($F_{m-1}$ is the cumulative frequency of the interval below the median class, i.e. of all classes below the median class).
As a result, the fraction $\displaystyle\frac {N/2 - F_{m-1}}{f_m}$ (where $f_m$ is the frequency of the median class) represents the proportion of data values in the median class that are below the median.
Now if we assume that data are uniformly distributed (i.e., equally spaced) within the median class, multiplying the last fraction by $c$ (total width of the median class) we obtain the fraction of median class width corresponding to the position of the median. Adding the result to $L_m$ (lower limit of the median class), we get the final formula $\displaystyle L_m + \left [ \frac { \frac{N}{2} - F_{m-1} }{f_m} \right ] \cdot c$, which identifies the median.
Suppose you want the fraction value of $4$ boys out of total children $10$ we calculate $\frac{4}{10}$ multiplied by total value .
Similarly we calculate median $=l+x$ , where $x$ is the fraction value of median observation .
Thus we calculate $x = \frac{1}{f}(\frac{n}{2}- cf) รc $