Monthly Archives: December 2015

Re-sampling (up and down) in R

In a previous post (Sampling in R), I showed how a simple simulation in R can generate student marks that are not uniformly distributed. Such a population is unbalanced with respect to its classes, which in this case are the mark symbols, e.g. “A”.

Furthermore, each symbol's population was associated with a quantitative distribution of the hours spent studying.

Fortunately, the R package caret can be used to address this problem of class imbalance.

A uniform (balanced) class population can be achieved in two ways: by “down” sampling the classes that have more samples, or by “up” sampling the classes that have fewer.
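The idea behind both approaches can be sketched in base R, without caret. This is only an illustration of the principle and not necessarily how caret implements it; the helper name resample_classes and the toy data frame are assumptions made for this example.

```r
# Hypothetical helper: balance classes by resampling rows within each class.
# target = "min" down-samples every class to the smallest class size,
# target = "max" up-samples every class to the largest class size.
resample_classes <- function(data, class_col, target = c("min", "max")) {
  target <- match.arg(target)
  counts <- table(data[[class_col]])
  n <- if (target == "min") min(counts) else max(counts)
  # Row indices grouped by class
  idx_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  picked <- lapply(idx_by_class, function(idx) {
    if (length(idx) >= n) {
      # Enough rows: draw n of them without replacement
      idx[sample.int(length(idx), size = n)]
    } else {
      # Too few rows: keep them all and add draws with replacement
      extra <- sample.int(length(idx), size = n - length(idx), replace = TRUE)
      c(idx, idx[extra])
    }
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
# Toy data (assumed for illustration): three unbalanced classes
toy <- data.frame(
  Hours   = runif(60, 0, 10),
  AvgMark = rep(c("A", "B", "C"), times = c(10, 20, 30))
)

down_toy <- resample_classes(toy, "AvgMark", "min")
up_toy   <- resample_classes(toy, "AvgMark", "max")

table(down_toy$AvgMark)  # 10 rows per class
table(up_toy$AvgMark)    # 30 rows per class
```

Down sampling discards rows from the larger classes, while up sampling repeats rows from the smaller ones, so the up sampled data contains duplicates.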

Consider the data frame sampled in the previous post (Sampling in R), whose population of marks looked like this:


table(df$AvgMark)

A B C D E
3000 5000 7000 4000 1000
[Figure: StudentsBefore]

The quantitative distribution (hours of studying) within each mark was represented by:

[Figure: StudentsMarkWithinBefore]

Down sampling


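library(caret)
set.seed(9560)
# downSample() needs a factor response; drop = FALSE keeps the predictors as a data frame.
# It already returns a data frame, with the class in the last column.
down_df <- downSample(x = df[, -ncol(df), drop = FALSE], y = factor(df$AvgMark))
colnames(down_df) <- c("Hours", "AvgMark")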

The result is a down sampled, uniform distribution of marks:

table(down_df$AvgMark)

A B C D E
1000 1000 1000 1000 1000

Also important here is that the normally distributed quantitative distribution (hours of studying) within each population is maintained:

[Figure: normalAfter]
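One way to check this numerically, rather than by eye, is to compare the per-class mean and standard deviation of the hours before and after down sampling. The sketch below is self-contained: it builds a stand-in for df (called df_demo) using the class counts shown earlier, but the per-class mean hours and the standard deviation are invented for illustration, since the real values come from the previous post. The down sampling is done in base R here rather than with caret.

```r
set.seed(9560)

# Stand-in for df: class counts match the table above; the mean hours per
# class and the common standard deviation are ASSUMED values.
counts     <- c(A = 3000, B = 5000, C = 7000, D = 4000, E = 1000)
mean_hours <- c(A = 9, B = 7, C = 5, D = 3, E = 1)  # hypothetical
df_demo <- data.frame(
  Hours   = rnorm(sum(counts), mean = rep(mean_hours, times = counts), sd = 0.5),
  AvgMark = factor(rep(names(counts), times = counts))
)

# Down-sample every class to the size of the smallest one (base R)
n_min <- min(table(df_demo$AvgMark))
keep <- unlist(lapply(split(seq_len(nrow(df_demo)), df_demo$AvgMark),
                      function(idx) idx[sample.int(length(idx), n_min)]))
down_demo <- df_demo[keep, ]

# Per-class mean and standard deviation of the hours
summarise_hours <- function(d) {
  cbind(mean = tapply(d$Hours, d$AvgMark, mean),
        sd   = tapply(d$Hours, d$AvgMark, sd))
}
before <- summarise_hours(df_demo)
after  <- summarise_hours(down_demo)

print(round(before, 2))
print(round(after, 2))
```

If the two summaries agree closely for every class, the random down sampling has kept the shape of each class's distribution and only reduced its size.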

Up sampling

As with down sampling, populations with fewer elements can be “up” sampled to achieve a uniform distribution as follows:


# upSample() needs a factor response; drop = FALSE keeps the predictors as a data frame
up_df <- upSample(x = df[, -ncol(df), drop = FALSE], y = factor(df$AvgMark))
colnames(up_df) <- c("Hours", "AvgMark")

And the result:

table(up_df$AvgMark)

A B C D E
7000 7000 7000 7000 7000

Notice that each class now has 7000 elements, the size of the largest original class (C), compared with only 1000 elements, the size of the smallest class (E), in the down sampling scenario.