Re-sampling (up and down) in R

In a previous post (Sampling in R) I explained how a simple simulation in R can be used to generate student marks that are non-uniformly distributed. When the classes (in this case the mark symbols, e.g. “A”) do not occur equally often, the population is said to be unbalanced.

Furthermore, for each symbol population, a quantitative distribution of the hours spent studying was associated with a symbol.

Fortunately, the caret package in R can be used to address this problem of class imbalance.

A uniform (balanced) class population can be achieved in two ways: by “down” sampling the populations that have more samples, or by “up” sampling the populations that have fewer elements.

Consider the data frame which was sampled in the previous post (Sampling in R) whose population of marks looked like the following:


table(df$AvgMark)

   A    B    C    D    E
3000 5000 7000 4000 1000
[Figure: StudentsBefore]

The quantitative distribution (hours of studying) within each mark was represented by:

[Figure: StudentsMarkWithinBefore]
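The df itself was built in the previous post and is not repeated here. For readers who want to follow along, a stand-in with the same class counts could be built roughly as follows. This is only a sketch: the column names Hours and AvgMark come from the code in this post, while the mean hours per mark, the standard deviation and the seed are made-up values.

```r
# Hypothetical stand-in for the df built in the previous post.
# The class counts match the table above; mean_hours and sd = 1 are
# invented purely for illustration.
set.seed(1)
counts     <- c(A = 3000, B = 5000, C = 7000, D = 4000, E = 1000)
mean_hours <- c(A = 10, B = 8, C = 6, D = 4, E = 2)

AvgMark <- factor(rep(names(counts), times = counts), levels = names(counts))
# as.character() makes the lookup go by name rather than by factor code
Hours <- rnorm(length(AvgMark), mean = mean_hours[as.character(AvgMark)], sd = 1)

# Class column goes last, as the code in this post expects
df <- data.frame(Hours = Hours, AvgMark = AvgMark)
table(df$AvgMark)
```

Keeping the class as the last column matters, because the calls in this post use df[, -ncol(df)] to select the predictors.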

Down sampling


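library(caret)
set.seed(9560)  # downSample() picks rows at random; fix the seed for repeatable results
# y must be a factor; drop = FALSE keeps the single predictor column as a data frame
down_df <- downSample(x = df[, -ncol(df), drop = FALSE], y = factor(df$AvgMark))
colnames(down_df) <- c("Hours", "AvgMark")  # downSample() names the class column "Class"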

The result is a down-sampled, uniform distribution of marks:

table(down_df$AvgMark)

   A    B    C    D    E
1000 1000 1000 1000 1000

Also important here is that the normally distributed quantitative variable (hours of studying) within each population is maintained:

[Figure: normalAfter]
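Conceptually, down-sampling keeps a random subset of each class, sized to the smallest class, which is why the shape of the hours distribution within each mark survives. The base-R sketch below illustrates the idea on a small made-up data frame (toy, with hypothetical class sizes of 10, 30 and 20). It is not caret's actual implementation, only an approximation of what it does.

```r
# Base-R sketch of down-sampling on toy data (not the df from this post)
set.seed(9560)
toy <- data.frame(
  Hours   = rnorm(60, mean = 5, sd = 1),
  AvgMark = factor(rep(c("A", "B", "C"), times = c(10, 30, 20)))
)

n_min <- min(table(toy$AvgMark))  # size of the smallest class (10 here)
rows_by_class <- split(seq_len(nrow(toy)), toy$AvgMark)

# Keep n_min randomly chosen rows from every class, without replacement
keep <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))
toy_down <- toy[keep, ]

table(toy_down$AvgMark)
```

Because rows are drawn without replacement, no observation appears twice; the price is that rows from the larger classes are thrown away.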

Up sampling

Similarly to down sampling, the same approach can be taken to “up” sample the populations with fewer elements to achieve a uniform distribution, as follows:


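# Same call pattern as downSample(); y must again be a factor
up_df <- upSample(x = df[, -ncol(df), drop = FALSE], y = factor(df$AvgMark))
colnames(up_df) <- c("Hours", "AvgMark")  # upSample() names the class column "Class"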

And the result:

table(up_df$AvgMark)

   A    B    C    D    E
7000 7000 7000 7000 7000

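Notice that each class now has 7000 elements, compared with only 1000 per class in the down sampling scenario: up sampling brings every class up to the size of the largest class (C), whereas down sampling cuts every class down to the size of the smallest (E).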
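The extra rows in an up-sampled data set are not new observations; they are repeats of existing ones, drawn with replacement. A base-R sketch of the idea on a small made-up data frame (toy, with hypothetical class sizes of 10, 30 and 20) might look like this. It approximates what caret does, as far as I understand it, and is not its actual code.

```r
# Base-R sketch of up-sampling on toy data (not the df from this post)
set.seed(9560)
toy <- data.frame(
  Hours   = rnorm(60, mean = 5, sd = 1),
  AvgMark = factor(rep(c("A", "B", "C"), times = c(10, 30, 20)))
)

n_max <- max(table(toy$AvgMark))  # size of the largest class (30 here)
rows_by_class <- split(seq_len(nrow(toy)), toy$AvgMark)

# Keep every original row and top each class up to n_max
# by drawing the missing rows with replacement
keep <- unlist(lapply(rows_by_class, function(idx) {
  extra <- idx[sample.int(length(idx), n_max - length(idx), replace = TRUE)]
  c(idx, extra)
}))
toy_up <- toy[keep, ]

table(toy_up$AvgMark)
sum(duplicated(keep))  # number of rows that are repeats of an earlier row
```

Repeated rows are worth keeping in mind when the data is later split into training and test sets, since copies of the same observation can end up on both sides.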

Embedded R Shiny apps

In the following week(s) I’ll explain how to create and deploy an interactive R application (referred to as Shiny) to your blog/website. Here is a simple example where the throw of a die is simulated once the button is pressed. The number on which the die “lands” is recorded and added to the associated bin of the histogram. Theoretically, the histogram should be almost “flat” after a large number of iterations.

https://etiennekoen.shinyapps.io/rolldice
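The “flattening” can also be checked without Shiny. The sketch below is not the code behind the app; it simply simulates a fair six-sided die in base R and compares the relative frequencies after a few and after many rolls. The function name roll_dice, the seed and the sample sizes are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Simulate n throws of a fair six-sided die
roll_dice <- function(n) sample(1:6, size = n, replace = TRUE)

few  <- roll_dice(60)
many <- roll_dice(60000)

# Relative frequency of each face; tabulate() also counts faces that never occurred
freq_few  <- tabulate(few,  nbins = 6) / length(few)
freq_many <- tabulate(many, nbins = 6) / length(many)

round(rbind(few = freq_few, many = freq_many), 3)
```

With only 60 rolls the frequencies typically scatter noticeably around 1/6, while with 60000 rolls they sit much closer to it. Passing freq_many to barplot() gives the nearly flat picture.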

After further investigation it seems that iframes, for security reasons, are not supported on WordPress.com sites, so for now please click on the link above. Embedding can be enabled by hosting the site yourself (WordPress.org) and installing one of the many plugins that provide this functionality.

However, it is possible to embed YouTube videos using shortcode tags, even on sites hosted on WordPress.com.

Sampling

When training a machine learning model, it’s often the case that some of the outcomes to be predicted, or the features/variables associated with those outcomes, are non-uniformly distributed. In many scenarios we find distributions to be “bell-shaped”, or more formally, normally distributed:

> x<-seq(-4,4,length=200)
> y<-dnorm(x,mean=0, sd=1)
> plot(x,y, type="l", lwd=2)

[Figure: rnorm]
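While dnorm gives the density curve, rnorm draws random values from the same distribution. As a small sanity check (the sample size and seed are arbitrary), roughly 68% of the draws from a standard normal should fall within one standard deviation of the mean:

```r
set.seed(1)  # arbitrary seed for a repeatable draw
draws <- rnorm(10000, mean = 0, sd = 1)

# Empirical share of draws within one standard deviation of the mean
share_within_1sd <- mean(abs(draws) <= 1)
share_within_1sd

# Theoretical value from the cumulative distribution function, about 0.683
pnorm(1) - pnorm(-1)
```

Calling hist(draws) shows the same bell shape as the density curve plotted above.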

Let’s start with an elementary example:

Consider a class of 100 students who were each awarded a mark from A to E at the end of the school term. Let’s say the most common mark was a “C”. A few students did quite well and achieved an A aggregate, while others did not perform as well and achieved an E aggregate, and so on.

To perform such a simulation in R we sample from a list of marks with some probability associated with each aggregated mark:

> marks <- sample(LETTERS[1:5],100,prob=c(0.1,0.2,0.4,0.2,0.1),replace=T)
> marks
  [1] "C" "B" "D" "C" "B" "C" "D" "C" "E" "C" "C" "D" "C" "C" "A" "E" "C" "B" "B" "C" "A"
 [22] "C" "B" "B" "C" "C" "A" "B" "C" "D" "C" "B" "C" "C" "B" "B" "A" "C" "A" "E" "C" "C"
 [43] "A" "C" "B" "E" "D" "C" "A" "C" "C" "B" "C" "B" "D" "D" "D" "B" "C" "B" "B" "A" "B"
 [64] "D" "C" "D" "B" "B" "C" "E" "C" "B" "B" "B" "C" "E" "D" "C" "A" "A" "A" "B" "A" "E"
 [85] "D" "D" "B" "D" "B" "E" "E" "C" "C" "D" "B" "A" "C" "B" "D" "D"

Note that I am using LETTERS[1:5] to sample from A (1) to E (5) with the respective probabilities prob=c(0.1,0.2,0.4,0.2,0.1), and with replacement (replace=T).

The bar plot of the sampled marks looks as follows:

> barplot(table(marks),col=1:5,xlab="Mark",ylab="Number of students")

[Figure: studentsRaw]

As noted, C is the most common aggregate.
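To see how closely a single draw of 100 students follows the requested probabilities, the observed proportions can be placed next to the expected ones. This is a sketch; the seed and the helper names probs and observed are my own additions, so the numbers will differ from the sample printed above.

```r
set.seed(123)  # arbitrary seed, only so the draw is repeatable

probs <- c(A = 0.1, B = 0.2, C = 0.4, D = 0.2, E = 0.1)
marks <- sample(names(probs), 100, prob = probs, replace = TRUE)

# factor() with explicit levels keeps any mark that happens to get zero draws
observed <- prop.table(table(factor(marks, levels = names(probs))))

round(rbind(expected = probs, observed = as.vector(observed)), 2)
```

With only 100 students the observed proportions will usually deviate a little from the expected ones; increasing the sample size brings them closer together.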