Furthermore, each symbol population had an associated quantitative distribution of the hours spent studying.

Fortunately, the **caret** package in R can be used to address this problem of class imbalance.

A uniform (balanced) class population can be achieved in two ways: by “down” sampling the populations that have more elements, or by “up” sampling the populations that have fewer elements.

Consider the data frame which was sampled in the previous post (Sampling in R) whose population of marks looked like the following:

table(df$AvgMark)

*A B C D E*

* 3000 5000 7000 4000 1000*
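To follow along without the earlier post, a data frame with the same class counts can be simulated. This is only a minimal sketch: the class sizes come from the table above, but the per-mark mean hours (`mean_hours`), the standard deviation, and the seed are my own assumptions, not the values used in the original post.

```r
# Hypothetical reconstruction of df: the class sizes match the table above,
# but the mean and standard deviation of Hours per mark are assumed values.
set.seed(1)

counts     <- c(A = 3000, B = 5000, C = 7000, D = 4000, E = 1000)
mean_hours <- c(A = 10, B = 8, C = 6, D = 4, E = 2)  # assumed, for illustration

# One label per student, stored as a factor (caret's samplers expect a factor)
AvgMark <- factor(rep(names(counts), times = counts), levels = names(counts))

# Normally distributed study hours centred on the assumed mean of each mark
Hours <- rnorm(length(AvgMark), mean = mean_hours[as.character(AvgMark)], sd = 0.5)

# AvgMark is the last column, which is what df[, -ncol(df)] relies on
df <- data.frame(Hours = Hours, AvgMark = AvgMark)

table(df$AvgMark)
```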

And the quantitative distribution (hours of studying) within each mark was represented by:

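library(caret)

set.seed(9560)  # fix the seed so the random down sampling is reproducible

# x: the predictor column(s), i.e. everything except the last column (AvgMark)
# y: the class labels; downSample() expects a factor, hence factor()
down_df <- downSample(x = df[, -ncol(df), drop = FALSE], y = factor(df$AvgMark))

colnames(down_df) <- c("Hours","AvgMark")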

The result is a down sampled, uniform distribution of marks:

table(down_df$AvgMark)

*A B C D E*

* 1000 1000 1000 1000 1000*

Also important here is that the normally distributed quantitative distribution (hours of studying) within each population is maintained.

As with down sampling, the same approach can be taken to “up” sample the populations with fewer elements to achieve a uniform distribution, as follows:

# Same arguments as downSample(); upSample() also expects y to be a factor
up_df <- upSample(x = df[, -ncol(df), drop = FALSE], y = factor(df$AvgMark))

colnames(up_df) <- c("Hours","AvgMark")

And the result:

table(up_df$AvgMark)

*A B C D E*

* 7000 7000 7000 7000 7000*

*Notice that there are 7000 elements pertaining to each class in comparison to only 1000 elements for the down sampling scenario.*
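To make the difference between the two approaches more concrete, here is a rough base R sketch of the idea. It is an illustration only and not caret's implementation; `balance_classes()` and the `toy` data frame are hypothetical names I made up for this example.

```r
# Illustration only: a base R sketch of the idea, not caret's implementation.
# balance_classes() and toy are hypothetical names.
balance_classes <- function(data, class_col, method = c("down", "up")) {
  method <- match.arg(method)
  groups <- split(seq_len(nrow(data)), data[[class_col]])  # row indices per class
  sizes  <- lengths(groups)
  target <- if (method == "down") min(sizes) else max(sizes)

  keep <- unlist(lapply(groups, function(idx) {
    if (method == "down") {
      # keep a random subset of `target` rows, without replacement
      idx[sample.int(length(idx), target)]
    } else {
      # keep every original row and add randomly repeated rows up to `target`
      c(idx, idx[sample.int(length(idx), target - length(idx), replace = TRUE)])
    }
  }), use.names = FALSE)

  data[keep, , drop = FALSE]
}

set.seed(42)
toy <- data.frame(
  Hours   = rnorm(60, mean = 5),
  AvgMark = factor(rep(c("A", "B", "C"), times = c(10, 30, 20)))
)

down_toy <- balance_classes(toy, "AvgMark", "down")
up_toy   <- balance_classes(toy, "AvgMark", "up")

table(down_toy$AvgMark)  # every class reduced to the smallest class size (10)
table(up_toy$AvgMark)    # every class grown to the largest class size (30)
```

The trade-off is visible in the sketch: down sampling throws rows away, while up sampling repeats rows that already exist in the smaller classes.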

https://etiennekoen.shinyapps.io/rolldice

It seems that iframes are not supported on WordPress.com, so for now please click on the link above. After further investigation, it turns out that iframes are disabled on WordPress (**.com**) sites for security reasons. They can be enabled by hosting the site yourself and installing one of the many plugins that provide this functionality for WordPress (**.org**) blogs.

However, it is possible to embed YouTube videos using shortcode tags, even on WordPress.com hosted sites.

> x <- seq(-4, 4, length = 200)

> y <- dnorm(x, mean = 0, sd = 1)

> plot(x, y, type = "l", lwd = 2)

Let’s start with an elementary example:

Consider a class of 100 students who were each awarded a mark from A to E at the end of the school term. Let’s say the majority of students obtained a “C”. A few students did quite well and achieved an A aggregate, while others did not perform as well and achieved an E aggregate, and so on.

To perform such a simulation in R we sample from a list of marks with some probability associated with each aggregated mark:

> marks <- sample(LETTERS[1:5],100,prob=c(0.1,0.2,0.4,0.2,0.1),replace=T)

> marks

 [1] "C" "B" "D" "C" "B" "C" "D" "C" "E" "C" "C" "D" "C" "C" "A" "E" "C" "B" "B" "C" "A"
[22] "C" "B" "B" "C" "C" "A" "B" "C" "D" "C" "B" "C" "C" "B" "B" "A" "C" "A" "E" "C" "C"
[43] "A" "C" "B" "E" "D" "C" "A" "C" "C" "B" "C" "B" "D" "D" "D" "B" "C" "B" "B" "A" "B"
[64] "D" "C" "D" "B" "B" "C" "E" "C" "B" "B" "B" "C" "E" "D" "C" "A" "A" "A" "B" "A" "E"
[85] "D" "D" "B" "D" "B" "E" "E" "C" "C" "D" "B" "A" "C" "B" "D" "D"

Note that I am using LETTERS[1:5] to sample from A(1) to E(5) with probabilities prob=c(0.1,0.2,0.4,0.2,0.1) accordingly and with replacement (replace=T).

The bar plot of the sampled marks will look as follows:

`>barplot(table(marks),col=1:5,xlab="Mark",ylab="Number of students")`

As noted, the majority of students achieved a C aggregate.
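To check how closely a sample follows the specified probabilities, the observed proportions can be compared with `prob`. A minimal sketch follows; the seed and the names `probs` and `observed` are my own additions, and with only 100 draws the observed values will differ somewhat from the expected ones.

```r
set.seed(123)  # arbitrary seed, used only so the draw is reproducible

probs <- c(A = 0.1, B = 0.2, C = 0.4, D = 0.2, E = 0.1)
marks <- sample(names(probs), 100, prob = probs, replace = TRUE)

# factor() with explicit levels keeps any mark with a zero count in the table
observed <- prop.table(table(factor(marks, levels = names(probs))))

# Expected versus observed proportion for each mark, side by side
round(rbind(expected = probs, observed = as.numeric(observed)), 2)
```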
