When training a machine learning model, it’s often the case that the outcomes to be predicted, or the features/variables associated with them, are non-uniformly distributed. In many scenarios we find distributions that are “bell-shaped”, or, more formally, normally distributed:

> x <- seq(-4, 4, length=200)
> y <- dnorm(x, mean=0, sd=1)
> plot(x, y, type="l", lwd=2)

Let’s start with an elementary example:

Consider a class of 100 students who were each awarded a mark from A to E at the end of the school term. Let’s say the most common mark was a “C”. A few students did quite well and achieved an A aggregate, while others did not perform as well and achieved an E aggregate, and so on…

To simulate this in R, we sample from a vector of marks, with a probability associated with each mark:

> marks <- sample(LETTERS[1:5],100,prob=c(0.1,0.2,0.4,0.2,0.1),replace=T)
> marks
  [1] "C" "B" "D" "C" "B" "C" "D" "C" "E" "C" "C" "D" "C" "C" "A" "E" "C" "B" "B" "C" "A"
 [22] "C" "B" "B" "C" "C" "A" "B" "C" "D" "C" "B" "C" "C" "B" "B" "A" "C" "A" "E" "C" "C"
 [43] "A" "C" "B" "E" "D" "C" "A" "C" "C" "B" "C" "B" "D" "D" "D" "B" "C" "B" "B" "A" "B"
 [64] "D" "C" "D" "B" "B" "C" "E" "C" "B" "B" "B" "C" "E" "D" "C" "A" "A" "A" "B" "A" "E"
 [85] "D" "D" "B" "D" "B" "E" "E" "C" "C" "D" "B" "A" "C" "B" "D" "D"

Note that I am using LETTERS[1:5] to sample from A (1) to E (5), with probabilities prob=c(0.1,0.2,0.4,0.2,0.1) respectively, and with replacement (replace=T).
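Because sample() draws at random, every run gives a different vector. Here is a minimal sketch of how the draw could be made reproducible and compared with the requested probabilities. The seed value and the names grades, probs, marks_demo and comparison are my own illustrative choices, not part of the example above.

```r
# Fix the random seed so the draw is reproducible (42 is an arbitrary choice)
set.seed(42)

grades <- LETTERS[1:5]                 # "A" "B" "C" "D" "E"
probs  <- c(0.1, 0.2, 0.4, 0.2, 0.1)   # one probability per grade, in order

marks_demo <- sample(grades, size = 100, replace = TRUE, prob = probs)

# Use a factor with all five levels so that a grade that happens
# not to be drawn still appears with a count of zero
marks_demo <- factor(marks_demo, levels = grades)

# Empirical proportions next to the probabilities we asked for
observed   <- prop.table(table(marks_demo))
comparison <- rbind(requested = probs, observed = as.numeric(observed))
colnames(comparison) <- grades
round(comparison, 2)
```

With only 100 draws the observed proportions will be close to, but rarely exactly equal to, the requested ones.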

A bar plot of the sampled marks looks as follows:

`>barplot(table(marks),col=1:5,xlab="Mark",ylab="Number of students")`

As noted, a C aggregate is the most common mark in the class.
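The same observation can be checked numerically rather than read off the plot. In this sketch, marks is regenerated with an arbitrary seed so that the snippet runs on its own, so the exact counts will differ from the output printed earlier.

```r
set.seed(1)  # arbitrary seed, only so this snippet is reproducible
marks <- sample(LETTERS[1:5], 100, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)

counts <- table(marks)             # number of students per mark
counts

names(counts)[which.max(counts)]   # the most frequent mark
max(counts) / length(marks)        # its share of the class
```

The last line is worth a look: with a probability of 0.4 for a C, the most frequent mark will usually cover well under half of the class.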