Sampling

When training a machine learning model, it’s often the case that the outcomes to be predicted, or the features/variables associated with those outcomes, are not uniformly distributed. In many scenarios we find the distribution to be “bell-shaped” or, more formally, normally distributed:

```
> x <- seq(-4, 4, length.out = 200)
> y <- dnorm(x, mean = 0, sd = 1)
> plot(x, y, type = "l", lwd = 2)
```
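To connect the density curve to actual sampling, here is a minimal sketch that draws random values with `rnorm` and overlays the theoretical density on their histogram. The seed, the sample size of 1000, and the variable name `draws` are my own choices for illustration, not part of the example above.

```r
set.seed(42)                              # arbitrary seed, only for reproducibility
draws <- rnorm(1000, mean = 0, sd = 1)    # 1000 random draws from N(0, 1)

# Plot the histogram on the density scale so it is comparable with dnorm()
hist(draws, breaks = 30, freq = FALSE,
     main = "Sampled values vs. density", xlab = "x")

# Overlay the theoretical bell curve on the same axes
curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4, add = TRUE, lwd = 2)
```

With more draws the histogram should follow the curve more closely.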

Consider a class of 100 students who were each awarded a mark from A to E at the end of the school term. Let’s say C was the most common mark. A few students did quite well and achieved an A aggregate, while others did not perform as well and achieved an E aggregate, and so on.

To simulate this in R, we sample from a vector of marks, with a probability associated with each mark:

```
> marks <- sample(LETTERS[1:5], 100, prob=c(0.1,0.2,0.4,0.2,0.1), replace=T)
> marks
 [1] "C" "B" "D" "C" "B" "C" "D" "C" "E" "C" "C" "D" "C" "C" "A" "E" "C" "B" "B" "C" "A"
[22] "C" "B" "B" "C" "C" "A" "B" "C" "D" "C" "B" "C" "C" "B" "B" "A" "C" "A" "E" "C" "C"
[43] "A" "C" "B" "E" "D" "C" "A" "C" "C" "B" "C" "B" "D" "D" "D" "B" "C" "B" "B" "A" "B"
[64] "D" "C" "D" "B" "B" "C" "E" "C" "B" "B" "B" "C" "E" "D" "C" "A" "A" "A" "B" "A" "E"
[85] "D" "D" "B" "D" "B" "E" "E" "C" "C" "D" "B" "A" "C" "B" "D" "D"
```

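Note that I am using `LETTERS[1:5]` to sample from the marks A (1) to E (5), with the corresponding probabilities `prob=c(0.1,0.2,0.4,0.2,0.1)`, and with replacement (`replace=T`, where `T` is shorthand for `TRUE`). Sampling with replacement is needed here because we draw 100 marks from only five letters.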
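Because `sample` is random, each run produces different marks. A minimal sketch of a reproducible version follows, assuming an arbitrary seed and using the illustrative names `grades` and `probs`, which I have introduced here:

```r
set.seed(1)  # arbitrary seed so the same marks are drawn on every run

grades <- LETTERS[1:5]                               # "A" "B" "C" "D" "E"
probs  <- c(A = 0.1, B = 0.2, C = 0.4, D = 0.2, E = 0.1)
stopifnot(isTRUE(all.equal(sum(probs), 1)))          # sanity check: weights sum to 1

# replace = TRUE is required: we draw 100 marks from only 5 letters
marks <- sample(grades, size = 100, replace = TRUE, prob = probs)
head(marks)                                          # first six sampled marks
```

Leaving out `replace = TRUE` here makes `sample` stop with an error, since it cannot draw 100 distinct values from a population of five.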

A bar plot of the sampled marks looks as follows:

`> barplot(table(marks), col=1:5, xlab="Mark", ylab="Number of students")`

As expected, C is the most common aggregate.
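The same conclusion can be checked numerically rather than read off the plot. The sketch below redraws the marks with an arbitrary seed, so its counts will differ from the output printed earlier, and `counts` is an illustrative name of my own:

```r
set.seed(1)  # arbitrary seed for reproducibility
marks <- sample(LETTERS[1:5], 100,
                prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)

# factor() with explicit levels keeps all five marks in the table,
# even if one of them happens never to be drawn
counts <- table(factor(marks, levels = LETTERS[1:5]))

counts                      # number of students per mark
prop.table(counts)          # observed proportions, to compare with prob
names(which.max(counts))    # the most frequent mark
```

The observed proportions should be close to, but not exactly equal to, the probabilities passed to `sample`.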
