# Sampling

When training a machine learning model, it is often the case that the outcomes to be predicted, or the features/variables associated with those outcomes, are non-uniformly distributed. In many scenarios we find distributions that are roughly “bell-shaped”, or more formally, normally distributed:

```
> x <- seq(-4, 4, length.out = 200)
> y <- dnorm(x, mean = 0, sd = 1)
> plot(x, y, type = "l", lwd = 2)
```

Let’s start with an elementary example:

Consider a class of 100 students who were each awarded a mark from A to E at the end of the school term. Let’s say C is the most common mark. A few students did quite well and achieved an A aggregate, while a few others did not perform as well and achieved an E aggregate, and so on.

To simulate this class in R, we sample from the list of marks, with a probability associated with each mark:

```
> marks <- sample(LETTERS[1:5],100,prob=c(0.1,0.2,0.4,0.2,0.1),replace=T)
> marks
 "C" "B" "D" "C" "B" "C" "D" "C" "E" "C" "C" "D" "C" "C" "A" "E" "C" "B" "B" "C" "A"
 "C" "B" "B" "C" "C" "A" "B" "C" "D" "C" "B" "C" "C" "B" "B" "A" "C" "A" "E" "C" "C"
 "A" "C" "B" "E" "D" "C" "A" "C" "C" "B" "C" "B" "D" "D" "D" "B" "C" "B" "B" "A" "B"
 "D" "C" "D" "B" "B" "C" "E" "C" "B" "B" "B" "C" "E" "D" "C" "A" "A" "A" "B" "A" "E"
 "D" "D" "B" "D" "B" "E" "E" "C" "C" "D" "B" "A" "C" "B" "D" "D"
```

Note that I am using `LETTERS[1:5]` to sample from the marks A to E, with the probabilities `prob=c(0.1,0.2,0.4,0.2,0.1)` assigned to A through E respectively, and with replacement (`replace=T`), so the same mark can be drawn more than once.
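Because `sample` draws at random, the output changes on every run and the observed proportions only approximate the probabilities passed in. The sketch below is one way to check this: it fixes a seed (the value 42 is an arbitrary assumption, not part of the original example) and compares the proportions from 100 draws with those from a much larger sample. The names `grades`, `probs`, `marks_small` and `marks_large` are introduced here purely for illustration.

```r
# Arbitrary seed, chosen only so that the run is repeatable.
set.seed(42)

grades <- LETTERS[1:5]
probs  <- c(0.1, 0.2, 0.4, 0.2, 0.1)
names(probs) <- grades

# Same kind of call as above: 100 draws with replacement.
marks_small <- sample(grades, 100, prob = probs, replace = TRUE)

# A much larger sample, to see the proportions settle near probs.
marks_large <- sample(grades, 100000, prob = probs, replace = TRUE)

# factor() with explicit levels keeps a zero count for any mark never drawn.
prop_small <- prop.table(table(factor(marks_small, levels = grades)))
prop_large <- prop.table(table(factor(marks_large, levels = grades)))

# One row per source: target probabilities, then the two samples.
comparison <- rbind(target  = probs,
                    n100    = as.vector(prop_small),
                    n100000 = as.vector(prop_large))
print(round(comparison, 3))
```

With only 100 draws the proportions can be a few percentage points away from the targets; the larger sample should sit much closer to them.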

A bar plot of the sampled marks looks as follows:

`> barplot(table(marks),col=1:5,xlab="Mark",ylab="Number of students")`

As noted, C is the most common mark among the students.
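To judge how closely one sample follows the probabilities it was drawn from, the observed counts can be placed next to the expected counts (the number of students times each probability). This is a minimal sketch rather than part of the original example: the seed, the grey colours, and the names `observed`, `expected` and `counts` are assumptions made for illustration.

```r
# Arbitrary seed, only for repeatability.
set.seed(1)

grades <- LETTERS[1:5]
probs  <- c(0.1, 0.2, 0.4, 0.2, 0.1)
n      <- 100

marks <- sample(grades, n, prob = probs, replace = TRUE)

# Observed counts per mark; explicit levels keep a bar even for a mark never drawn.
observed <- as.vector(table(factor(marks, levels = grades)))

# Expected counts are n times each probability.
expected <- n * probs

counts <- rbind(observed = observed, expected = expected)
colnames(counts) <- grades

# beside = TRUE draws the two rows as adjacent bars instead of stacking them.
barplot(counts, beside = TRUE, col = c("grey30", "grey80"),
        xlab = "Mark", ylab = "Number of students",
        legend.text = rownames(counts))
```

Each mark then gets a pair of bars, so any gap between the sample and the target distribution is visible at a glance.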

## 2 thoughts on “Sampling”

1. Etienne says:

Just to warm up… Does anyone have a suggestion for displaying code blocks on WordPress with line numbers?
