Chapter 3: Probability, distribution, parameter estimates and likelihood
Author
Jakub Těšitel
Random variable and probability distribution
Imagine tossing a coin. Before you make a toss, you don’t know the result and you cannot affect the outcome. The set of future outcomes generated by such a process is a random variable. Randomness does not mean that you know nothing about the possible outcomes of this process. You know the two possible outcomes that can be produced and the expectation of getting one or the other (assuming that the coin is “fair”).
A random variable can thus be described by its properties. This description of the process generating the random variable determines what we can expect of individual future observations: the probabilities.
We are not limited to a single observation but can consider a series of them. Then it makes sense to ask, e.g., what is the probability of getting less than 40 eagles in 100 tosses? If we do not fix the value at 40 but instead study the probabilities for all possible values (here: What is the probability of getting less than 1, 2, 3, …, 100 eagles in 100 tosses?), we first need the probability associated with each individual value: \[
p_i = P(X = x_i)
\] where \(p_i\) is the probability of observing exactly the value \(x_i\). From these probabilities, we can construct the probability distribution function defined as: \[
F(X) = \sum_{x_i < X}p_i
\] In human language, this translates as: Take the probabilities \(p_i\) of all values \(x_i\) lower than \(X\), compute their sum, and you get the value of the probability distribution function for the value \(X\) (Fig 3.1a). Let’s simulate it:
set.seed(42)
# First, we need a randomly generated sample from the binomial distribution
# n = number of observations (each observation = 100 tosses)
# size = number of tosses within each observation
# prob = probability of eagle coming up in each toss
coin.r <- rbinom(n = 101, size = 100, prob = 0.5)
# Second, we need the density function to plot theoretical probabilities on the plot
coin.d <- dbinom(0:100, size = 100, prob = 0.5)
# Third, we also need the distribution function
coin.p <- pbinom(0:100, size = 100, prob = 0.5)
# This is needed for plotting the cumulative histogram
h <- hist(coin.r, breaks = c(0:100), plot = F)
h$counts <- cumsum(h$counts) / sum(h$counts)
# Plots
par(mfrow = c(2, 2), mar = c(5, 4, 4, 2))
plot(h,
     xlim = range(35, 65),
     freq = T,
     main = 'Probability distribution \nof coin tosses',
     ylab = 'Probability',
     xlab = 'Less than X% eagle in 100 tosses')
lines(0:100, coin.p, col = 'red')  # theoretical distribution function
mtext("(a)", side = 3, adj = -0.2, line = 2, font = 2)
hist(coin.r, breaks = c(35:65),
     xlab = '% of eagles in 100 tosses',
     main = 'Frequency histogram \nof coin tosses')
mtext("(b)", side = 3, adj = -0.2, line = 2, font = 2)
hist(coin.r, breaks = c(35:65),
     freq = F,
     xlab = '% of eagles in 100 tosses',
     ylab = 'Density',
     main = 'Probability density histogram \nof coin tosses')
lines(0:100, coin.d, col = 'red', lwd = 1)  # theoretical density
mtext("(c)", side = 3, adj = -0.2, line = 2, font = 2)
par(mfrow = c(1, 1))
Fig 3.1. Probability (a), frequency (b) and density (c) distribution of coin tosses (n = 101, size = 100, p = 0.5). Grey histograms represent sampling statistics (probability, frequency, density). The red lines in (a) and (c) represent the theoretical binomial probability distribution and density, respectively. (d) A standard ten-corona coin of the Austro-Hungarian Empire used for tossing, depicted to illustrate why we call the coin sides Head and Eagle instead of Brno and Lion as on the current 10 CZK coin.
Another option to explore the distribution of values is to sample the random variable and examine the properties of such a sample. After you take such a sample (or make a measurement), i.e. record events generated by the random variable, the corresponding values cease to be a random variable and become data. The data values may be plotted on a histogram of frequencies (Fig 3.1b, see also Chapter 2).
The frequency histogram can be converted to a probability density histogram (Fig 3.1c) by rescaling it so that its total area equals 1. The density histogram has a great advantage in that the probability of observing a value within a given interval can be read directly as the area of the corresponding columns. The grey histograms in Fig 3.1a and 3.1c show the sampling probability distribution and density, i.e. estimates based on the data.
By contrast, the red lines (Fig 3.1a and 3.1c) indicate the theoretical probability distribution and density, i.e. how the values should behave if they followed the theoretical binomial distribution, which describes the coin-tossing process. As you can see, the sampling and theoretical distributions do not match exactly, but there does not seem to be any systematic bias. The theoretical probability density can thus be viewed as an idealized density histogram.
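We can also check the match numerically by answering the opening question (What is the probability of getting less than 40 eagles in 100 tosses?) both ways. A minimal sketch, reusing the coin.r sample simulated above:

pbinom(39, size = 100, prob = 0.5)  # theoretical: P(X < 40) = P(X <= 39), about 0.018
mean(coin.r < 40)                   # sampling estimate: proportion of observations with less than 40 eagles

With only 101 observations, the sampling estimate of such a rare event may easily come out as zero; the theoretical distribution function does not suffer from this limitation.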
Box 6. Head and Eagle
The ten-corona coin of the Austro-Hungarian Empire illustrates why we call the coin sides Head and Eagle.
There are many types of theoretical distributions, describing many different processes that generate random variables. Each type can further take many shapes, depending on the parameters of its probability distribution function. E.g. the shape of the binomial distribution, which describes our coin-tossing problem, is defined by the parameter \(p\), the probability of observing one of the outcomes in a single trial, and \(size\), the number of trials (tosses in our case).
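For instance (parameter values chosen arbitrarily for illustration), we can evaluate how these parameters change the binomial probabilities:

dbinom(5, size = 10, prob = 0.5)  # P(exactly 5 eagles in 10 tosses of a fair coin), about 0.25
dbinom(5, size = 10, prob = 0.8)  # the same outcome with a biased coin is much less likely, about 0.03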
Coin tossing produced discrete values to which probabilities could be directly assigned because there is a limited number of possible outcomes. This is not possible with continuous variables, where the number of possible values is infinite and the probability of any single exact value is zero. However, if you look back at the definition of the probability distribution function, this is not a problem: for any value, you can still ask for the probability of falling below it.
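To illustrate (using the normal distribution introduced below, with made-up parameter values), the distribution function gives the probability of falling below a value, or within an interval:

pnorm(175, mean = 170, sd = 10)  # P(X < 175), about 0.69
pnorm(175, mean = 170, sd = 10) - pnorm(165, mean = 170, sd = 10)  # P(165 < X < 175), about 0.38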
Normal distribution
So far, our example has used the binomial distribution. Among the many types of theoretical distributions, we will now focus on the normal (Gaussian) distribution. This distribution describes a process producing values symmetrically distributed around the centre of the distribution. The normal distribution can be used to describe (or approximate) the distribution of variables measured on ratio and interval scales. It has two parameters, which define its shape:
the central tendency (expected value) called the mean, i.e. the sum of all values divided by the number of objects: \[
\mu = \frac{\sum_{i = 1}^{N}X_i}{N}
\]
the variance, which defines the spread of the probability density, i.e. the mean squared difference of individual values from the mean (how far from the centre the values usually fall): \[
\sigma^2 = \frac{\sum_{i = 1}^{N}(X_i - \mu)^2}{N}
\] Variance is given in squared units of the variable itself (e.g. in \(m^2\) for length). Therefore, the standard deviation (\(\sigma\), SD), which is simply the square root of the variance, is frequently used: \[
\sigma = \sqrt{\sigma^2}
\] The common notation for the normal distribution with mean \(\mu\) and variance \(\sigma^2\) is \(N(\mu, \sigma^2)\).
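To connect these formulas with R, here is a small sketch computing them by hand on a made-up sample. One caveat: R’s built-in var() and sd() divide by \(N - 1\) rather than \(N\), because they compute sample estimates rather than population parameters:

x <- c(4, 7, 5, 9, 5)          # small made-up sample
N <- length(x)
mu <- sum(x) / N               # mean, following the formula above; identical to mean(x)
sigma2 <- sum((x - mu)^2) / N  # population variance (denominator N), here 3.2
sqrt(sigma2)                   # population standard deviation
var(x)                         # built-in: denominator N - 1, here 4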
The normal distribution has non-zero probability density over the entire real line. This implies that it may not be a suitable approximation for some variables, e.g. physical quantities such as lengths or masses, which cannot be lower than zero. However, the normal density gets close to zero several standard deviations (SD units) away from the mean (Fig. 3.2b). This means that the normal distribution may be used for always-positive variables (like length, mass, etc.) only if the mean is reasonably far from zero (measured in SD units). At the same time, this implies that outlying values are not expected, and the normal approximation may be problematic for variables containing them.
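We can quantify how problematic the approximation is by asking how much probability mass a normal distribution places below zero (means and SDs here are made up for illustration):

pnorm(0, mean = 50, sd = 10)  # mean 5 SDs above zero: mass below zero is negligible
pnorm(0, mean = 10, sd = 10)  # mean only 1 SD above zero: about 16% below zero, clearly problematic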
par(mfrow = c(2, 2))  # Set 2 x 2 plotting grid
# (a) Normal densities with varying parameters
curve(dnorm(x, mean = 120, sd = sqrt(49)),  # N(120, 49)
      from = 40, to = 160,                  # set x axis range
      ylim = range(0, 0.065),               # set y axis range
      ylab = 'Probability density',
      xlab = '')
text(x = 120, y = 0.06, labels = 'N(120, 49)')  # annotate the curve
curve(dnorm(x, mean = 120, sd = sqrt(100)), col = 3, add = T)  # similarly for other means and sds
text(x = 120, y = 0.043, labels = 'N(120, 100)', col = 3)
curve(dnorm(x, mean = 100, sd = sqrt(100)), col = 2, add = T)
text(x = 100, y = 0.043, labels = 'N(100, 100)', col = 2)
curve(dnorm(x, mean = 100, sd = sqrt(400)), col = 4, add = T)
text(x = 100, y = 0.023, labels = 'N(100, 400)', col = 4)
mtext("(a)", side = 3, adj = -0.2, line = 2, font = 2)  # label panel
# (b) SD-unit intervals around the mean
curve(dnorm(x, mean = 100, sd = sqrt(100)),  # N(100, 100), i.e. mean = 100, sd = 10
      from = 60, to = 140,
      col = 2,
      main = 'N(100, 100)',
      ylab = 'Probability density',
      xlab = '')
abline(v = 100, col = 2, lwd = 2)                    # vertical line at the mean
abline(v = c(100 - 10, 100 + 10), col = 2, lty = 2)  # mean +/- 1 SD
abline(v = c(100 - 20, 100 + 20), col = 2, lty = 3)  # mean +/- 2 SD
text(x = c(85, 95, 103, 105, 115),  # annotate the lines
     y = c(0.035, 0.035, 0.015, 0.035, 0.035),
     labels = c('mean\n - 2SD', 'mean\n - SD', 'mean', 'mean\n + SD', 'mean\n + 2SD'),
     col = 2)
text(x = 75, y = 0.015, labels = '2.5%\nquantile', col = 2)
text(x = 125, y = 0.015, labels = '97.5%\nquantile', col = 2)
arrows(75, 0.01, 80, 0.005, length = 0, col = 2)
arrows(125, 0.01, 120, 0.005, length = 0, col = 2)
mtext("(b)", side = 3, adj = -0.2, line = 2, font = 2)  # label panel
# (c) The standard normal distribution
curve(dnorm(x, mean = 0, sd = 1),  # standardised normal distribution (Z)
      from = -3, to = 3,
      col = 2, lwd = 1,
      main = 'Z = N(0, 1)',
      ylab = 'Probability density',
      xlab = 'Z')
mtext("(c)", side = 3, adj = -0.2, line = 2, font = 2)  # label panel
par(mfrow = c(1, 1))  # reset grid
Fig 3.2. Normal distribution properties: (a) varying parameters, (b) SD-unit intervals, and (c) the standard normal distribution.