Gaussian Distribution

The Gaussian distribution is perhaps the best-known and most widely used probability distribution. Also known as the normal distribution, it has several convenient properties that let us model many of the observations we encounter naturally in data.

The normal distribution is written as:

$$ X \sim \mathcal{N}(\mu, \sigma^2) $$

where:
$\mu$: the mean of the distribution
$\sigma^2$: the variance of the distribution

Gaussian Random Variable

SciPy's norm object provides an easy way to draw samples from a Gaussian random variable through its rvs method. In the example below, we draw 30 samples from a distribution with $\mu = 0$ and $\sigma = 1$. Notice the following:

$loc$: The mean of the distribution
$scale$: The standard deviation of the distribution

from scipy.stats import norm
norm.rvs(loc=0, scale=1, size=30)

array([ 1.8352436 , -0.56858646, -0.5406147 , -0.63665399, 0.72712676, 1.70597318, -0.31337281, -1.16440709, 1.18365675, 0.55631329, 0.85328682, 1.7280863 , 0.66794921, 1.0759896 , 0.17622743, 0.46568839, -0.44510858, -0.56196298, 1.94692798, -0.04223889, -0.51098755, 1.54208412, -2.49233376, -1.13626944, -0.76935402, -0.1599699 , -1.16948394, 0.46073899, 0.49325095, -0.31137008])
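The values above change on every run because the draws are random. As a minimal sketch, assuming repeatable output is wanted (for example when sharing a notebook), the random_state argument of rvs can be fixed. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# Fixing random_state makes the draws repeatable across calls and runs.
draws_a = norm.rvs(loc=0, scale=1, size=30, random_state=42)
draws_b = norm.rvs(loc=0, scale=1, size=30, random_state=42)

# Both calls used the same seed, so the two arrays are identical.
same_draws = np.array_equal(draws_a, draws_b)
print(same_draws)
print(draws_a.shape)
```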

Visualizing the Random Variable

Similar to what we have seen with other distributions, we can visualize samples from the distribution as a histogram. Below is the code to render a histogram of 10,000 draws from the Gaussian distribution.

import seaborn as sns

norm_rv = norm.rvs(size=10000, loc=0, scale=1)
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(norm_rv, bins=20, edgecolor="k", linewidth=1)

Properties of Normal Distribution

I mentioned earlier that the Gaussian distribution is one of the most widely used distributions. Below are some properties that make it practical for modelling purposes:

Normal distributions are symmetric around their mean.
The mean, median, and mode of a normal distribution are equal.
Approximately 68% of the area of a normal distribution is within one standard deviation of the mean.
Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
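The last two properties can be checked numerically. The sketch below assumes a standard normal distribution ($\mu = 0$, $\sigma = 1$) and uses the cdf method to measure the area within one and two standard deviations of the mean. The variable names are illustrative.

```python
from scipy.stats import norm

std_normal = norm(loc=0, scale=1)

# Area between -k and +k standard deviations is cdf(k) - cdf(-k).
within_one = std_normal.cdf(1) - std_normal.cdf(-1)
within_two = std_normal.cdf(2) - std_normal.cdf(-2)

print(round(within_one, 4))  # 0.6827
print(round(within_two, 4))  # 0.9545
```

Because any normal distribution is a shifted and rescaled standard normal, the same proportions hold for other values of $\mu$ and $\sigma$.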

Probability Density Function

The probability density function is given by the formula below:

$$ P(x\ |\ \mu, \sigma^2) = \frac {1}{\sigma \sqrt{2\pi}} e^{\frac {-(x-\mu)^2}{2\sigma^2}} $$

The probability density function gives the relative likelihood of values near $x$ under the given distribution parameters. Because the distribution is continuous, the density at a single point is not itself a probability; probabilities come from the area under the curve over a range of values.

For example, given a normal distribution centered at 5 with a standard deviation of 1, what is the density at $x = 2$?

Notice that:
$loc$: The mean of the distribution
$scale$: The standard deviation of the distribution

norm.pdf(x=2, loc=5, scale=1)

0.0044318484119380075

We see that the density at $x = 2$ under $X \sim \mathcal{N} (5, 1)$ is very low, as expected, because $2$ lies three standard deviations below the mean of 5.
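To connect the formula to the number SciPy returned, here is a minimal sketch that evaluates the density term by term using only the standard library. The function name gaussian_pdf is a hypothetical helper, not part of SciPy.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Evaluate the normal density at x directly from the formula."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Same parameters as the norm.pdf call: x = 2, mean 5, standard deviation 1.
density_at_2 = gaussian_pdf(x=2, mu=5, sigma=1)
print(density_at_2)
```

The result should agree with the value printed by norm.pdf up to floating-point rounding.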

Cumulative Distribution Function

The cdf of the Gaussian distribution is given by the formula:

$$ P(X \leq x\ |\ \mu, \sigma^2) = \frac {1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x} e^{\frac {-(t-\mu)^2}{2 \sigma^2}} \, dt $$

We rarely need to work with this formula directly, because SciPy provides a much easier and more intuitive way of computing the cumulative probability.

Below, we create a normal distribution with $\mu = 3$ and $\sigma^2 = 4$ (so $scale = \sigma = 2$) and calculate the probability $P(X \leq 2.5)$.

norm_dist = norm(loc=3, scale=2)
norm_dist.cdf(x=2.5)

0.4012936743170763

This looks about right: $2.5$ lies only a quarter of a standard deviation below the mean of 3, so slightly less than half of the probability mass falls at or below it.
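The same cumulative probability can be reproduced without SciPy. The sketch below standardizes $x$ to a z-score and uses the standard identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z / \sqrt{2})\right)$, with math.erf from the standard library. The function name gaussian_cdf is a hypothetical helper.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, via the error function."""
    z = (x - mu) / sigma  # number of standard deviations from the mean
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Same parameters as above: mean 3, standard deviation 2, x = 2.5.
prob_below = gaussian_cdf(2.5, mu=3, sigma=2)
print(prob_below)
```

Here $z = (2.5 - 3) / 2 = -0.25$, and the result should match the cdf output above up to floating-point rounding.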

Expected Value

The expected value of the Gaussian distribution is $\mu$, so the mean method simply returns the mean of the distribution. Notice that the $loc$ parameter is where we specify the mean of the normal distribution.

norm_dist = norm( loc=3, scale=2)
norm_dist.mean()

3.0

Variance

We can return the variance of the distribution using the var method. Since the variance is $\sigma^2$, we expect it to be the square of the $scale$ parameter.

norm_dist = norm( loc=3, scale=2)
norm_dist.var()

4.0

Standard Deviation

Finally, the standard deviation of the normal distribution is the $\sigma$ parameter, passed as $scale$ when instantiating the distribution, and it is returned by the std method.

norm_dist = norm( loc=3, scale=2)
norm_dist.std()

2.0
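As a rough sanity check on the three summaries above, the sketch below draws a large sample from the same distribution and compares the sample statistics with the theoretical values. The sample size of 100,000 and the seed 0 are arbitrary assumptions, and the sample statistics will only be close to the theoretical values, not equal to them.

```python
from scipy.stats import norm

norm_dist = norm(loc=3, scale=2)

# Draw a large, reproducible sample from the frozen distribution.
samples = norm_dist.rvs(size=100_000, random_state=0)

sample_mean = samples.mean()
sample_var = samples.var(ddof=1)  # ddof=1 gives the unbiased sample variance
sample_std = samples.std(ddof=1)

print(norm_dist.mean(), sample_mean)
print(norm_dist.var(), sample_var)
print(norm_dist.std(), sample_std)
```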