Basic Probability Distributions in R Programming By Dr. Mohamed Surputheen
probability distributions in R Many statistical tools and techniques used in data analysis are based on probability. Probability measures how likely it is for an event to occur on a scale from 0 (the event never occurs) to 1 (the event always occurs). A probability distribution describes how a random variable is distributed; it tells us which values a random variable is most likely to take on and which values are less likely . R comes with built-in implementations of many probability distributions. Each probability distribution in R is associated with four functions which follow a naming convention: The d-prefix function calculates the probability density function (PDF) of a continuous probability distribution, or the probability mass function (PMF) of a discrete probability distribution , at a specific value of the random variable . The p-prefix function calculates the cumulative distribution function (CDF) of a probability distribution, which gives the probability of observing a value less than or equal to a given value of the random variable The q-prefix function calculates the quantile of a probability distribution, which is the inverse of the CDF . The r-prefix function generates random numbers from a probability distribution
Binomial Distribution The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent trials with two possible outcomes (success or failure) and a constant probability of success for each trial . A binomial experiment has the following properties: experiment consists of n identical and independent trials each trial results in one of two outcomes: success or failure P(success) = p P(failure) = q = 1 - p for all trials The random variable of interest, X , is the number of successes in the n trials. X has a binomial distribution with parameters n and p
Binomial Distribution If the probability of success in each trial is given by p , then the probability of getting exactly x successful events among n trials is given by the Binomial PMF or -n is the total number of trials -x is the number of successes -p is the probability of success on each trial.
Mean , variance and Standard deviation of Binomial Distribution The mean, E(X) = p + p + … + p = n*p The variance, V(X) = pq + pq + … + pq = n* pq The standard deviation =
Probability Computations Related to Binomial Distributions R has several functions related to the binomial distribution. Here are some commonly used ones: 1. dbinom (x , size, prob ) - Probability Mass Function (PMF) or probability distribution of the binomial distribution . Calculates the probability of getting exactly x successes in size trials, given a probability prob of success on each trial . 2. pbinom (q , size, prob ) - Cumulative Distribution Function (CDF) of the binomial distribution. Calculates the probability of getting up to q successes in size trials, given a probability prob of success on each trial . 3. qbinom(p, size, prob ) - Inverse CDF of the binomial distribution . Calculates the smallest number q such that the CDF is less than or equal to p , given size trials and a probability prob of success on each trial . ( ie ) This function takes the probability value and gives a number whose cumulative value matches the probability 4. rbinom (n, size, prob ) - Random number generator for the binomial distribution . Generates n random samples from a binomial distribution with size trials and a probability prob of success on each trial .
Binomial probabilities using dbinom () function in R dbinom is the function used to find the probability mass function for the binomial distribution . The function ‘ dbinom ’ is used to obtain the exact probability using Binomial distribution, i.e. P(X=x). The syntax to compute the probability at x for binomial distribution using R is dbinom ( x,size,prob ) where x : the value(s) of the variable, size : the number of trials, and prob : the probability of success ( prob ). The dbinom () function gives the probability for given value(s) x (no. of successes), size (no. of trials) and prob (probability of success ).
Example : dbinom A coin is tossed 5 times. What is the probability of getting one head and three heads? To solve this problem using R language, we can use the dbinom () function, which calculates the binomial probability mass function. For getting one head: > dbinom (1, size=5, prob =0.5) [1] 0.15625 For getting three heads: > dbinom (3, size=5, prob =0.5) [1] 0.3125 Manual verification The probability of getting a head on a single coin toss is 1/2, and the probability of getting a tail is also 1/2. To find the probability of getting a certain number of heads in 5 coin tosses, we can use the binomial probability formula : P(x) = nCx p x q (n-x) Given n=5, p=0.5,q=0.5 So for getting one head, we have: P(X=1 ) = ( 5C1 ) * (1/2 )^1 * (1/2)^4 = 5/32 =0.15625 For getting three heads, we have: P(X=3 ) = (5 C 3) * (1/2)^3 * (1/2)^2 = 10/32 = 5/16 =0.3125
Binomial cumulative probability using pbinom() function in R The syntax to compute the cumulative probability distribution function (CDF) for binomial distribution using R is pbinom( q,size,prob ) where q : the value(s) of the variable, size : the number of trials, and prob : the probability of success ( prob ). This function is very useful for calculating the cumulative binomial probabilities for given value(s) of q (value of the variable x), size (no. of trials) and prob (probability of success).
In a university 45% of the students are female. A random sample of 10 students are selected . What is the probability that 2 or less female students are selected ? Answer: pbinom( q,size,prob ) pbinom (2,10,0.45) [1] 0.09955965
Binomial Distribution Quantiles using qbinom () in R qbinom is the R function that calculates the inverse CDF (or quqntiles ) of the binomial distribution. W.k.t, This function takes the probability value and gives a number whose cumulative value matches the probability value. The syntax to compute the inverse CDF or quantiles of binomial distribution using R is qbinom ( p,size,prob ) where p : the value(s) of the probabilities, size : the number of trials, and prob : the probability of success ( prob ). The function qbinom ( p,size,prob ) gives the Inverse CDF of Binomial distribution for given value of p, size and prob. Note: qbinom is the inverse of the pbinom function . pbinom calculates the cumulative probability distribution function (CDF) of a binomial random variable , while qbinom calculates the inverse CDF or the quantile function of the binomial distribution. Diff between CDF and invers CDF The CDF represents the probability that a random variable takes on a value less than or equal to a given value. The inverse CDF, on the other hand, does the opposite. It takes a probability as input and returns the value of the random variable that corresponds to that probability.
Example problem that demonstrates the relationship between pbinom and qbinom : Suppose we flip a fair coin 10 times. What is the probability of getting 3 or fewer heads ? To solve this problem using pbinom, we can set n = 10 and p = 0.5 (since the coin is fair) and use the following code : > pbinom(3, 10, 0.5) [1] 0.171875 This returns a probability of approximately 0.1719, meaning there is a 17.19% chance of getting 3 or fewer heads in 10 coin flips. To solve this problem using qbinom , we can again set n = 10 and p = 0.5 and use the following code : > qbinom( 0.171875, 10, 0.5 ) # This function takes the probability value and gives a number whose cumulative value matches the probability value. [1] 3 This returns a value of 3, which confirms that the probability of getting 3 or fewer heads is approximately 0.1719. Here, we used qbinom to find the value of k such that P(X ≤ k) = 0.1719.
Simulating Binomial random variable using rbinom () function in R The general R function to generate random numbers from Binomial distribution is rbinom ( n,size,prob ) where, n is the sample size, size is the number of trials, and prob is the the probability of success in binomial distribution. The function rbinom ( n,size,prob ) generates n random numbers from Binomial distribution with the number of trials size and the probability of success prob . Example: Generate 8 random values from a sample of 150 with probability of 0.4. > x <- rbinom(8,150,.4) > x [1] 61 51 54 54 56 62 62 48
Poisson Distribution The Poisson distribution is a probability distribution that describes the probability of a certain number of events occurring within a fixed time or space interval, given the average rate of occurrence( ) of those events ( ie ) The Poisson distribution models the probability of a certain number of events occurring in a fixed interval of time, given the average rate at which the events occur. The binomial distribution models the probability of a fixed number of successes in a fixed number of independent trials , while the Poisson distribution models the probability of a fixed number of occurrences in a fixed time or space interval .
In 1837 French mathematician Simeon Dennis Poisson derived the distribution as a limiting case of Binomial distribution. It is called after his name as Poisson distribution. Conditions: (i) The number of trails ‘ n’ is indefinitely large i.e., n →∞ (ii) The probability of a success ‘ p ’ for each trial is very small i.e., p →0 (iii) np = is finite (iv) Events are Independent
The random variable X is said to follow the Poisson probability distribution if it has the probability function: The pmf is given by P(X=x )= p(x) = e - l l x / x! , for x=0,1,2… where P(x ) = the probability of x successes over a given period of time or space, given = the expected number of successes per time > 0 e = 2.71828 (the base for natural logarithms) The mean of the distribution is λ. The variance of the distribution is also λ. The standard deviation of the distribution is √λ.
Probability Computations Related to Poisson Distributions in R In R, you can use the dpois(), ppois(), qpois (), and rpois () functions to work with the Poisson distribution. 1.dpois(x , lambda) calculates the Probability Mass Function ( PMF ) of the Poisson distribution at a specific value of x, given a Poisson parameter lambda. 2. ppois (q, lambda) calculates the Cumulative Distribution Function (CDF) of the Poisson distribution at a specific value of q, given a Poisson parameter lambda. 3. qpois (p , lambda) calculates the Inverse Cumulative Distribution Function ( quantile function) of the Poisson distribution at a specific probability value p, given a Poisson parameter lambda. 4. rpois (n , lambda) generates n random samples from a Poisson distribution with a Poisson parameter lambda .
dpois The dpois function calculates the probability mass function for a Poisson distribution, given a particular value x and a parameter lambda . dpois(x , lambda) x : number of successes lambda: average rate of success Example: Suppose a call center receives an average of 10 customer calls per hour. What is the probability that the call center will receive exactly 7 calls in the next hour ? Ans : > lambda <- 10 > x <- 7 > prob <- dpois(x, lambda) > prob [1] 0.09007923 To solve this problem manually using the Poisson distribution, we can use the formula : P(X=x)= p(x) = e - l l x / x! where lambda is the average number of events per interval (in this case, 10 customer calls per hour), x is the number of events we're interested in (in this case, 7 customer calls in the next hour), and e is the mathematical constant approximately equal to 2.71828 . P(X = 7) = ( e - 10 * 10 7 ) / 7! = (0.0000454 * 10,000,000) / (7 * 6 * 5 * 4 * 3 * 2 * 1) = 0.09008 Therefore, the probability of receiving exactly 7 calls in the next hour is approximately 0.090 or 9.0%.
ppois In R, you can use the ppois function to calculate the Cumulative Distribution Function ( CDF) of the Poisson distribution. The CDF gives the probability of getting k or fewer events in a certain interval of time, given the average rate of events per unit time. ppois(q, lambda) q : number of successes lambda: average rate of success Examples: It is known that a certain hospital experience 4 births per hour. In a given hour, what is the probability that 4 or less births occur? Answer : Using the Poisson Distribution with λ = 4 and x = 4, we find that P(X≤4) = 0.62884 . > ppois (4,4) [1] 0.6288369 So the probability of 4 or fewer births in an hour is approximately 0.6288 or 62.88%, which matches the result we obtained earlier. To solve this problem manually using the Poisson distribution, we can use the formula: P(X=x )= p(x) = e - l l x / x! where λ is the average rate of events per hour and x is the number of events. To find the probability of 4 or fewer births in an hour, we need to calculate the probabilities for k = 0, 1, 2, 3, and 4, and add them up: P(0 ) = ( 4 * e (-4) )/ 0! = 0.0183 P(1) = ( 4 1 * e (- 4 ) ) / 1! = 0.0733 P(2) = ( 4 2 * e (-4) ) / 2! = 0.1465 P(3) = ( 4 3 * e (-4) ) / 3! = 0.1953 P(4) = ( 4 4 * e (-4) ) / 4! = 0.1953 Therefore , the probability of 4 or fewer births in an hour is: P(0 or 1 or 2 or 3 or 4) = P(0) + P(1) + P(2) + P(3) + P(4) = 0.6287
Qpois qpois is a function in the R programming language that calculates the inverse cumulative distribution function (also known as the quantile function) for the Poisson distribution. qpois (p , lambda) p : the probability value for which you want to find the Inverse CDF. lambda : the average rate of events per unit time . Example: Suppose that the number of people who visit a website in a day follows a Poisson distribution with a mean of 500 people. What is the minimum number of people that we can expect to visit the website in a day with a probability of at least 95 %? To solve this problem using qpois , we can first find the Poisson distribution value that corresponds to a probability of 0.95 using the qpois function and the mean value of 500 : > qpois(0.95 , 500) [1] 537 This means that we can expect at least 537 people to visit the website in a day with a probability of at least 95 %
rpois rpois is a function in R that generates random numbers from a Poisson distribution with a specified mean. The function takes two arguments: the number of random numbers to generate (n) and the mean of the Poisson distribution (lambda ). rpois (n , lambda) n : number of random variables to generate lambda: mean of the Poisson distribution Example: suppose we want to generate 10 random numbers from a Poisson distribution with a mean of 5 > rpois(10, 5) [1] 5 3 8 2 6 6 4 2 3 8
The Normal Distribution In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution , is the most significant continuous probability distribution . The normal distribution is a bell-shaped, symmetrical distribution( the values to the left of the mean are a mirror image of the values to the right of the mean . ) in which the mean, median and mode are all equal. If the mean, median and mode are unequal, the distribution will be either positively or negatively skewed. A continuous random variable X having the bell-shaped distribution is called a normal random variable.
A random variable X is said to have a Normal distribution with parameters with mean μ and variance 2 if its probability density function is given by It is denoted by X ~ N (μ, 2 ) Where f(x) = frequency of random variable x = 3.14159; e = 2.71828 = population standard deviation ( > ) x = value of random variable -∞ < x <∞ µ = population mean( -∞< μ <∞ )
Properties of Normal Distribution It is a continuous distribution The normal distribution curve is bell-shaped. The mean, median, and mode are equal( Mean = Median = Mode = μ ) and located at the center of the distribution. The normal distribution curve is unimodal (single mode). The curve is symmetrical about the mean . ( ie ) Each half of the distribution is a mirror image of the other half. It is asymptotic to the horizontal axis. That is, it does not touch the x-axis and it goes on forever in each direction. The random variable 𝑥 can take any value from −∞ 𝑡𝑜 ∞ . The total area under the normal distribution curve is equal to 1 or 100 %.( ie )
Standard Normal Distribution The simplest case of a normal distribution is known as the standard normal distribution. This is a special case when µ=0 and = 1, and it is described by this probability density function. If X ~ N(µ, σ 2 ), let Z = (X - µ) / σ, [Z-transformation] then E(Z) = 0, V (Z) = 1. ( i.e )Z ~ N(0, 1), Z is said to have a standard normal distribution.
Probability Computations Related to Normal Distributions in R d norm: density function of the normal distribution p norm : cumulative density function of the normal distribution q norm: quantile function of the normal distribution r norm : random sampling from the normal distribution
Normal probabilities using dnorm () function in R dnorm The function dnorm returns the value of the probability density function (pdf) of the normal distribution given a certain random variable x , a population mean μ and population standard deviation σ . The syntax for using dnorm is as follows: dnorm (x, mean, sd ) Example: The GRE( Graduate Record Examinations ) is widely used to help predict the performance of applicants to graduate schools. The range of possible scores on a GRE is 200 to 900. The psychology department at a university finds that the students in their department have scores with a mean of 544 and standard deviation of 103. Find the value of the density function at x=550 > dnorm (550,544,103) [1] 0.00386666 Manual verification The value of the density function at x=550 is
Normal cumulative Density Function using pnorm () function in R The function pnorm returns the value of the Cumulative Density Function (CDF) of the normal distribution given a certain random variable q , a population mean μ and population standard deviation σ . The syntax for using pnorm is as follows: pnorm (q, mean, sd ) ( ie ) pnorm is the cumulative density function for the normal distribution. By definition pnorm (x) = P(X ≤ x) Example : The GRE(Graduate Record Examinations ) is widely used to help predict the performance of applicants to graduate schools. The range of possible scores on a GRE is 200 to 900. The psychology department at a university finds that the students in their department have scores with a mean of 544 and standard deviation of 103. Find the probability that a student in psychology department has a score less than 480 we need to find the probability P(X≤480) > pnorm (480,544,103) [1] 0.2671816
Normal Distribution Quantiles using qnorm () in R qnorm The function qnorm returns the value of the inverse cumulative density function ( cdf ) of the normal distribution given a certain random variable p , a population mean μ and population standard deviation σ . The syntax for using qnorm is as follows: qnorm (p, mean, sd ) qnorm is the inverse function for pnorm. Example: Suppose that the heights of a certain population follow a normal distribution with a mean of 170 cm and a standard deviation of 5 cm. What is the height below which 90% of the population lies ? > qnorm (0.9,170,5) [1] 176.4078 So the height below which 90% of the population lies is approximately 178.16 cm.
Simulating Normal random variable using rnorm () function in R rnorm is a function in R that generates random numbers from a normal distribution . rnorm (n , mean, sd ) This function generates n random numbers from Normal distribution with given mean and sd rnorm generates random values from a standard normal distribution. The required argument is a number specifying the number of normal variates to produce. Example: generate 10 random numbers from a normal distribution with a mean of 5 and a standard deviation of 2 : > rnorm( 10 , 5, 2) [1] 7.448164 5.719628 5.801543 5.221365 3.888318 8.573826 5.995701 1.066766 [9] 6.402712 4.054417