Inferential Statistics-Part-I mtech.pptx

ShaktikantGiri1 68 views 48 slides Mar 11, 2025

About This Presentation

For mtech mining


Slide Content

Inferential Statistics: Why do you need inferential statistics? It makes an inference about the population from a sample. It allows us to use the information learned from descriptive statistics. It extends beyond the immediate data. It is used to infer from sample data what the population might think. It is used to judge whether an observed difference between groups is a dependable one or one that might have happened by chance in the study.

Inferential Statistics: Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample drawn from that population. Statistical inference has two general areas: estimation and hypothesis testing. The estimation process uses sample data to calculate a statistic that serves as an approximation of the corresponding parameter of the population from which the sample was drawn. Types of estimate: A point estimate is a single numerical value used to estimate the corresponding population parameter. An interval estimate consists of two numerical values defining a range of values that, with a specified degree of confidence, most likely includes the parameter being estimated.

Inferential Statistics: A single computed value is referred to as an estimate. The rule that tells us how to compute this value, or estimate, is referred to as an estimator. Estimators are usually presented as formulas. For example, $\bar{x} = \sum x_i / n$ is an estimator of the population mean, µ. The single numerical value that results from evaluating this formula is called an estimate of the parameter µ. In many cases, a parameter may be estimated by more than one estimator. One of the desired properties of a good estimator is unbiasedness. An estimator, say T, of the parameter θ is said to be an unbiased estimator of θ if E(T) = θ.

Inferential Statistics: The sampled population is the population from which one actually draws a sample. The target population is the population about which one wishes to make an inference. Only when the target population and the sampled population are the same is it possible to use statistical inference procedures to reach conclusions about the target population. If the sampled population and the target population are different, the researcher can reach conclusions about the target population only on the basis of nonstatistical considerations, such as extrapolating the findings to the target population.

Inferential Statistics: CONFIDENCE INTERVAL FOR A POPULATION MEAN. Suppose a researcher wishes to estimate the mean of some normally distributed population. Draw a random sample of size n from the normally distributed population and compute $\bar{x}$, which is used as a point estimate of µ. Because random sampling inherently involves chance, $\bar{x}$ cannot be expected to be equal to µ. It would be much more meaningful, therefore, to estimate µ by an interval that somehow communicates information regarding the probable magnitude of µ.

Inferential Statistics: CONFIDENCE INTERVAL FOR A POPULATION MEAN. If sampling is from a normally distributed population, the sampling distribution of the sample mean $\bar{x}$ will be normally distributed with a mean equal to the population mean µ and a variance equal to $\sigma^2/n$. From our knowledge of normal distributions, approximately 95% of the possible values of $\bar{x}$ constituting the distribution are within two standard deviations of the mean. The two points that lie two standard deviations from the mean are $\mu - 2\sigma_{\bar{x}}$ and $\mu + 2\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, so the interval $\mu \pm 2\sigma_{\bar{x}}$ contains approximately 95% of the possible values of $\bar{x}$.

Inferential Statistics: Example: Suppose a researcher, interested in obtaining an estimate of the average level of some enzyme in a certain human population, takes a sample of 10 individuals, determines the level of the enzyme in each, and computes the sample mean $\bar{x}$. Suppose further it is known that the variable of interest is approximately normally distributed with a variance of 45. We wish to estimate µ.

Inferential Statistics: An approximate 95% confidence interval for µ is given by $\bar{x} \pm 2\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \sqrt{\sigma^2/n} = \sqrt{45/10} \approx 2.1213$. Interval estimate components: The interval estimate contains in its centre the point estimate of µ. The 2 we recognize as a value from the standard normal distribution that tells us within how many standard errors lie approximately 95% of the possible values of $\bar{x}$. This value of z is referred to as the reliability coefficient. The last component, $\sigma_{\bar{x}}$, is the standard error, or standard deviation of the sampling distribution of $\bar{x}$. In general, then, an interval estimate may be expressed as follows: estimator ± (reliability coefficient) × (standard error). When sampling from a normal distribution with known variance, a 100(1-α) percent interval estimate for µ is $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$, where $z_{1-\alpha/2}$ is the value of z to the left of which lies $1-\alpha/2$ and to the right of which lies $\alpha/2$ of the area under its curve.

Inferential Statistics: Interpreting Confidence Intervals: How do we interpret the interval given by the expression $\bar{x} \pm 2\sigma_{\bar{x}}$? In the present example the reliability coefficient is equal to 2. We say that in repeated sampling approximately 95% of the intervals constructed by the above expression will include the population mean. This interpretation is based on the probability of occurrence of different values of $\bar{x}$. We may generalize this interpretation by designating the total area under the curve of $\bar{x}$ that is outside the interval as α and the area within the interval as 1-α.

Inferential Statistics: Probabilistic Interpretation: In repeated sampling from a normally distributed population with a known standard deviation, 100(1-α) percent of all intervals of the form $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$ will in the long run include the population mean µ. The quantity (1-α), in this case 0.95, is called the confidence coefficient (or confidence level), and the interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$ is called a confidence interval for µ. When (1-α) = 0.95, the interval is called the 95% confidence interval for µ. Practical Interpretation: When sampling is from a normally distributed population with known standard deviation, we are 100(1-α) percent confident that the single computed interval, $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$, contains the population mean. Precision: The quantity obtained by multiplying the reliability factor by the standard error of the mean is called the precision of the estimate. This quantity is also called the margin of error.
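
The probabilistic interpretation can be checked by simulation. The sketch below is only an illustration: the population values (µ = 50, σ = 10), the sample size, the number of trials and the seed are all assumed for the demonstration and do not come from the slides. It draws many samples from a normal population with known σ, builds the interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ each time, and reports the fraction of intervals that contain µ.

```python
import random
from statistics import NormalDist


def coverage(mu, sigma, n, conf=0.95, trials=2000, seed=1):
    """Fraction of z-intervals (known sigma) that contain the true mean mu."""
    rng = random.Random(seed)                      # fixed seed: repeatable run
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # reliability coefficient
    se = sigma / n ** 0.5                          # standard error of the mean
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - z * se <= mu <= xbar + z * se:
            hits += 1
    return hits / trials


# Hypothetical population parameters, chosen only for illustration.
rate = coverage(mu=50.0, sigma=10.0, n=25)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

The printed fraction should be close to the confidence coefficient 0.95, though not exactly equal to it, since the number of simulated samples is finite.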

Inferential Statistics: Sampling from Nonnormal Populations: As noted, it will not always be possible or prudent to assume that the population of interest is normally distributed. Thanks to the central limit theorem, this will not deter us if we are able to select a large enough sample. We have learned that for large samples, the sampling distribution of $\bar{x}$ is approximately normally distributed regardless of how the parent population is distributed. Example: Punctuality of patients in keeping appointments is of interest to a research team. In a study of patient flow through the offices of general practitioners, it was found that the patients in a sample of 35 were 17.2 minutes late for appointments, on average. Previous research had shown the standard deviation to be about 8 minutes. The population distribution was felt to be nonnormal. What is the 90% confidence interval for µ, the true mean amount of time late for appointments?

Inferential Statistics: Since the sample size is fairly large (greater than 30), and since the population standard deviation is known, we draw on the central limit theorem and assume the sampling distribution of $\bar{x}$ to be approximately normally distributed. From the normal distribution table, we find the reliability coefficient corresponding to a confidence coefficient of 0.90 to be about 1.645. The standard error is $\sigma_{\bar{x}} = 8/\sqrt{35} = 1.3522$. Therefore, the 90 percent confidence interval for µ is $17.2 \pm 1.645(1.3522) = 17.2 \pm 2.2$, that is, from 15.0 to 19.4. Frequently, when the sample is large enough for the application of the central limit theorem, the population variance is unknown. In that case, we use the sample variance as a replacement for the unknown population variance in the formula for constructing a confidence interval for the population mean.
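
As a rough check of the arithmetic in the appointments example, the following sketch computes the interval with Python's standard library. The helper name `z_interval` is introduced here for illustration only; the reliability coefficient is taken from `statistics.NormalDist` rather than read from a printed table, so it is 1.6449 rather than the rounded 1.645.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, conf):
    """Confidence interval for a mean when sigma is known (or n is large)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{1 - alpha/2}
    se = sigma / sqrt(n)                          # standard error of the mean
    margin = z * se                               # precision / margin of error
    return xbar - margin, xbar + margin, se


# Values from the example: n = 35, mean 17.2 minutes, sigma about 8 minutes.
lower, upper, se = z_interval(xbar=17.2, sigma=8.0, n=35, conf=0.90)
print(f"standard error = {se:.4f}")
print(f"90% CI = ({lower:.1f}, {upper:.1f})")
```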

t-distribution: In practice one usually knows neither the population mean nor the population variance, and so cannot use the statistic $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$ to construct confidence intervals for the mean. Although this statistic is normally distributed when the population is normally distributed, and is at least approximately normally distributed when n is large regardless of the functional form of the population, we cannot make use of this fact because σ is unknown.

t-distribution: The most logical solution is to use the sample standard deviation (SD), $s = \sqrt{\sum (x_i - \bar{x})^2/(n-1)}$, as an approximation of σ. This is justifiable when n ≥ 30, as is the use of normal distribution theory to construct a confidence interval for µ. When we have a small sample, an alternative procedure for constructing a confidence interval is the use of Student's t distribution, or simply the t distribution, which is the distribution of the quantity $t = (\bar{x} - \mu)/(s/\sqrt{n})$. William S. Gosset was a statistician employed by the Guinness brewing company, which had stipulated that he could not publish under his own name. He wrote under the pseudonym "Student", hence the name Student's t distribution.

t-distribution: Properties of the t distribution: It has a mean of 0 and is symmetrical about the mean. In general, it has a variance greater than 1, but the variance approaches 1 as the sample size becomes large. For degrees of freedom (df) > 2, the variance of the t distribution is df/(df-2); since here df = n - 1, for n > 3 we may write the variance of the t distribution as (n-1)/(n-3). The variable t ranges from -∞ to +∞. Compared to the normal distribution, the t distribution is less peaked in the centre and has higher tails. The t distribution approaches the normal distribution as n - 1 approaches infinity.

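t-distribution: Confidence interval using t: The procedure is the same as before; only the source of the reliability coefficient is different. When sampling is from a normal distribution whose standard deviation, σ, is unknown, the 100(1-α) percent confidence interval for the population mean µ is given by $\bar{x} \pm t_{1-\alpha/2}\, s/\sqrt{n}$, where $t_{1-\alpha/2}$ is taken from the t distribution with n - 1 degrees of freedom. Applicability: the procedure is strictly valid only when the sample is drawn from a normal distribution.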

t-distribution: Applicability: When to use the t distribution: The t distribution can be used with any statistic having a bell-shaped (approximately normal) sampling distribution. The sampling distribution of a statistic can be taken to be bell-shaped if any of the following conditions applies: The population distribution is normal. The population distribution is symmetric, unimodal, and without outliers, and the sample size is at least 30. The population distribution is moderately skewed, unimodal, and without outliers, and the sample size is at least 40. The sample size is greater than 40 and there are no outliers. The t distribution should not be used with small samples from populations that are not approximately normal.

t-distribution: Example: A researcher conducted a study to evaluate the effect of on-the-job body mechanics instruction on the work performance of newly employed young workers. He used two randomly selected groups of subjects, an experimental group and a control group. The experimental group received one hour of back-school training provided by an occupational therapist. The control group did not receive this training. A criterion-referenced body mechanics evaluation checklist was used to evaluate each worker's lifting, lowering, pulling and transferring of objects in the work environment. A correctly performed task received a score of 1. The 15 control subjects made a mean score of 11.53 on the evaluation with a standard deviation of 3.681. We assume that these 15 controls behave as a random sample from a population of similar subjects. We wish to use these sample data to estimate the mean score for the population.

Standard normal and t-distribution Tables

t-distribution: Example: Ans: We may use the sample mean 11.53 as a point estimate of the population mean but, since the population standard deviation is unknown, we must assume the population of values to be at least approximately normally distributed before constructing a confidence interval for µ. Let us assume that such an assumption is reasonable and that a 95% confidence interval is desired. We have our estimator $\bar{x}$, and our standard error is $s/\sqrt{n} = 3.681/\sqrt{15} = 0.9504$. We need now to find the reliability coefficient, the value of t associated with a confidence coefficient of 0.95 and n - 1 = 14 degrees of freedom. Since a 95% confidence interval leaves 0.05 of the area under the curve of t to be equally divided between the two tails, we need the value of t to the right of which lies 0.025 of the area. From the t distribution table at 14 degrees of freedom, $t_{0.975} = 2.1448$. Hence the 95% confidence interval is 11.53 ± 2.1448 × (0.9504) = 11.53 ± 2.04 = [9.49, 13.57]. This interval may be interpreted from both the probabilistic and practical points of view. We are 95% confident that the true population mean µ is somewhere between 9.49 and 13.57 because, in repeated sampling, 95% of intervals constructed in like manner will include µ.
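
The calculation in this example can be reproduced with a few lines of Python. This is a minimal sketch: the helper `t_interval` is a name introduced here, and the reliability coefficient 2.1448 is typed in from the t table value quoted in the example, because Python's standard library has no t-distribution quantile function.

```python
from math import sqrt


def t_interval(xbar, s, n, t_crit):
    """Confidence interval for a mean using a t reliability coefficient."""
    se = s / sqrt(n)          # estimated standard error of the mean
    margin = t_crit * se
    return xbar - margin, xbar + margin, se


# t_{0.975} with 14 degrees of freedom, as quoted from the t table.
T_0975_DF14 = 2.1448

lower, upper, se = t_interval(xbar=11.53, s=3.681, n=15, t_crit=T_0975_DF14)
print(f"standard error = {se:.4f}")
print(f"95% CI = [{lower:.2f}, {upper:.2f}]")
```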

Deciding between z and t When we construct a confidence interval for a population mean, we must decide whether to use a value of z or a value of t as the reliability factor. To make an appropriate choice we must consider sample size, whether the sampled population is normally distributed, and whether the population variance is known.

Confidence interval for the difference between two population means: A confidence interval for the difference between population means provides information that is helpful in deciding whether or not it is likely that the two population means are equal. When the constructed interval does not include zero, we say that the interval provides evidence that the two population means are not equal. When the interval includes zero, we say that the population means may be equal. Sampling from Normal Populations with known variances: When the population variances are known, the 100(1-α)% confidence interval for µ1 - µ2 is given by $(\bar{x}_1 - \bar{x}_2) \pm z_{1-\alpha/2}\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$. Sampling from Non-normal Populations: The construction of a confidence interval for the difference between two population means when sampling is from non-normal populations proceeds in the same manner as sampling from normal populations if the sample sizes n1 and n2 are large. Again, this is a result of the central limit theorem. If the population variances are unknown, we use the sample variances to estimate them.

Confidence interval for the difference between two population means Example: Despite common knowledge of the adverse effects of doing so, many women continue to smoke while pregnant. A researcher examined the effectiveness of a smoking cessation program for pregnant women. The mean number of cigarettes smoked daily at the close of the program by the 328 women who completed the program was 4.3 with a standard deviation of 5.22. Among 64 women who did not complete the program, the mean number of cigarettes smoked per day at the close of the program was 13 with a standard deviation of 8.97. We wish to construct a 99 percent confidence interval for the difference between the means of the populations from which the samples may be presumed to have been selected.

Confidence interval for the difference between two population means: No information is given regarding the shape of the distribution of cigarettes smoked per day. Since our sample sizes are large, however, the central limit theorem assures us that the sampling distribution of the difference between sample means will be approximately normally distributed even if the distribution of the variable in the populations is not normally distributed. We may use this fact as justification for using the z statistic as the reliability factor in the construction of our confidence interval. Also, since the population standard deviations are not given, we will use the sample standard deviations to estimate them. The point estimate for the difference between population means is the difference between sample means, 4.3 - 13.0 = -8.7. From the normal distribution table, we find the reliability factor to be 2.58. The estimated standard error is $\sqrt{5.22^2/328 + 8.97^2/64} = 1.1577$. Our 99 percent confidence interval for the difference between population means is -8.7 ± 2.58(1.1577) = [-11.7, -5.7]. We are 99 percent confident that the mean number of cigarettes smoked per day for women who complete the program is between 5.7 and 11.7 lower than the mean for women who do not complete the program.
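
A short sketch of the same computation follows. The function name `two_sample_z_interval` is introduced here for illustration; the reliability factor comes from `statistics.NormalDist` (about 2.5758) instead of the rounded table value 2.58, which changes the limits only in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_interval(x1, s1, n1, x2, s2, n2, conf):
    """Large-sample CI for mu1 - mu2, with sample SDs estimating sigma."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # reliability factor
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)        # estimated standard error
    diff = x1 - x2                                # point estimate
    return diff - z * se, diff + z * se, se


# Completers: n = 328, mean 4.3, SD 5.22; non-completers: n = 64, mean 13, SD 8.97.
lower, upper, se = two_sample_z_interval(4.3, 5.22, 328, 13.0, 8.97, 64, conf=0.99)
print(f"standard error = {se:.4f}")
print(f"99% CI = [{lower:.1f}, {upper:.1f}]")
```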

Confidence interval for the difference between two population means: The t distribution and the difference between means: When the population variances are unknown and we wish to estimate the difference between two population means with a confidence interval, we can use the t distribution as a source of the reliability factor if certain assumptions are met. We must know, or be willing to assume, that the two sampled populations are normally distributed. Regarding the unknown population variances, two situations may occur: Situation–I: population variances are equal; Situation–II: population variances are not equal.

Situation–I: population variances are equal: If the assumption of equal population variances is justified, the two sample variances may be considered estimates of the same quantity, the common variance. We therefore obtain a pooled estimate of the common variance. The pooled variance is the weighted average of the sample variances, with each sample variance weighted by its degrees of freedom: $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$. The 100(1-α) percent confidence interval for µ1 - µ2 is $(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2}\sqrt{s_p^2/n_1 + s_p^2/n_2}$. The number of degrees of freedom used in determining the value of t is n1 + n2 – 2.

Example: population variances are equal: The purpose of a research study was to determine the effect of long-term exercise intervention on corporate executives enrolled in a supervised fitness program. Data were collected on 13 subjects (the exercise group) who voluntarily entered a supervised exercise program and remained active for an average of 13 years and 17 subjects (the sedentary group) who elected not to join the fitness program. Among the data collected on the subjects was the maximum number of sit-ups completed in 30 seconds. The exercise group had a mean and standard deviation for this variable of 21.0 and 4.9 respectively. The mean and standard deviation for the sedentary group were 12.1 and 5.6 respectively. We assume that the two populations of overall muscle condition measures are approximately normally distributed and that the two population variances are equal. We wish to construct a 95% confidence interval for the difference between the means of the populations represented by these two samples.

Example: population variances are equal: The pooled estimate of the common population variance is $s_p^2 = \frac{(13-1)(4.9)^2 + (17-1)(5.6)^2}{13+17-2} = 28.21$. From the t distribution table with 13 + 17 - 2 = 28 degrees of freedom and a desired confidence level of 0.95, the reliability factor is 2.0484. Confidence interval = $(21.0 - 12.1) \pm 2.0484\sqrt{28.21/13 + 28.21/17} = 8.9 \pm 4.0085 = (4.9, 12.9)$. We are 95% confident that the difference between the population means is somewhere between 4.9 and 12.9. We can say this because we know that if we were to repeat the study many, many times and compute confidence intervals in the same way, about 95% of the intervals would include the difference between the population means. Since the interval does not include zero, we conclude that the population means are not equal. We can interpret this interval as follows: the difference between the two population means is estimated to be 8.9, and we are 95% confident that the true value lies between 4.9 and 12.9.
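
The pooled-variance interval can be sketched in code as follows. The helper `pooled_t_interval` is a hypothetical name, and the reliability factor 2.0484 is entered by hand from the t table value quoted in the example (28 degrees of freedom).

```python
from math import sqrt


def pooled_t_interval(x1, s1, n1, x2, s2, n2, t_crit):
    """CI for mu1 - mu2 assuming equal population variances."""
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = sqrt(sp2 / n1 + sp2 / n2)
    diff = x1 - x2
    margin = t_crit * se
    return diff - margin, diff + margin, sp2


# Exercise group: n = 13, mean 21.0, SD 4.9; sedentary group: n = 17, mean 12.1, SD 5.6.
T_0975_DF28 = 2.0484  # table value quoted in the example
lower, upper, sp2 = pooled_t_interval(21.0, 4.9, 13, 12.1, 5.6, 17, T_0975_DF28)
print(f"pooled variance = {sp2:.2f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f})")
```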

Confidence interval for the difference between two population means: Population variances are not equal: When one is unable to conclude that the variances of two populations of interest are equal, even though the two populations may be assumed to be normally distributed, it is not proper to use the t distribution in the way just described. Solutions have been proposed by many researchers. The problem revolves around the fact that the quantity $\frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ does not follow a t distribution with n1 + n2 - 2 degrees of freedom when the population variances are not equal.

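Confidence interval for the difference between two population means: Population variances are not equal: The solution proposed by Cochran consists of computing the reliability factor by the following formula: $t'_{1-\alpha/2} = \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2}$, where $w_1 = s_1^2/n_1$, $w_2 = s_2^2/n_2$, $t_1 = t_{1-\alpha/2}$ for n1 - 1 degrees of freedom, and $t_2 = t_{1-\alpha/2}$ for n2 - 1 degrees of freedom. An approximate 100(1-α) percent confidence interval for µ1 - µ2 is given by $(\bar{x}_1 - \bar{x}_2) \pm t'_{1-\alpha/2}\sqrt{s_1^2/n_1 + s_2^2/n_2}$.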

Example: population variances are not equal: The purpose of a research study was to determine the effect of long-term exercise intervention on corporate executives enrolled in a supervised fitness program. Data were collected on 13 subjects (the exercise group) who voluntarily entered a supervised exercise program and remained active for an average of 13 years and 17 subjects (the sedentary group) who elected not to join the fitness program. Among the data collected on the subjects was the maximum number of sit-ups completed in 30 seconds. The exercise group had a mean and standard deviation for this variable of 21.0 and 4.9 respectively. The mean and standard deviation for the sedentary group were 12.1 and 5.6 respectively. We assume that the two populations of overall muscle condition measures are approximately normally distributed and that the two population variances are not equal. We wish to construct a 95% confidence interval for the difference between the means of the populations represented by these two samples.

Example: population variances are not equal: We will use Cochran's reliability factor t'. From the t distribution table with 12 degrees of freedom and 1 - α/2 = 0.975, $t_1 = 2.1788$. Similarly, with 16 degrees of freedom and 1 - α/2 = 0.975, $t_2 = 2.1199$. We now compute $t' = \frac{(4.9^2/13)(2.1788) + (5.6^2/17)(2.1199)}{4.9^2/13 + 5.6^2/17} = \frac{7.9347}{3.6916} = 2.1494$. We now construct the 95 percent confidence interval for the difference between the two population means: $(21.0 - 12.1) \pm 2.1494\sqrt{4.9^2/13 + 5.6^2/17} = 8.9 \pm 2.1494(1.9214) = 8.9 \pm 4.13 = (4.77, 13.03)$. Since the interval does not include zero, we conclude that the population means are not equal. We can interpret this interval as follows: the difference between the two population means is estimated to be 8.9, and we are 95% confident that the true value lies between 4.77 and 13.03.
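
Cochran's weighted reliability factor is easy to mis-compute by hand, so a small sketch may help. The function name `cochran_interval` is introduced here; the two t values (2.1788 for 12 degrees of freedom and 2.1199 for 16 degrees of freedom, both at 0.975) are standard table values entered by hand and should be treated as inputs, not as something the code derives. With these inputs the interval works out to roughly 8.9 ± 4.13.

```python
from math import sqrt


def cochran_interval(x1, s1, n1, t1, x2, s2, n2, t2):
    """Approximate CI for mu1 - mu2 when population variances are unequal."""
    w1 = s1 ** 2 / n1                              # weight for sample 1
    w2 = s2 ** 2 / n2                              # weight for sample 2
    t_prime = (w1 * t1 + w2 * t2) / (w1 + w2)      # Cochran's reliability factor
    se = sqrt(w1 + w2)                             # estimated standard error
    diff = x1 - x2
    return diff - t_prime * se, diff + t_prime * se, t_prime


# Table values (assumed inputs): t_{0.975} with 12 df and with 16 df.
T1, T2 = 2.1788, 2.1199
lower, upper, t_prime = cochran_interval(21.0, 4.9, 13, T1, 12.1, 5.6, 17, T2)
print(f"t' = {t_prime:.4f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```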

Determination of sample size for estimating mean: In planning any survey or experiment, the question arises: how large a sample should we take? A sample larger than needed is wasteful of resources. A sample smaller than needed leads to results of no practical use. Objective: the interval estimate should have 1) a narrow interval and 2) high reliability. The total width of the interval is twice the magnitude of the quantity (reliability coefficient) × (standard error). For a fixed standard error, increasing reliability means a larger reliability coefficient, which widens the interval.

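Determination of sample size for estimating mean: If we fix the reliability coefficient, the only way to narrow the interval is to reduce the standard error. Since the standard error is $\sigma/\sqrt{n}$ and σ is fixed, the only way to do so is to increase n, that is, to take a larger sample. How large? That depends on σ, the desired degree of reliability and the desired interval width. Let d denote the desired half-width of the interval. When sampling with replacement, or from an infinite or sufficiently large population, setting $d = z\,\sigma/\sqrt{n}$ gives the formula $n = z^2\sigma^2/d^2$. When sampling without replacement from a small finite population of size N, setting $d = z\,\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$ gives the formula $n = \frac{N z^2\sigma^2}{d^2(N-1) + z^2\sigma^2}$.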
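
The two sample-size formulas can be sketched as small functions. All numbers in the usage lines (σ = 20, half-width d = 5, 95% confidence, population size N = 1000) are hypothetical values chosen for illustration and do not come from the slides; the function names are likewise introduced here. Results are rounded up, since a sample size must be a whole number at least as large as the computed value.

```python
from math import ceil
from statistics import NormalDist


def sample_size_infinite(sigma, d, conf):
    """n = z^2 sigma^2 / d^2 (large population or sampling with replacement)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * sigma ** 2 / d ** 2)


def sample_size_finite(sigma, d, conf, N):
    """n = N z^2 sigma^2 / (d^2 (N - 1) + z^2 sigma^2) (small finite population)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    numerator = N * z ** 2 * sigma ** 2
    denominator = d ** 2 * (N - 1) + z ** 2 * sigma ** 2
    return ceil(numerator / denominator)


# Hypothetical planning values, for illustration only.
n_infinite = sample_size_infinite(sigma=20.0, d=5.0, conf=0.95)
n_finite = sample_size_finite(sigma=20.0, d=5.0, conf=0.95, N=1000)
print(n_infinite, n_finite)
```

The finite-population formula never asks for more observations than the large-population formula, because the finite population correction shrinks the standard error.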

Determination of sample size for estimating mean: The sample size formulas require $\sigma^2$, but the population variance is usually unknown. The most frequently used sources of estimates of $\sigma^2$ are: A pilot or preliminary sample may be drawn from the population, and the computed sample variance ($s^2$) may be used to estimate $\sigma^2$. Observations used in the pilot sample may be counted as part of the final sample, so that n (the computed sample size) - n1 (the pilot sample size) = n2 (the number of observations still needed to satisfy the total sample size requirement). Estimates of $\sigma^2$ may be available from previous or similar studies.

Inference about population variance: Confidence interval for the variance of a normally distributed population: When the distribution is normal, the sample variance is used as an estimator of the population variance. To judge the quality of this estimator, we check whether the sample variance is an unbiased estimator of the population variance. To be unbiased, the average value of the sample variance over all possible samples must be equal to the population variance: $E(s^2) = \sigma^2$.

Inference about population variance: Draw all possible samples of size two from the population consisting of the values 6, 8, 10, 12 and 14. If we compute the sample variance $s^2 = \sum (x_i - \bar{x})^2/(n-1)$ for each of the possible samples, we obtain the sample variances shown in the table. Sampling with replacement: $E(s^2)$, the mean of the sample variances, is (0 + 2 + ... + 2 + 0)/25 = 200/25 = 8. The population variance is $\sigma^2 = \sum (x_i - \mu)^2/N = 40/5 = 8$. Hence $E(s^2) = \sigma^2$.
Sample variances for samples of size 2 (rows: first draw; columns: second draw)
First draw |  6 |  8 | 10 | 12 | 14
6          |  0 |  2 |  8 | 18 | 32
8          |  2 |  0 |  2 |  8 | 18
10         |  8 |  2 |  0 |  2 |  8
12         | 18 |  8 |  2 |  0 |  2
14         | 32 | 18 |  8 |  2 |  0
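
The enumeration of all samples of size two can be verified by brute force. The sketch below lists every ordered pair (sampling with replacement) and every unordered pair of distinct values (sampling without replacement), averages the sample variances, and compares the averages with $\sigma^2 = \sum (x_i-\mu)^2/N$ and $S^2 = \sum (x_i-\mu)^2/(N-1)$. The variable names are introduced here for illustration.

```python
from itertools import combinations, product

population = [6, 8, 10, 12, 14]


def sample_variance(sample):
    """Sample variance with divisor n - 1."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / (n - 1)


# All 25 ordered samples (with replacement) and all 10 unordered samples
# of distinct values (without replacement).
with_repl = [sample_variance(s) for s in product(population, repeat=2)]
without_repl = [sample_variance(s) for s in combinations(population, 2)]

mean_with = sum(with_repl) / len(with_repl)
mean_without = sum(without_repl) / len(without_repl)

N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N        # divisor N
S2 = sum((x - mu) ** 2 for x in population) / (N - 1)      # divisor N - 1

print(f"with replacement:    E(s^2) = {mean_with},  sigma^2 = {sigma2}")
print(f"without replacement: E(s^2) = {mean_without}, S^2 = {S2}")
```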

Inference about population variance: Sampling without replacement: the expected value of $s^2$ is the mean of the variances above (or below) the diagonal of the table, (2 + 8 + ... + 2)/10 = 100/10 = 10, which equals $S^2 = \sum (x_i - \mu)^2/(N-1)$. Thus $E(s^2) = \sigma^2$ when sampling is with replacement, and $E(s^2) = S^2$ when sampling is without replacement. These results justify the use of n - 1 as the divisor when computing the sample variance. Interval estimation of a population variance: Success depends on our ability to find an appropriate sampling distribution. Confidence intervals for $\sigma^2$ are usually based on the sampling distribution of $(n-1)s^2/\sigma^2$. If samples of size n are drawn from a normally distributed population, this quantity has a distribution known as the chi-square distribution with n - 1 degrees of freedom. To obtain a 100(1-α)% confidence interval for $\sigma^2$, we select two values of $\chi^2$ from the table in such a way that α/2 is to the left of the smaller value, $\chi^2_{\alpha/2}$, and α/2 is to the right of the larger value, $\chi^2_{1-\alpha/2}$.

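Inference about population variance: The 100(1-α)% confidence interval for $\sigma^2$ is $\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{\alpha/2}}$. The confidence interval for σ, the population standard deviation, is $\sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}}$. The method is widely used but has some drawbacks. Normality of the population is crucial. The estimator is not in the centre of the confidence interval because the chi-square distribution is not symmetric.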

Inference about population variance: A random sample of 20 steel ball bearings with a nominal diameter of 2 mm is taken and the diameters are measured precisely. The measurements, in mm, are as follows: 2.02, 1.94, 2.09, 1.95, 1.98, 2.00, 2.03, 2.04, 2.08, 2.07, 1.99, 1.96, 1.99, 1.95, 1.99, 1.99, 2.03, 2.05, 2.01, 2.03. Assuming that the diameters are normally distributed with unknown mean µ and unknown variance $\sigma^2$: (a) find a two-sided 95% confidence interval for the variance $\sigma^2$; (b) find a two-sided 95% confidence interval for the standard deviation σ.

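Inference about population variance: From the data we calculate $\sum x_i = 40.19$ and $\sum x_i^2 = 80.7977$, and n = 20. Hence, $(n-1)s^2 = 80.7977 - 40.19^2/20 = 0.035895$. There are 19 degrees of freedom, and the critical values of the chi-square distribution are $\chi^2_{0.025} = 8.907$ and $\chi^2_{0.975} = 32.852$. The confidence interval for $\sigma^2$ is $0.035895/32.852 < \sigma^2 < 0.035895/8.907$, that is, $0.00109 < \sigma^2 < 0.00403$ (in mm²). The confidence interval for σ is $0.033 < \sigma < 0.063$ (in mm).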
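
The ball-bearing calculation can be reproduced directly from the listed measurements. In this sketch the two chi-square critical values for 19 degrees of freedom (8.907 and 32.852) are standard table values entered by hand, an assumption of the example, since Python's standard library does not provide chi-square quantiles.

```python
from math import sqrt

diameters = [
    2.02, 1.94, 2.09, 1.95, 1.98, 2.00, 2.03, 2.04, 2.08, 2.07,
    1.99, 1.96, 1.99, 1.95, 1.99, 1.99, 2.03, 2.05, 2.01, 2.03,
]

# Chi-square table values for 19 degrees of freedom (entered by hand).
CHI2_LOWER = 8.907    # area 0.025 to the left
CHI2_UPPER = 32.852   # area 0.025 to the right

n = len(diameters)
xbar = sum(diameters) / n
ss = sum((x - xbar) ** 2 for x in diameters)   # equals (n - 1) * s^2

var_lower = ss / CHI2_UPPER    # the larger critical value gives the lower limit
var_upper = ss / CHI2_LOWER
sd_lower, sd_upper = sqrt(var_lower), sqrt(var_upper)

print(f"(n-1)s^2 = {ss:.6f}")
print(f"95% CI for variance: ({var_lower:.5f}, {var_upper:.5f}) mm^2")
print(f"95% CI for std dev:  ({sd_lower:.3f}, {sd_upper:.3f}) mm")
```

Note that the point estimate $s^2 = 0.035895/19 \approx 0.00189$ does not sit midway between the limits, which reflects the asymmetry of the chi-square distribution.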

Inference about population variances: Confidence interval for the ratio of the variances of two normally distributed populations: One way of comparing two variances is to compute their ratio, $\sigma_1^2/\sigma_2^2$. Using the t distribution to construct a confidence interval for the difference between population means requires that the population variances be equal. If the two variances are equal, the ratio will be one. If the confidence interval for the ratio of two population variances includes 1, we conclude that the two population variances may, in fact, be equal. This is a form of inference, and we must rely on some sampling distribution. This time the distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is utilised, provided the following assumptions are met. Assumptions: $s_1^2$ and $s_2^2$ are computed from independent samples of size n1 and n2. The samples have been drawn from two normally distributed populations. Let $F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}$. If the assumptions are met, F follows an F distribution.

Inference about population variances: F-distribution: If U and V are independent chi-square random variables with r1 and r2 degrees of freedom, respectively, then $F = \frac{U/r_1}{V/r_2}$ follows an F distribution with r1 numerator degrees of freedom and r2 denominator degrees of freedom. This F distribution depends on two degrees-of-freedom values, one corresponding to the value of n1 – 1 used in computing $s_1^2$, and the other corresponding to the value of n2 – 1 used in computing $s_2^2$. They are usually referred to as the numerator degrees of freedom and the denominator degrees of freedom. A confidence interval for $\sigma_1^2/\sigma_2^2$ at 100(1-α)% confidence is constructed from two values: $F_{\alpha/2}$, the value from the F table to the left of which lies α/2 of the area, and $F_{1-\alpha/2}$, the value from the F table to the right of which lies α/2 of the area. The interval is $\frac{s_1^2/s_2^2}{F_{1-\alpha/2}} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2/s_2^2}{F_{\alpha/2}}$.

Inference about population variances: The value of $F_{1-\alpha/2}$ is found at the intersection of the column headed df1 and the row labelled df2. If we had a more extensive table of the F distribution, finding $F_{\alpha/2}$ would be no trouble, but including every possible percentile of F would make a very lengthy table. Instead, a relationship exists to compute the lower value: $F_{\alpha/2,\,df_1,\,df_2} = \frac{1}{F_{1-\alpha/2,\,df_2,\,df_1}}$.

Inference about population variances: Problem: The variability in the thickness of oxide layers in semiconductor wafers is a critical characteristic, where low variability is desirable. A company is investigating two different ways to mix gases so as to reduce the variability of the oxide thickness. We produce 16 wafers with each gas mixture, and our results indicate that the standard deviations are s1 = 1.96 Å and s2 = 2.13 Å for the two mixtures. What is the 95% confidence interval for the ratio between the two variances?

Inference about population variances: Given information: Sample size from population 1: n1 = 16; sample standard deviation for the sample from population 1: s1 = 1.96. Sample size from population 2: n2 = 16; sample standard deviation for the sample from population 2: s2 = 2.13. Since we are looking for a 95% confidence interval, we need two F values, $F_{0.025}$ and $F_{0.975}$, each with 15 numerator and 15 denominator degrees of freedom.
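
A possible completion of the wafer problem is sketched below. It assumes the standard table value $F_{0.975} \approx 2.86$ for 15 and 15 degrees of freedom, entered by hand because Python's standard library has no F quantile function, and obtains the lower value from the reciprocal relationship between F percentiles. The numeric limits are therefore only as accurate as that rounded table value.

```python
# Sample information from the problem statement.
n1, s1 = 16, 1.96
n2, s2 = 16, 2.13

# Assumed table value: F_{0.975} with (15, 15) degrees of freedom.
F_UPPER = 2.86
# Reciprocal relationship; swapping the degrees of freedom changes nothing
# here because both samples have 15 degrees of freedom.
F_LOWER = 1 / F_UPPER

ratio = s1 ** 2 / s2 ** 2        # point estimate of sigma1^2 / sigma2^2
lower = ratio / F_UPPER          # lower confidence limit
upper = ratio / F_LOWER          # upper confidence limit

print(f"s1^2 / s2^2 = {ratio:.4f}")
print(f"95% CI for sigma1^2 / sigma2^2: ({lower:.3f}, {upper:.2f})")
```

Under these assumptions the interval contains 1, so the data would not rule out equal variances for the two gas mixtures.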

Inference about population variances  