Introduction to Statistics Presentation.pptx

AniqaZai1 · 83 slides · May 08, 2024

About This Presentation

This presentation introduces statistics and data analysis using statistical methods. It covers populations, sampling, and the statistical tests used to study them.


Slide Content

INTRODUCTION TO STATISTICS

Definition of Statistics The term statistics refers to a set of mathematical procedures for organizing, summarizing, and interpreting information. Statistical procedures help ensure that the information or observations are presented and interpreted in an accurate and informative way. In somewhat grandiose terms, statistics help researchers bring order out of chaos. In addition, statistics provide researchers with a set of standardized techniques that are recognized and understood throughout the scientific community.

Population and Sample A population is the set of all the individuals of interest in a particular study. As you can well imagine, a population can be quite large, for example, the entire set of women on the planet Earth. A researcher might be more specific, limiting the population for study to women who are registered voters in the United States. A sample is a set of individuals selected from a population, usually intended to represent the population in a research study. Just as we saw with populations, samples can vary in size. For example, one study might examine a sample of only 10 students in a graduate program and another study might use a sample of more than 10,000 people who take a specific cholesterol medication.


Variable and Data A variable is a characteristic or condition that changes or has different values for different individuals. Variables can be characteristics that differ from one individual to another, such as height, weight, gender, or personality. Variables can also be environmental conditions that change, such as temperature, time of day, or the size of the room in which the research is being conducted. Data (plural) are measurements or observations. A data set is a collection of measurements or observations. A datum (singular) is a single measurement or observation and is commonly called a score or raw score.

Parameters and Statistics A parameter is a value, usually a numerical value, that describes a population. A parameter is usually derived from measurements of the individuals in the population. For example, the average length of all butterflies is a parameter because it states something about the entire population of butterflies. A statistic is a value, usually a numerical value, that describes a sample. A statistic is usually derived from measurements of the individuals in the sample. For example, the parameter may be the average height of 25-year-old men in North America. The heights of the members of a sample of 100 such men are measured; the average of those 100 numbers is a statistic.

Descriptive and Inferential Statistical Methods Descriptive statistics are statistical procedures used to summarize, organize, and simplify data. They take raw scores and organize or summarize them in a form that is more manageable. Often the scores are organized in a table or a graph so that it is possible to see the entire set of scores. Another common technique is to summarize a set of scores by computing an average. Inferential statistics consist of techniques that allow us to study samples and then make generalizations about the populations from which they were selected. Because populations are typically very large, it usually is not possible to measure everyone in the population. Therefore, a sample is selected to represent the population.

Sampling error is the naturally occurring discrepancy, or error, that exists between a sample statistic and the corresponding population parameter.

Constructs and Operational Definitions Constructs are internal attributes or characteristics that cannot be directly observed but are useful for describing and explaining behaviour. Constructs exist at a higher level of abstraction than concepts. Justice, Beauty, Happiness, and Health are all constructs. An operational definition identifies a measurement procedure (a set of operations) for measuring an external behaviour and uses the resulting measurements as a definition and a measurement of a hypothetical construct. Note that an operational definition has two components. First, it describes a set of operations for measuring a construct. Second, it defines the construct in terms of the resulting measurements.

Discrete and Continuous Variables A discrete variable consists of separate, indivisible categories. No values can exist between two neighbouring categories. Discrete variables are commonly restricted to whole, countable numbers: for example, the number of children in a family or the number of students attending class. A discrete variable may also consist of observations that differ qualitatively. For example, people can be classified by gender (male or female) or by occupation (nurse, teacher, lawyer, etc.). For a continuous variable, there are an infinite number of possible values that fall between any two observed values. A continuous variable is divisible into an infinite number of fractional parts. For example, two people who both claim to weigh 150 pounds are probably not exactly the same weight. However, they are both around 150 pounds. One person may actually weigh 149.6 pounds and the other 150.3. Thus, a score of 150 is not a specific point on the scale but instead is an interval on the scale.

Scales of Measurement

Nominal Scale and Ordinal Scale A nominal scale consists of a set of categories that have different names. Measurements on a nominal scale label and categorize observations, but do not make any quantitative distinctions between observations. For example, the rooms or offices in a building may be identified by numbers. An ordinal scale consists of a set of categories that are organized in an ordered sequence. Measurements on an ordinal scale rank observations in terms of size or magnitude. Often, an ordinal scale consists of a series of ranks (first, second, third, and so on) like the order of finish in a horse race. Occasionally, the categories are identified by verbal labels like small, medium, and large drink sizes at a fast-food restaurant.

Interval Scale and Ratio Scale An interval scale consists of ordered categories that are all intervals of exactly the same size. Equal differences between numbers on the scale reflect equal differences in magnitude. For example, you know that a measurement of 80° Fahrenheit is higher than a measurement of 60°, and you know that it is exactly 20° higher. However, the zero point on an interval scale is arbitrary and does not indicate a zero amount of the variable being measured, so 80°F is not twice as hot as 40°F. A ratio scale is an interval scale with the additional feature of an absolute zero point. With a ratio scale, ratios of numbers do reflect ratios of magnitude: an object weighing 20 pounds is twice as heavy as one weighing 10 pounds.

Shape of Frequency Distribution In a symmetrical distribution, it is possible to draw a vertical line through the middle so that one side of the distribution is a mirror image of the other. In a skewed distribution, the scores tend to pile up toward one end of the scale and taper off gradually at the other end. The section where the scores taper off toward one end of a distribution is called the tail of the distribution. A skewed distribution with the tail on the right-hand side is positively skewed because the tail points toward the positive (above-zero) end of the X-axis. If the tail points to the left, the distribution is negatively skewed.


Introduction to Measures of Central Tendency Mean The mean for a distribution is the sum of the scores divided by the number of scores. Median If the scores in a distribution are listed in order from smallest to largest, the median is the midpoint of the list. More specifically, the median is the point on the measurement scale below which 50% of the scores in the distribution are located. Mode The mode is the score or category that has the greatest frequency in the distribution.
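As a minimal illustration of these three measures, the following Python sketch computes them with the standard library; the scores are invented for illustration:

```python
# A minimal sketch of the three measures of central tendency; the scores
# are hypothetical illustration data, not from the slides.
from statistics import mean, median, mode

scores = [3, 5, 8, 8, 9, 11, 12]  # N = 7 scores, listed smallest to largest

print(mean(scores))    # sum of scores / number of scores -> 8.0
print(median(scores))  # midpoint of the ordered list -> 8
print(mode(scores))    # most frequently occurring score -> 8
```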

Exercise related to Central Tendency

Introduction to Variability Variability provides a quantitative measure of the differences between scores in a distribution and describes the degree to which the scores are spread out or clustered together. The range is the distance covered by the scores in a distribution, from the smallest score to the largest score. Deviation is distance from the mean: deviation score = X − μ.

SS, or sum of squares, is the sum of the squared deviation scores: SS = Σ(X − μ)². Variance is the mean of the squared deviations, σ² = SS/N; it is the average squared distance from the mean. Standard deviation is the square root of the variance, σ = √(SS/N), and provides a measure of the standard, or average, distance from the mean.
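A minimal sketch of these definitions for a small population of hypothetical scores, following the formulas above:

```python
# SS, variance, and standard deviation for a small population (N = 4);
# the scores are made up for illustration.
scores = [1, 0, 6, 1]
N = len(scores)
mu = sum(scores) / N                     # population mean: 8 / 4 = 2.0

deviations = [x - mu for x in scores]    # X - mu for each score
SS = sum(d ** 2 for d in deviations)     # sum of squared deviations -> 22.0
variance = SS / N                        # mean squared deviation -> 5.5
std_dev = variance ** 0.5                # square root of the variance -> 2.345...

print(SS, variance, std_dev)
```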


Exercise related to Variability

Introduction to z-Scores The z-score definition is adequate for transforming back and forth from X values to z-scores as long as the arithmetic is easy to do in your head. z-scores are often used in academic settings to analyze how well a student's score compares to the mean score on a given exam. For example, suppose the scores on a certain college entrance exam are roughly normally distributed with a mean of 82 and a standard deviation of 5. For more complicated values, it is best to have an equation to help structure the calculations. Fortunately, the relationship between X values and z-scores is easily expressed in a formula. The formula for transforming scores into z-scores is z = (X − μ) / σ.

The numerator of the equation, X − μ, is a deviation score. It measures the distance in points between X and μ and indicates whether X is located above or below the mean. The deviation score is then divided by σ because we want the z-score to measure distance in terms of standard deviation units. The formula performs exactly the same arithmetic that is used with the z-score definition, and it provides a structured equation to organize the calculations when the numbers are more difficult.
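A sketch of the transformation in Python, using the hypothetical exam parameters mentioned above (μ = 82, σ = 5):

```python
# z = (X - mu) / sigma: distance of X from the mean in standard-deviation units.
mu, sigma = 82, 5   # hypothetical exam mean and standard deviation from above

def z_score(x, mu, sigma):
    """Transform a raw score X into a z-score."""
    return (x - mu) / sigma

print(z_score(87, mu, sigma))   # +1.0: one standard deviation above the mean
print(z_score(72, mu, sigma))   # -2.0: two standard deviations below the mean
```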


Exercise related to z-score

Introduction to Hypothesis Testing A hypothesis test is a statistical method that uses sample data to evaluate a hypothesis about a population. The Four Steps of a Hypothesis Test STEP 1 State the hypothesis. As the name implies, the process of hypothesis testing begins by stating a hypothesis about the unknown population. Actually, we state two opposing hypotheses. Notice that both hypotheses are stated in terms of population parameters.

The first and most important of the two hypotheses is called the null hypothesis. The null hypothesis states that the treatment has no effect. The null hypothesis is identified by the symbol H₀. The null hypothesis (H₀) states that in the general population there is no change, no difference, or no relationship. In the context of an experiment, H₀ predicts that the independent variable (treatment) has no effect on the dependent variable (scores) for the population.

The second hypothesis is simply the opposite of the null hypothesis, and it is called the scientific, or alternative, hypothesis (H₁). The alternative hypothesis (H₁) states that there is a change, a difference, or a relationship for the general population. In the context of an experiment, H₁ predicts that the independent variable (treatment) does have an effect on the dependent variable.


Type I & Type II Errors

A Type I error occurs when a researcher rejects a null hypothesis that is actually true. In a typical research situation, a Type I error means the researcher concludes that a treatment does have an effect when in fact it has no effect. A Type II error occurs when a researcher fails to reject a null hypothesis that is really false. In a typical research situation, a Type II error means that the hypothesis test has failed to detect a real treatment effect.
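As an illustration not taken from the slides, a small simulation sketch shows why the alpha level (introduced in the next step) is the Type I error rate: when H₀ is actually true, a test at α = .05 rejects about 5% of the time. The population parameters, sample size, and number of trials are arbitrary choices; scipy and numpy are assumed available.

```python
# Simulate many one-sample t tests where H0 is true (mu really is 50).
# The fraction of rejections approximates alpha, the Type I error rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials, rejections = 0.05, 10_000, 0

for _ in range(trials):
    sample = rng.normal(loc=50, scale=10, size=25)       # H0 is true
    p = stats.ttest_1samp(sample, popmean=50).pvalue
    if p < alpha:
        rejections += 1                                  # a Type I error

print(rejections / trials)   # close to 0.05
```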

STEP 2 Set the criteria for a decision. Eventually the researcher will use the data from the sample to evaluate the credibility of the null hypothesis. The data will either provide support for the null hypothesis or tend to refute the null hypothesis. The Alpha Level To find the boundaries that separate the high-probability samples from the low-probability samples, we must define exactly what is meant by “low” probability and “high” probability. This is accomplished by selecting a specific probability value, which is known as the level of significance, or the alpha level, for the hypothesis test. The alpha (α) value is a small probability that is used to identify the low-probability samples. By convention, commonly used alpha levels are α = .05 (5%), α = .01 (1%), and α = .001 (0.1%).

The extremely unlikely values, as defined by the alpha level, make up what is called the critical region. The alpha level, or the level of significance, is a probability value that is used to define the concept of “very unlikely” in a hypothesis test. The critical region is composed of the extreme sample values that are very unlikely (as defined by the alpha level) to be obtained if the null hypothesis is true. The boundaries for the critical region are determined by the alpha level. If sample data fall in the critical region, the null hypothesis is rejected.

The Boundaries for the Critical Region To determine the exact location for the boundaries that define the critical region, we use the alpha-level probability and the unit normal table. In most cases, the distribution of sample means is normal, and the unit normal table provides the precise z-score location for the critical region boundaries.
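A sketch of looking up those boundaries with scipy's normal distribution instead of a printed unit normal table; the alpha levels are the conventional ones listed above:

```python
# Two-tailed critical-region boundaries: put alpha/2 in each tail and find
# the z-score that cuts off that proportion of the normal distribution.
from scipy import stats

for alpha in (0.05, 0.01, 0.001):
    z_crit = stats.norm.ppf(1 - alpha / 2)   # inverse of the unit normal table
    print(f"alpha = {alpha}: reject H0 if |z| > {z_crit:.3f}")

# alpha = 0.05  -> |z| > 1.960
# alpha = 0.01  -> |z| > 2.576
# alpha = 0.001 -> |z| > 3.291
```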

Degrees of freedom describe the number of scores in a sample that are independent and free to vary. Because the sample mean places a restriction on the value of one score in the sample, there are n − 1 degrees of freedom for a sample with n scores. For a sample of n scores, the degrees of freedom, or df, for the sample variance are defined as df = n − 1.

The Unit Normal Table The graph shows proportions for only a few selected z-score values. A more complete listing of z-scores and proportions is provided in the unit normal table. This table lists proportions of the normal distribution for a full range of possible z-score values. (Figure: a normal distribution following a z-score transformation.)


STEP 3 Collect data and compute sample statistics. The data are as given, so all that remains is to compute the statistic.

STEP 4 Make a decision. The sample data are located in the critical region. By definition, a sample value in the critical region is very unlikely to occur if the null hypothesis is true. Therefore, we conclude that the sample is not consistent with H₀ and our decision is to reject the null hypothesis. Remember, the null hypothesis states that there is no treatment effect, so rejecting H₀ means we are concluding that the treatment did have an effect.

Introduction to the t Statistic The t statistic is used to test hypotheses about an unknown population mean, μ, when the value of σ is unknown. The formula for the t statistic has the same structure as the z-score formula, except that the t statistic uses the estimated standard error in the denominator: t = (M − μ) / s_M.

The estimated standard error (s_M) is used as an estimate of the real standard error σ_M when the value of σ is unknown. It is computed from the sample variance or sample standard deviation, s_M = √(s²/n), and provides an estimate of the standard distance between a sample mean M and the population mean μ.
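A sketch of the single-sample t computation following these formulas, with a hypothetical sample and a hypothetical H₀ value of μ; the scipy call at the end is a cross-check:

```python
# Single-sample t statistic: t = (M - mu) / s_M, with s_M = sqrt(s^2 / n).
import math
from scipy import stats

sample = [9, 11, 8, 12, 10, 14, 9, 11]           # hypothetical scores
n = len(sample)
M = sum(sample) / n                              # sample mean -> 10.5
s2 = sum((x - M) ** 2 for x in sample) / (n - 1) # sample variance, df = n - 1
s_M = math.sqrt(s2 / n)                          # estimated standard error

mu = 9                                           # value of mu stated by H0
t = (M - mu) / s_M
print(t)                                         # about 2.20

print(stats.ttest_1samp(sample, popmean=mu))     # same t, plus a p-value
```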

Exercise related to t test


The t Test for Two Independent Samples A research design that uses a separate group of participants for each treatment condition (or for each population) is called an independent-measures research design or a between-subjects design.

The Estimated Standard Error In each of the t-score formulas, the standard error in the denominator measures how accurately the sample statistic represents the population parameter. In the single-sample t formula, the standard error measures the amount of error expected for a sample mean and is represented by the symbol s_M. For the independent-measures t formula, the standard error measures the amount of error that is expected when you use a sample mean difference (M₁ − M₂) to represent a population mean difference (μ₁ − μ₂). The standard error for the sample mean difference is represented by the symbol s_(M₁−M₂).

Pooled Variance One method for correcting the bias in the standard error is to combine the two sample variances into a single value called the pooled variance. The pooled variance is obtained by averaging or “pooling” the two sample variances using a procedure that allows the bigger sample to carry more weight in determining the final value. For the independent-measures t statistic, there are two SS values and two df values (one from each sample). The values from the two samples are combined to compute the pooled variance: s²_p = (SS₁ + SS₂) / (df₁ + df₂).
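A sketch of the pooled-variance calculation and the independent-measures t statistic for two hypothetical groups; scipy's ttest_ind (which pools variances by default) serves as a cross-check:

```python
# Independent-measures t: pool the two sample variances, then compute
# t = (M1 - M2) / s_(M1-M2). The two groups are hypothetical data.
from scipy import stats

g1 = [4, 6, 5, 7, 8]        # treatment condition 1
g2 = [9, 8, 11, 10, 12]     # treatment condition 2

def ss(xs):
    """Sum of squared deviations from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

df1, df2 = len(g1) - 1, len(g2) - 1
sp2 = (ss(g1) + ss(g2)) / (df1 + df2)           # pooled variance
se = (sp2 / len(g1) + sp2 / len(g2)) ** 0.5     # standard error of M1 - M2

M1, M2 = sum(g1) / len(g1), sum(g2) / len(g2)
t = (M1 - M2) / se
print(t)                                        # -4.0

print(stats.ttest_ind(g1, g2))                  # same pooled-variance test
```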


Exercise related to independent sample t test


Introduction to Repeated-Measures Designs A repeated-measures design, or a within-subject design, is one in which the dependent variable is measured two or more times for each individual in a single sample. The same group of subjects is used in all of the treatment conditions. In a repeated-measures design or a matched-subjects design comparing two treatment conditions, the data consist of two sets of scores, grouped into pairs corresponding to the two scores obtained for each individual or each matched pair of subjects.
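A sketch of a repeated-measures comparison with hypothetical before/after scores for the same six subjects; the paired t test operates on each subject's difference score:

```python
# Repeated-measures (paired) t test: each subject contributes a score in
# both conditions, so the test is based on the difference scores D.
from scipy import stats

before = [12, 15, 11, 14, 13, 16]   # first measurement for each subject
after  = [10, 14, 10, 11, 12, 13]   # second measurement, same subjects

print(stats.ttest_rel(after, before))   # t statistic and p-value
```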

Exercise related to dependent sample t test


Analysis of Variance (ANOVA) Analysis of variance (ANOVA) is a hypothesis-testing procedure that is used to evaluate mean differences between two or more treatments (or populations). As with all inferential procedures, ANOVA uses sample data as the basis for drawing general conclusions about populations. It may appear that ANOVA and t tests are simply two different ways of doing exactly the same job: testing for mean differences. In some respects, this is true: both tests use sample data to test hypotheses about population means. However, ANOVA has a tremendous advantage over t tests. Specifically, t tests are limited to situations in which there are only two treatments to compare, whereas ANOVA can be used to compare two or more treatments.

For ANOVA, the two hypotheses are as follows. H₀: there really are no differences between the populations (or treatments); the observed differences between the sample means are caused by random, unsystematic factors (sampling error) that differentiate one sample from another. H₁: the populations (or treatments) really do have different means, and these population mean differences are responsible for causing systematic differences between the sample means.

In analysis of variance, the variable (independent or quasi-independent) that designates the groups being compared is called a factor. The individual conditions or values that make up a factor are called the levels of the factor.


The Distribution of F-Ratios For ANOVA, we expect F near 1.00 if H₀ is true. An F-ratio that is much larger than 1.00 is an indication that H₀ is not true. In the F distribution, we need to separate those values that are reasonably near 1.00 from the values that are significantly greater than 1.00.
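A sketch of a one-way ANOVA across three hypothetical treatment conditions; scipy returns the F-ratio and its p-value, and an F much larger than 1.00 is evidence against H₀:

```python
# One-way ANOVA comparing the means of three treatment conditions.
from scipy import stats

t1 = [4, 3, 6, 3, 4]    # hypothetical scores, treatment 1
t2 = [0, 1, 3, 1, 0]    # treatment 2
t3 = [1, 2, 2, 0, 0]    # treatment 3

F, p = stats.f_oneway(t1, t2, t3)
print(F, p)             # reject H0 if p falls below the chosen alpha level
```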


Exercise related to ANOVA


The Pearson Correlation The Pearson correlation measures the degree and the direction of the linear relationship between two variables. The Pearson correlation for a sample is identified by the letter r. The corresponding correlation for the entire population is identified by the Greek letter rho (ρ), which is the Greek equivalent of the letter r.


The sum of products of deviations, or SP, is similar to SS (the sum of squared deviations), which is used to measure variability for a single variable. SP measures the amount of co-variability between two variables: SP = Σ(X − M_X)(Y − M_Y), and the Pearson correlation is computed as r = SP / √(SS_X · SS_Y). In general, the squared correlation (r²) measures the gain in accuracy that is obtained from using the correlation for prediction. The squared correlation measures the proportion of variability in the data that is explained by the relationship between X and Y. It is sometimes called the coefficient of determination.

The value r² is called the coefficient of determination because it measures the proportion of variability in one variable that can be determined from the relationship with the other variable. A correlation of r = 0.80 (or −0.80), for example, means that r² = 0.64, so 64% of the variability in the Y scores can be predicted from the relationship with X.
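A sketch of the Pearson correlation from SP, SS_X, and SS_Y as defined above, with hypothetical X, Y pairs; scipy's pearsonr is a cross-check that also supplies a significance test:

```python
# Pearson correlation r = SP / sqrt(SS_X * SS_Y) and the coefficient of
# determination r^2; the (X, Y) pairs are hypothetical data.
from scipy import stats

X = [1, 2, 3, 4, 5, 6]
Y = [2, 4, 5, 4, 6, 7]

mx, my = sum(X) / len(X), sum(Y) / len(Y)
SP   = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # co-variability
SS_X = sum((x - mx) ** 2 for x in X)
SS_Y = sum((y - my) ** 2 for y in Y)

r = SP / (SS_X * SS_Y) ** 0.5
print(r, r ** 2)            # r about 0.92; r^2 is the proportion explained

print(stats.pearsonr(X, Y)) # same r, plus a p-value
```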

Exercise related to Correlation


For further detail, consult this book.
