Complete Biostatistics (Descriptive and Inferential analysis)

DrAbdiwaliMohamedAbd 104 views 178 slides Sep 29, 2024

About This Presentation

This presentation covers both elementary and advanced biostatistics, including descriptive and inferential analysis.


Slide Content

Abdiweli Mohamed Abdi Biostatistics

Table of contents
Section I
Chapter 1: Introduction to Biostatistics
Chapter 2: Measures of location
Chapter 3: Measures of dispersion
Chapter 4: Collection and organization of data
Chapter 5: Visualization and presentation of data
Chapter 6: Probability and Normal distribution of data
Section II
Chapter 7: Hypothesis and significance testing
Chapter 8: Comparing the significance of two sample and three sample means (z-test, t-test and ANOVA)
Chapter 9: Association, correlation and regression
Chapter 10: Estimation

Chapter One: Introduction to Biostatistics
Statistics is a branch of mathematics used for the collection, analysis and interpretation of data.
Biostatistics is a branch of statistics used for the collection, analysis and interpretation of biological data.
Thursday, September 26, 2024

Branches of statistics by field of use
Health issues: Biostatistics
Agricultural sector: Agri-statistics
Business administration: Business statistics
Industrial sector: Industrial statistics
Insurance: Actuarial statistics
Economic sector: Economic statistics

Types of Biostatistics
Descriptive
- Measures of location: measures of central tendency, measures of other location
- Measures of dispersion: range, variance, standard deviation, coefficient of variation
Inferential
- Estimation: point estimation, interval estimation
- Hypothesis testing: z-test, t-test, ANOVA, χ² test, correlation, regression

Descriptive Statistics
Describes the characteristics of data from a sample.
Ex: mean, standard deviation, frequency, and percentage.

Inferential Statistics
Uses data from a sample to draw conclusions about the population.
Ex: the prevalence of malaria among a sample of 150 pregnant women is 40%. Can we estimate the prevalence of malaria in the population?

Why do medicine and health science students need to learn biostatistics?
To be able to:
- Conduct research
- Identify health problems
- Monitor and evaluate health programs

Variables
A variable is any characteristic that differs between individuals, times or places.
Examples of variables: (1) Number of patients (2) Height (3) Sex (4) Educational level

Types of Statistical Variables
1. Quantitative/Numerical variable: a characteristic that can be measured in numbers.
Examples: (i) Family size (ii) Number of patients (iii) Weight (iv) Height (v) Age

Types of Quantitative Variable
(a) Discrete variables: quantitative variables that take whole-number values (no decimals), with gaps between possible values.
Examples: family size, number of patients, number of students, parity, gravidity.
(b) Continuous variables: quantitative variables that can take decimal values, with no gaps between possible values.
Examples: height, weight, income, blood sugar level, creatinine level.

2. Qualitative/Categorical variable: a characteristic whose values can be divided into categories. No numbers!
Examples: blood type, nationality, students' grades, educational level, etc.

Variable
Type: Qualitative or Quantitative
Nature (quantitative variables): Discrete (whole number) or Continuous (decimal)

Scales of Measurement
Nominal scale: implies name only; no order or rank is involved. E.g. sex, blood type, institutional departments, nationality.
Ordinal scale: implies name and order or rank. E.g. educational level, military rank, students' grades.
Interval scale: zero does not imply absence of the characteristic; the interval between two values is meaningful. E.g. temperature and pH.
Ratio scale: zero implies absence of the characteristic; both the interval and the ratio between two values are meaningful. E.g. height, weight, age, income.

Variable
Type: Qualitative. Scale: Nominal or Ordinal
Type: Quantitative. Scale: Interval or Ratio

Variables: types and scales of measurement

Population
A population is the largest collection of objects (elements or individuals) about which we want to draw conclusions. Populations may be finite or infinite.

Example: If we want to study the socio-demographic characteristics of students in a class, then our population consists of all the students in that class.

Population size (N): the number of elements in the population, denoted by N. Ex: 100 students in a class.
Sample: a part of the population from which we collect the data. Ex: 30 students out of 100 students in a class.

Population: described by a parameter
Sample: described by a statistic

Common statistical symbols
Title | Symbol
Sample mean | x̄
Population mean | μ
Sample standard deviation | s
Population standard deviation | σ
Sample variance | s²
Population variance | σ²
Summation | Σ
Correlation coefficient | r
Coefficient of determination | r²
Degrees of freedom | df

Title | Symbol
Chi-square value | χ²
Sample proportion | p
Population proportion | Π
Null hypothesis | H0
Alternative hypothesis | H1 or HA
Sample size | n
Type I error | α error
Type II error | β error
Power of the test | 1 - β

Chapter Two: Measures of Central Tendency and Measures of Other Location
A measure of central tendency is a single value around the center of the data that is used to represent the entire data set. In short, a measure of central tendency conveys a single piece of information about the entire data set.

The mean is not calculated from qualitative/categorical data; of the measures below, only the mode can be reported for categories.
Measures of central tendency include:
- Mean (average)
- Median
- Mode

Mean: the average
Median: the number in the middle, or the average of the two numbers in the middle
Mode: the number that occurs most often

Mean
The mean is the average of the data set. There are four types of mean:
- Arithmetic mean
- Harmonic mean
- Geometric mean
- Weighted mean

Arithmetic mean
The arithmetic mean is the most familiar measure of central tendency; it is what is usually meant by "average" or "mean". It is denoted by the symbol $\bar{x}$ (read as "x-bar").

Arithmetic mean formula: the sum of all observations divided by the total number of observations.
$\bar{x} = \frac{\sum x}{n}$, where $\sum x$ = sum of all observations and n = total number of observations.

Example-1
Suppose the pulse rates of 10 individuals were recorded as: 69, 70, 71, 71, 72, 72, 72, 75, 76, 74. Find the mean.
Solution
$\bar{x} = \frac{\sum x}{n} = \frac{722}{10} = 72.2$
$\bar{x}$ = 72.2 beats/minute

Example-2
The ages of 12 selected school and university students were 19, 18, 14, 13, 22, 25, 13, 22, 12, 18, 14, 16. What is the mean age of the selected students?
$\bar{x} = \frac{\sum x}{n} = \frac{206}{12} = 17.17$
$\bar{x}$ = 17.17 years
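The arithmetic mean formula can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` and the variable names are illustrative and not part of the slides. Note that the twelve ages in Example-2 sum to 206, so the code reports 206/12.

```python
from statistics import mean

def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

pulse_rates = [69, 70, 71, 71, 72, 72, 72, 75, 76, 74]    # Example-1
ages = [19, 18, 14, 13, 22, 25, 13, 22, 12, 18, 14, 16]   # Example-2

pulse_mean = arithmetic_mean(pulse_rates)   # 722 / 10
age_mean = arithmetic_mean(ages)            # 206 / 12

print(f"Mean pulse rate: {pulse_mean:.1f} beats/minute")
print(f"Mean age: {age_mean:.2f} years")

# The standard library function agrees with the hand-written formula.
assert abs(mean(pulse_rates) - pulse_mean) < 1e-9
```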

Advantages of mean
a) Easy to compute.
b) Takes all data values into account.
c) Reliable.
d) It can be calculated even if some values are zero or negative.
e) Arranging the data is not necessary.
Disadvantages of mean
a) Highly affected by extreme values.
b) Cannot be calculated for qualitative/categorical data.

Median
In an ordered array, the median is the "middle" number.
If n is odd, the median is the middle number.
If n is even, the median is the average of the two middle numbers.
The median is not affected by extreme values.

Procedure to find the median for raw data
i. Arrange the data in order.
ii. Find the middle value:
- For an odd number of values: the middle position is (n+1)/2.
- For an even number of values: the 1st middle position is n/2 and the 2nd middle position is (n/2)+1. Median = average of the 1st and 2nd middle values.

Example: Data: 4 3 7 4 6
1. Arranged in ascending order: 3 4 4 6 7
2. Since n is odd, the middle position = (n+1)/2 = (5+1)/2 = 3rd item.
The value of the 3rd item = 4, so Median = 4.

Example: x: 4 3 7 4 6 9
Arranged in ascending order: x: 3 4 4 6 7 9
1st middle item = 6/2 = 3rd item
2nd middle item = (6/2)+1 = 4th item
The values of the 3rd and 4th items are 4 and 6.
Median = average of 4 and 6 = (4+6)/2 = 5.
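The median procedure (order the data, then take the middle value or the average of the two middle values) can be sketched as follows. The function name `median_by_position` is hypothetical; Python's built-in `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_position(values):
    """Median via the (n+1)/2 rule for odd n and the n/2, (n/2)+1 rule for even n."""
    ordered = sorted(values)                 # step i: arrange in order
    n = len(ordered)
    if n % 2 == 1:
        middle = (n + 1) // 2                # 1-based position of the middle item
        return ordered[middle - 1]
    first = n // 2                           # 1-based position of the 1st middle item
    second = n // 2 + 1                      # 1-based position of the 2nd middle item
    return (ordered[first - 1] + ordered[second - 1]) / 2

odd_median = median_by_position([4, 3, 7, 4, 6])       # 3rd item of 3 4 4 6 7
even_median = median_by_position([4, 3, 7, 4, 6, 9])   # average of 4 and 6

print(odd_median, even_median)

# Cross-check against the standard library.
assert odd_median == median([4, 3, 7, 4, 6])
assert even_median == median([4, 3, 7, 4, 6, 9])
```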

Advantages of median
a) Easy to compute.
b) Not influenced by extreme values.
Disadvantages of median
a) Difficult to rank a large number of data values.

Mode
- A measure of central tendency
- The value that occurs most often
- Not affected by extreme values
- There may not be a mode
- There may be several modes

The mode is the value that occurs most often.
Example: find the mode of the data set 2, 3, 4, 3, 4, 5, 4.
Solution: the mode is 4 (it occurs three times).
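A short sketch of finding the mode by counting occurrences. The function name `modes` is illustrative, and treating a data set in which every value occurs exactly once as having no mode is a convention assumed here, not something the slides specify.

```python
from collections import Counter

def modes(values):
    """Return all values with the highest frequency (an empty list if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []          # assumed convention: no value repeats, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)

single_mode = modes([2, 3, 4, 3, 4, 5, 4])   # 4 occurs three times
several_modes = modes([1, 1, 2, 2, 3])       # 1 and 2 both occur twice
no_mode = modes([1, 2, 3])                   # every value occurs once

print(single_mode, several_modes, no_mode)
```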

Advantages of mode
a) Easy to locate and understand.
b) Not influenced by extreme values.
c) Is an actual value of the data.
Disadvantages of mode
a) Cannot always locate just one mode.
b) It does not depend on all observations of the data set.

Measures of Other Location

Percentiles
Percentiles are positional measures that indicate what percentage of the data set has a value less than a specified value when the data is divided into one hundred parts. Percentiles are not the same as percentages.
Position of $P_r = \frac{r(n+1)}{100}$, where r is the given percentile and n is the sample size.

Deciles
Deciles are another positional measure; they indicate how much of the data set has a value less than a specified value when the data is divided into ten parts.
Position of $D_r = \frac{r(n+1)}{10}$, where r is the given decile and n is the sample size.

Quartiles
Quartiles are another positional measure; they indicate how much of the data set has a value less than a specified value when the data is divided into four parts.
Position of $Q_r = \frac{r(n+1)}{4}$, where r is the given quartile (r=1 for Q1, r=2 for Q2 and r=3 for Q3) and n is the sample size.

Example
Calculate the 70th percentile, the 6th decile and Q3 of the following age data: 28, 17, 12, 25, 26, 19, 13, 27, 21, 16
Percentiles
n = 10, r = 70 (70th percentile)
First, order the data in ascending order: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Position of $P_{70} = \frac{r(n+1)}{100} = \frac{70(10+1)}{100} = 7.7$, i.e. the 7.7th position.

The 7.7th position lies between the 7th value (25) and the 8th value (26). To find the exact value for a fractional position, we use this formula:
P70 = selected value + decimal × (upper value - selected value) = 25 + 0.7 × (26 - 25) = 25.7
P70 = 25.7 means that 70% of the values lie below 25.7 and 30% of the values lie above 25.7.

Deciles
Ordered data: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Question: find the 6th decile.
Given: n = 10, r = 6
Solution: position of $D_6 = \frac{r(n+1)}{10} = \frac{6(10+1)}{10} = 6.6$, i.e. the 6.6th position.

The 6.6th position lies between the 6th value (21) and the 7th value (25). To find the exact value for a fractional position, we use this formula:
D6 = selected value + decimal × (upper value - selected value) = 21 + 0.6 × (25 - 21) = 23.4
Thus the 6th decile is 23.4. This means that six tenths (60%) of the data lie below 23.4.

Quartiles
Ordered data: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Question: find the 3rd quartile.
Given: n = 10, r = 3
Solution: position of $Q_3 = \frac{r(n+1)}{4} = \frac{3(10+1)}{4} = 8.25$, i.e. the 8.25th position.

The 8.25th position lies between the 8th value (26) and the 9th value (27). To find the exact value for a fractional position, we use this formula:
Q3 = selected value + decimal × (upper value - selected value) = 26 + 0.25 × (27 - 26) = 26.25
Thus Q3 = 26.25. This means that three quarters (75%) of the data lie below 26.25.
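The three worked examples share one recipe: compute the position r(n+1)/k (k = 100, 10 or 4), then interpolate between the two neighbouring ordered values. The sketch below follows that recipe. The function name `positional_measure` and the clamping of positions that fall outside the data are assumptions made for illustration; library routines such as `statistics.quantiles` may use other conventions and give slightly different answers.

```python
import math

def positional_measure(values, r, parts):
    """Value at position r(n+1)/parts in the ordered data, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = r * (n + 1) / parts           # 1-based, possibly fractional
    whole = math.floor(position)             # the "selected" position
    decimal = position - whole
    if whole < 1:                            # assumption: clamp below the first value
        return ordered[0]
    if whole >= n:                           # assumption: clamp above the last value
        return ordered[-1]
    selected = ordered[whole - 1]
    upper = ordered[whole]
    return selected + decimal * (upper - selected)

ages = [28, 17, 12, 25, 26, 19, 13, 27, 21, 16]

p70 = positional_measure(ages, 70, 100)   # position 7.7, between 25 and 26
d6 = positional_measure(ages, 6, 10)      # position 6.6, between 21 and 25
q3 = positional_measure(ages, 3, 4)       # position 8.25, between 26 and 27

print(round(p70, 2), round(d6, 2), round(q3, 2))
```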

Chapter Three: Measures of Dispersion

Measures of dispersion, or measures of variation, measure the variability that a set of observations exhibits. They measure how far the values spread out from each other. The variation is small when the values are close together. There is no dispersion (variation) if all the values are the same.

There are several measures of dispersion, some of which are:
- Range
- Variance
- Standard deviation
- Coefficient of variation

The range
The range is the difference between the largest value (maximum) and the smallest value (minimum).
Range (R) = Max - Min
Example: find the range for the sample values 26, 25, 35, 27, 29.

Solution
Max = 35, Min = 25
Range = 35 - 25 = 10
Notes:
- The unit of the range is the same as the unit of the data.
- The range is a poor measure of dispersion, as it takes into account only two values (Max and Min).

The Variance
The variance is one of the most important measures of dispersion. It uses the mean as its point of reference.
The sample variance is denoted by $S^2$:
$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

The population variance is denoted by $\sigma^2$:
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

Example
We want to compute the sample variance of the following weekly incomes of sampled health care workers: 10, 21, 33, 53, 54
Solution
n = 5
$\bar{x} = \frac{\sum x}{n} = \frac{10+21+33+53+54}{5} = \frac{171}{5} = 34.2$
Thus $\bar{x}$ = 34.2 USD/week

$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{1506.8}{5 - 1} = 376.7$
$x$ | $x - \bar{x}$ | $(x - \bar{x})^2$
10 | 10 - 34.2 = -24.2 | (-24.2)² = 585.64
21 | 21 - 34.2 = -13.2 | (-13.2)² = 174.24
33 | 33 - 34.2 = -1.2 | (-1.2)² = 1.44
53 | 53 - 34.2 = 18.8 | (18.8)² = 353.44
54 | 54 - 34.2 = 19.8 | (19.8)² = 392.04
$\sum x$ = 171 | $\sum (x - \bar{x})$ = 0 | $\sum (x - \bar{x})^2$ = 1506.8

Standard Deviation
The standard deviation is another measure of dispersion. It is the square root of the variance.
Population standard deviation: σ = √σ²
Sample standard deviation: S = √S²

Example
We want to compute the sample standard deviation of the following weekly incomes of sampled health care workers: 10, 21, 33, 53, 54
Solution
n = 5, S² = 376.7
S = √S² = √376.7 = 19.41
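The variance and standard deviation of the income example can be reproduced step by step. This is a sketch with illustrative variable names; the `statistics.variance` and `statistics.stdev` functions (which use the n - 1 denominator) serve only as a cross-check.

```python
import math
from statistics import variance, stdev

incomes = [10, 21, 33, 53, 54]      # weekly incomes in USD

n = len(incomes)
x_bar = sum(incomes) / n                                  # 171 / 5
squared_deviations = [(x - x_bar) ** 2 for x in incomes]
sum_of_squares = sum(squared_deviations)                  # about 1506.8
sample_variance = sum_of_squares / (n - 1)                # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(f"Mean = {x_bar:.1f}")
print(f"Sample variance = {sample_variance:.1f}")
print(f"Sample standard deviation = {sample_sd:.2f}")

# Cross-check against the standard library.
assert abs(variance(incomes) - sample_variance) < 1e-9
assert abs(stdev(incomes) - sample_sd) < 1e-9
```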

Coefficient of variation
The variance and standard deviation are useful as measures of variation of the values of a single variable for a single population. If we want to compare the variation of two variables, we cannot use the variance or the standard deviation because:
- The variables might have different units.
- The variables might have different means.

We need a measure of relative variation that does not depend on either the units or on how large the values are. This measure is the coefficient of variation (C.V.).
$C.V. = \frac{S}{\bar{x}} \times 100$

Example
Compare the variability of the weights of two groups.
Group | Mean | SD | C.V.
1st group | 66 kg | 4.5 kg | 6.8%
2nd group | 36 kg | 4.5 kg | 12.5%
C.V.1 = (4.5/66) × 100 = 6.8%
C.V.2 = (4.5/36) × 100 = 12.5%
Since C.V.2 > C.V.1, the relative variability of the 2nd group is larger than the relative variability of the 1st group.
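A minimal sketch of the coefficient of variation comparison. The function name `coefficient_of_variation` is illustrative, and the second group's mean is taken as 36 kg, which is the value the 12.5% result implies.

```python
def coefficient_of_variation(sd, mean):
    """Relative variation: standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_group1 = coefficient_of_variation(sd=4.5, mean=66)   # 1st group, weights in kg
cv_group2 = coefficient_of_variation(sd=4.5, mean=36)   # 2nd group, mean assumed to be in kg

print(f"C.V.1 = {cv_group1:.1f}%")
print(f"C.V.2 = {cv_group2:.1f}%")

more_variable = "2nd group" if cv_group2 > cv_group1 else "1st group"
print(f"Larger relative variability: {more_variable}")
```

Both groups have the same standard deviation, so the difference in C.V. comes entirely from the difference in means.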

Exercise 1
A student was asked to report the results of the 5 subjects he/she took last semester, and the data were as follows: 80, 71, 63, 53, 54
Calculate the: 1] Range 2] Variance 3] Standard deviation

Exercise 2
Let us compare the exam results of 2 groups.
The 1st group: mean exam result = 75, standard deviation = 7.5
The 2nd group: mean exam result = 80, standard deviation = 9
Calculate and compare the relative variability (C.V.) of the results of the 2 groups.

Chapter 4: Collection and Organization of Data
Data: raw, unorganized facts that need to be processed. When data is processed to make it useful, it is called information.

Types of data

Primary data: data collected firsthand by the researcher.

Primary data collection methods
- Interviews
- Observations
- Focus group discussions
- Blood, body fluid, urine and feces samples
- Imaging (X-ray, US, CT, MRI)

Common primary data collection tools
1. Questionnaires
2. Google Forms
3. KoboToolbox

Secondary data: data that has been collected by someone else or by an institution.

Secondary data sources
- Journals
- Books
- Magazines
- Newspapers
- Libraries
- Websites
- Medical records

Organizing data in an Array (Ordered Array)
A first step in organizing data is the preparation of an ordered array. An ordered array is a listing of the values of the data in order of magnitude, from the smallest value to the largest value.

Ex: the following data on the ages of 6 individuals are arranged in an array: 55 46 58 54 52 69
Ascending form: 46 52 54 55 58 69
Descending form: 69 58 55 54 52 46

Frequency Distribution
The most convenient method of organizing data is to construct a frequency distribution. A frequency distribution is the organization of raw data in table form, using classes and frequencies.

Grouped Frequency Distributions
When the range of the data is large, the data must be grouped into classes.
Class boundary: the value midway between the upper limit of one class and the lower limit of the next class. It is found by computing the adjustment (lower limit of a class - upper limit of the previous class) / 2, then subtracting it from each lower limit and adding it to each upper limit.
The difference between the two boundaries of a class gives the class width. The class width is also called the class size.

Finding Class Width
Class width = Upper boundary - Lower boundary
Calculating Class Midpoint or Mark
Class midpoint or mark = $\frac{\text{Lower limit} + \text{Upper limit}}{2}$

Example: The following table gives the weekly earnings of 100 employees of a large company. The first column lists the classes, which represent the (quantitative) variable weekly income.

Weekly income in USD | Number of employees (Freq)
801-1000 | 9
1001-1200 | 22
1201-1400 | 39
1401-1600 | 15
1601-1800 | 9
1801-2000 | 6

Calculate the class boundaries, class widths, and class midpoints for the above data.
Solution:
Adjustment = (lower limit of a class - upper limit of the previous class) / 2 = (1001 - 1000) / 2 = 1/2 = 0.5
Lower boundary of the first class = 801 - 0.5 = 800.5
Upper boundary of the first class = 1000 + 0.5 = 1000.5
Width of the first class = 1000.5 - 800.5 = 200
Midpoint of the first class = (801 + 1000) / 2 = 900.5
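The same calculation can be repeated for every class of the weekly income table. This sketch assumes the gap between consecutive classes is constant (here 1 USD); the variable names are illustrative.

```python
# Class limits from the weekly income table (USD).
class_limits = [(801, 1000), (1001, 1200), (1201, 1400),
                (1401, 1600), (1601, 1800), (1801, 2000)]

# Half of the gap between one class's upper limit and the next class's lower limit.
adjustment = (class_limits[1][0] - class_limits[0][1]) / 2   # (1001 - 1000) / 2

rows = []
for lower_limit, upper_limit in class_limits:
    lower_boundary = lower_limit - adjustment
    upper_boundary = upper_limit + adjustment
    width = upper_boundary - lower_boundary
    midpoint = (lower_limit + upper_limit) / 2
    rows.append((lower_boundary, upper_boundary, width, midpoint))

for (lower_limit, upper_limit), row in zip(class_limits, rows):
    print(f"{lower_limit}-{upper_limit}: boundaries {row[0]} to {row[1]}, "
          f"width {row[2]}, midpoint {row[3]}")
```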


Constructing Frequency Distribution Tables
Important steps for constructing a frequency distribution table for continuous data:
1. Range: the number of classes depends on the range of the data.
Range = largest value - smallest value

2. Number of classes: the number of classes should be neither too large nor too small. As a general rule, the number of classes should be around $\sqrt{n}$, where n is the number of data values observed.

4. Number of columns: usually there will be two columns in a frequency table: class intervals and frequency.

Example: the following data represent the number of patients admitted by a hospital in 30 days. Construct a frequency distribution table.


Solution: In this data, the minimum value is 5 and the maximum value is 29.
Number of classes = $\sqrt{30} \approx 5$
Range = largest value - smallest value = 29 - 5 = 24
Class width = range / number of classes = 24 / 5 = 4.8 ≈ 5

Patients admitted | Frequency
5-9 | 3
10-14 | 6
15-19 | 8
20-24 | 8
25-29 | 5
Total frequency | 30

Example: Calculate the class boundaries, relative frequencies and percentages for the table in the previous example.

Patients admitted | Frequency | Relative frequency | Percentage (%)
5-9 | 3 | 3/30 = 0.1 | 0.1 × 100 = 10
10-14 | 6 | 6/30 = 0.2 | 0.2 × 100 = 20
15-19 | 8 | 8/30 = 0.267 | 0.267 × 100 = 26.7
20-24 | 8 | 8/30 = 0.267 | 0.267 × 100 = 26.7
25-29 | 5 | 5/30 = 0.167 | 0.167 × 100 = 16.7
Total | 30 | 1 | 100

Cumulative Frequency Distribution
A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.

Example: Calculate the cumulative frequencies and cumulative percentages for the table in the previous example.

Patients admitted | Frequency | Cumulative relative frequency | Percentage (%) | Cumulative percentage
5-9 | 3 | 3/30 = 0.100 | 0.1 × 100 = 10 | 10
10-14 | 6 | 9/30 = 0.300 | 0.2 × 100 = 20 | 30
15-19 | 8 | 17/30 = 0.567 | 0.267 × 100 = 26.7 | 56.7
20-24 | 8 | 25/30 = 0.833 | 0.267 × 100 = 26.7 | 83.3
25-29 | 5 | 30/30 = 1 | 0.167 × 100 = 16.7 | 100
Total | 30 | | 100 |
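The relative, percentage and cumulative columns of the patients-admitted table can be derived from the frequencies alone. A minimal sketch, with illustrative variable names:

```python
from itertools import accumulate

classes = ["5-9", "10-14", "15-19", "20-24", "25-29"]
frequencies = [3, 6, 8, 8, 5]

total = sum(frequencies)                                       # 30 days
relative = [f / total for f in frequencies]                    # frequency / total
percentages = [100 * f / total for f in frequencies]           # relative frequency x 100
cumulative = list(accumulate(frequencies))                     # running total of frequencies
cumulative_relative = [c / total for c in cumulative]
cumulative_percentages = [100 * c / total for c in cumulative]

print("Class | Freq | Rel. freq | % | Cum. freq | Cum. %")
for row in zip(classes, frequencies, relative, percentages,
               cumulative, cumulative_percentages):
    print(f"{row[0]} | {row[1]} | {row[2]:.3f} | {row[3]:.1f} | {row[4]} | {row[5]:.1f}")
```

The last cumulative frequency always equals the total frequency, which is a quick way to check the table.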

Ungrouped frequency distribution of numerical data
Ungrouped data is data that has not been organized into groups. It is also called raw data. Ungrouped data can be either numerical or categorical.

Creating a Numerical Ungrouped Frequency Distribution Table
Step 1: Arrange the data in an ascending array.
Step 2: Count the frequency of each value.
Step 3: Create a table.
Step 4: Insert the data values in the table.

Example: Blood pressure readings of 8 individuals: 120, 130, 130, 125, 140, 140, 140, 122. Create a frequency distribution table for this data.

Step 1: Arrange the data in an ascending array.
120, 122, 125, 130, 130, 140, 140, 140
Step 2: Count the frequency of each value.
120 (1), 122 (1), 125 (1), 130 (2), 140 (3)

Step 3: Create a table.
Step 4: Insert the data values in the table.
Blood pressure reading | Frequency
120 | 1
122 | 1
125 | 1
130 | 2
140 | 3
Total frequency | 8
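The four steps for an ungrouped numerical frequency table can be sketched with `collections.Counter` from the Python standard library. The variable names are illustrative.

```python
from collections import Counter

readings = [120, 130, 130, 125, 140, 140, 140, 122]   # blood pressure readings

ordered = sorted(readings)              # step 1: ascending array
counts = Counter(ordered)               # step 2: frequency of each value
table = sorted(counts.items())          # steps 3 and 4: (value, frequency) rows

print("Reading | Frequency")
for value, frequency in table:
    print(f"{value} | {frequency}")
print(f"Total frequency | {sum(counts.values())}")
```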

Creating a Categorical Frequency Distribution Table
Step 1: Count the frequency of each category.
Step 2: Create a table.
Step 3: Insert the data values in the table.

Example of ungrouped categorical data: the blood types of 20 individuals.
Blood types: A, B, O, AB, O, A, B, A, O, B, AB, A, O, B, B, A, O, AB, B, A

Step 1: Count the frequency of each category.
A = 6 individuals
B = 6 individuals
O = 5 individuals
AB = 3 individuals

Step 2: Create a table.
Step 3: Insert the data in the table.
Blood type | Frequency
A | 6
B | 6
O | 5
AB | 3
Total frequency | 20

Relative Frequency and Percentage Distributions
A relative frequency distribution shows what fractional part of the total frequency belongs to each category. The relative frequency of a category is obtained by dividing the frequency of that category by the sum of all frequencies.

The percentage for a category is obtained by multiplying the relative frequency of that category by 100. A percentage distribution lists the percentages for all categories.
Calculating percentage: Percentage = (Relative frequency) × 100
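As an illustration of these definitions, the sketch below counts the 20 blood types listed earlier and derives the relative frequency and percentage of each category. Counting the list directly gives A = 6, B = 6, O = 5 and AB = 3. The variable names are illustrative.

```python
from collections import Counter

blood_types = ["A", "B", "O", "AB", "O", "A", "B", "A", "O", "B",
               "AB", "A", "O", "B", "B", "A", "O", "AB", "B", "A"]

counts = Counter(blood_types)
total = sum(counts.values())

# Relative frequency = frequency / sum of all frequencies; percentage = relative frequency x 100.
relative = {category: frequency / total for category, frequency in counts.items()}
percentage = {category: value * 100 for category, value in relative.items()}

print("Blood type | Frequency | Relative frequency | Percentage")
for category in ["A", "B", "O", "AB"]:
    print(f"{category} | {counts[category]} | {relative[category]:.2f} | {percentage[category]:.0f}")
print(f"Total | {total} | {sum(relative.values()):.2f} | {sum(percentage.values()):.0f}")
```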

Example: Determine the relative frequency and percentage distributions for this data.

Chapter Five: Visualization and Presentation of Data

Techniques of data presentation
Data can be presented in:
- Tabular form
- Graphical form

Tabular data presentation
A table contains data in rows and columns.
Types of tables:
- Univariate table
- Bivariate table
- Multivariate table

Univariate Table-2: Age
Age | Frequency | Percentage
21-26 | 6 | 30
27-32 | 6 | 30
33-38 | 2 | 10
39-44 | 3 | 15
45-50 | 3 | 15
Total | 20 | 100

Bivariate Table-1: Sex and Age
Age | Male | Female | Total
21-26 | 1 | 5 | 6
27-32 | 3 | 3 | 6
33-38 | - | 2 | 2
39-44 | 3 | - | 3
45-50 | 1 | 2 | 3
Total | 8 | 12 | 20

Multivariate Table-3: Age, sex and residence
Age | Male, Urban | Male, Rural | Female, Urban | Female, Rural | Total
21-26 | 1 | 2 | 5 | 1 | 9
27-32 | 3 | 2 | 3 | 2 | 10
33-38 | - | 1 | 2 | 1 | 4
39-44 | 3 | 2 | - | 2 | 7
45-50 | 1 | 3 | 2 | 1 | 7
Total | 8 | 10 | 12 | 7 | 37

Graphical presentation of data
Tabulation is an important, systematic way of presenting data, but data is often revealed more easily by diagrams or graphs.

Types of graphical presentation
Data type | Type of graph
Qualitative (univariate) | Simple bar chart, component bar chart, pie chart, multiple pie chart
Quantitative | Histogram, line graph/chart

Simple bar chart
A simple bar chart is used for presenting univariate qualitative data. Bar charts have a horizontal axis called the X-axis and a vertical axis called the Y-axis. Categories are placed on the X-axis and percentage or frequency on the Y-axis.

Component bar chart
To draw a component bar chart, divide 100% into components equal in number to the categories of the variable you want to draw.

Pie chart
A pie chart is a circular statistical graph that is divided into slices to illustrate the numerical proportion of each category.

Multiple bar chart
A multiple bar chart is a type of bar chart that is used for bivariate qualitative data.
Using this data, construct a multiple bar chart.
Sex | Diabetes | No diabetes
Male | 3 | 5
Female | 8 | 4
Total | 11 | 9

Graphs for Quantitative Variables
Graphs used to present quantitative univariate variables include:
- Histogram
- Line graph/line chart

Histogram
The histogram is a common graph for quantitative variables. It is similar to a bar chart except that there are no gaps between its bars.

Chapter Six: Probability and Normal Distribution of Data
Probability is the likelihood of occurrence of an event and is measured by the proportion of times the event occurs. The event is denoted by E, the number of times the event occurs by n, and the number of all possible outcomes by N.
P(E) = (number of times the event occurs) / (number of all possible outcomes), or P(E) = n/N

EXAMPLE 1
A coin is tossed. What is the probability of getting a head?
A coin has two outcomes, head and tail, so the total number of outcomes (N) is 2.
There is only one head, so the number of times the event (head) occurs is n = 1.
P(Head) = n/N = 1/2 = 0.5
The probability of getting a head when a coin is tossed is 0.5 or 1/2.

EXAMPLE 2
The OPD attendance of a hospital is shown here.
Disease | Frequency
Diabetes | 80
Hypertension | 40
Total | 120
What is the probability that a randomly selected individual has diabetes?
What is the probability that a randomly selected individual has hypertension?

Solution
P(Diabetes) = n/N = 80/120 = 0.67
P(Hypertension) = n/N = 40/120 = 0.33
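A minimal sketch of P(E) = n/N applied to the OPD table, using `fractions.Fraction` so the exact proportions are kept before rounding. The function name `probability` is illustrative.

```python
from fractions import Fraction

opd_attendance = {"Diabetes": 80, "Hypertension": 40}
total_patients = sum(opd_attendance.values())       # N = 120

def probability(event_count, total_count):
    """P(E) = n / N, kept as an exact fraction."""
    return Fraction(event_count, total_count)

p_diabetes = probability(opd_attendance["Diabetes"], total_patients)
p_hypertension = probability(opd_attendance["Hypertension"], total_patients)

print(f"P(Diabetes) = {p_diabetes} = {float(p_diabetes):.2f}")
print(f"P(Hypertension) = {p_hypertension} = {float(p_hypertension):.2f}")

# The probabilities of all outcomes sum to 1.
assert p_diabetes + p_hypertension == 1
```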

Characteristics of Events
Events possess certain characteristics, which are:
- Mutually exclusive events
- Mutually non-exclusive events
- Independent events
- Dependent events

Mutually exclusive events
The events of a trial are called mutually exclusive if only one of them can occur in each single trial. This means that the events cannot occur simultaneously: if one event occurs, the other cannot.
Example: if a coin is tossed, for any toss (trial) there is only one event (either head or tail).

Mutually non-exclusive events
Events which can occur simultaneously are called mutually non-exclusive. For example, an individual can have only diabetes, only hypertension, or both diabetes and hypertension at the same time.

Example: Suppose the OPD attendance has two categories, people with diabetes and people with hypertension. However, there are some people who have both diabetes and hypertension. Thus events like diabetes and hypertension are considered mutually non-exclusive events.

Independent events
If A and B are two events of a particular trial, and the outcome of event A does not affect and is not affected by the outcome of event B, then A and B are called independent events.
For example: if you toss two coins, the outcome of the first toss (head or tail) does not affect and is not affected by the outcome of the second toss.

Dependent events
If the outcome of event A influences the outcome of event B, or B affects A, then event A and event B are considered dependent events.
Examples:
- Smoking and lung cancer
- Driving a car and getting in a traffic accident
- Robbing a bank and going to jail

Properties of probability
Probability is expressed as a proportion, so it takes any value from 0 to 1. It can also be shown as a percentage, from 0 to 100%.
A probability of 1 means that the event is certain to occur (e.g. the probability of eventually dying).
A probability of 0 means that the event is certain not to occur (e.g. the probability of never dying).

A probability of 0.5 means that the event is as likely to occur as not to occur.
The higher the probability value, the higher the chance of occurrence; the smaller the probability value, the lower the chance of occurrence.
The sum of the probabilities of all events must be equal to 1 or 100%.

Types of probability
According to the time of occurrence of events, probability is categorized as:
A priori probability: calculated before the occurrence of the event by logically examining existing knowledge. It usually deals with independent events. For example, the probability of getting a head or a tail is 1/2 or 0.5.

A posteriori probability: calculated after the occurrence of the event; that is, it is based on the frequency of occurrence. For example: the number of hypertensive patients in a sample of 100 patients.

Rules of probability
There are two basic rules in probability:
- Addition rule
- Multiplication rule

Addition Rule
This rule applies to both mutually exclusive and mutually non-exclusive events of a single random variable. The rule is characterized by the term "or" between the two events (sometimes written ∪, meaning union), e.g. P(A or B), sometimes also shown as P(A ∪ B).
For mutually exclusive events: P(A or B) = P(A) + P(B)
For mutually non-exclusive events: P(A or B) = P(A) + P(B) - P(A and B)

Example 1 (mutually exclusive events): A single 6-sided die is rolled. What is the probability of rolling a 2 or a 5?
Solution: Since 2 and 5 are mutually exclusive, P(2 and 5) = 0. P(2) = 1/6, P(5) = 1/6.
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3 = 0.333

Example 2 (mutually exclusive and mutually non-exclusive events): Suppose patients attending a hospital OPD are categorized as in the following table.
Disease                            No. of patients
Eye disease                        5
Respiratory disease                15
Only diabetes                      90
Only heart disease                 30
Both diabetes and heart disease    10
Total                              150

If a person is drawn at random:
a. What is the probability that he/she will have eye disease or respiratory disease?
b. What is the probability that he/she will have diabetes or heart disease?

Solution
a. Eye disease or respiratory disease (mutually exclusive here)
Patients with eye disease = 5; patients with respiratory disease = 15; total patients = 150
P(eye disease or respiratory disease) = 5/150 + 15/150 = 20/150 = 0.13

b. Diabetes or heart disease (mutually non-exclusive here)
Patients with diabetes = 90 + 10 = 100; patients with heart disease = 30 + 10 = 40; total patients = 150
P(diabetes or heart disease) = P(diabetes) + P(heart disease) - P(diabetes and heart disease)
P(diabetes or heart disease) = 100/150 + 40/150 - 10/150 = 130/150 = 0.87
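
The two OPD calculations above can be checked with a few lines of Python. This is only a sketch: the helper name `p_or` is made up for illustration, and the counts are the ones in the OPD table.

```python
from fractions import Fraction

def p_or(p_a, p_b, p_both=Fraction(0)):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
    For mutually exclusive events P(A and B) is 0."""
    return p_a + p_b - p_both

total = 150  # total patients in the OPD table

# a. Eye disease (5) or respiratory disease (15): mutually exclusive
p_eye_or_resp = p_or(Fraction(5, total), Fraction(15, total))

# b. Diabetes (90 + 10) or heart disease (30 + 10): 10 patients have both
p_diab_or_heart = p_or(Fraction(100, total), Fraction(40, total), Fraction(10, total))

print(float(p_eye_or_resp))    # roughly 0.13
print(float(p_diab_or_heart))  # roughly 0.87
```

Exact fractions are used so that rounding only happens when the result is printed.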

Normal Distribution of data: In the normal distribution, observations cluster around the mean. Half of the observations lie above the mean and half below it, and the observations are symmetrically distributed on each side of the mean.

Characteristics of Normal Curve/Distribution
The normal curve is symmetrical and bell shaped
The maximum is at the centre and the curve decreases towards zero symmetrically on each side
Mean, median and mode are all equal
Mean ± 1SD limits include about 68.3% of all observations
Mean ± 2SD limits include about 95.4% of all observations (exactly 95% lie within Mean ± 1.96SD)
Mean ± 3SD limits include about 99.7% of all observations
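
The coverage figures for the normal curve can be reproduced from the error function in Python's standard library. This is an illustrative sketch; the function name `coverage_within` is an assumption, not something from the slides.

```python
import math

def coverage_within(k):
    """Probability that a normally distributed observation
    lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    # Print the multiplier and the percentage of observations covered
    print(k, round(100 * coverage_within(k), 1))
```

The loop shows why 1.96, rather than 2, is the multiplier used for 95% limits.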

[Figure: Normal curve]

Skewed Distributions: Distributions that are not symmetric and have a long tail in one direction are called skewed distributions. In a skewed distribution, most values lie closer to one end and relatively few values lie in the other direction.

Positively Skewed Distributions: If the tail of the distribution extends to the right (positive side), the distribution is called a positively skewed or right-skewed distribution. In right-skewed distributions, the majority of the values lie in the left part of the distribution.

Negatively Skewed Distributions: If the tail of the distribution extends to the left (negative side), the distribution is called a negatively skewed or left-skewed distribution. In left-skewed distributions, the majority of the values lie on the right side of the distribution.

[Figure: Left- and right-skewed examples]


Section II Inferential Biostatistics 155

Inferential Biostatistics Descriptive statistics remains local to the sample, describing its central tendency and variability while inferential statistics focuses on making statements about the population. 156

Statistics vs. Parameter
Measure              Statistic (sample value)   Parameter (population value)
Mean                 $\bar{x}$                  $\mu$
Variance             $s^2$                      $\sigma^2$
Standard deviation   $s$                        $\sigma$
Proportion           $\hat{p}$                  $p$

Chapter Seven Hypothesis and significance testing 158

Test of significance is the determination of whether a result is statistically significant or if it could have occurred by chance. 159

Hypothesis: It is the researcher's assumed answer about the relationship between two variables or the significance of a test result. There are two statistical hypotheses: the null hypothesis and the alternative hypothesis.

Null Hypothesis: It states that there is no real difference between the statistic and the parameter, say sample mean = population mean. Any observed difference is due to chance alone. The null hypothesis is denoted by the symbol H0.

Alternative hypothesis: It states that there is a real difference between the statistic and the parameter, say sample mean ≠ population mean. The alternative hypothesis is denoted by the symbol H1 or Ha.
H0: µ1 = µ2
Ha: µ1 ≠ µ2
When the null hypothesis is rejected, the alternative hypothesis is accepted.

P-Value: The p-value is the probability, calculated assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. It is commonly read as the amount of support the data give to the null hypothesis. The p-value lies between 0 and 1 (0% to 100%): as it approaches 0, the support for H0 becomes weaker and weaker, while as it approaches 1, the data are more and more consistent with H0.

Level of significance: In order to decide whether the support is strong or weak we need some cut-off value or level. This cut-off value or level is known as the level of significance, denoted by α.

Internationally accepted levels of significance
10% (or 0.1)
5% (or 0.05)
1% (or 0.01)
The most commonly used is 5% (or 0.05).

The zone of null hypothesis acceptance
1] If the calculated value (ignoring its sign) is less than the tabulated value, the null hypothesis is accepted (more precisely, not rejected) and the alternative hypothesis is rejected. (Calculation based)
2] If the p-value is greater than or equal to the chosen significance level (p-value ≥ 0.05), the null hypothesis is accepted and the alternative hypothesis is rejected. (Computer based)

The zone of null hypothesis rejection
1] If the calculated value (ignoring its sign) is greater than the tabulated value, the null hypothesis is rejected and the alternative hypothesis is accepted. (Calculation based)
2] If the p-value is less than the chosen significance level (most commonly p-value < 0.05), the null hypothesis is rejected and the alternative hypothesis is accepted. (Computer based)

One-Tailed and Two-Tailed Tests: The null hypothesis can be tested using either a one-tailed or a two-tailed test.
One-Tailed Test: A test in which the alternative hypothesis favors only one direction is called a one-tailed test. Example: suppose a study compares two drugs, Drug A and Drug B.

So the null hypothesis (H0) = Drug A is not more effective than Drug B, and the alternative hypothesis (Ha) = Drug A is more effective than Drug B.
H0: Drug A ≤ Drug B (often written Drug A = Drug B)
Ha: Drug A > Drug B

Two-tailed Test: In a two-tailed test, deviations in both directions are considered. For example, in the previous comparison of the effectiveness of Drug A and Drug B, the two-tailed null and alternative hypotheses are: H0 = Drug A and Drug B have the same effect; Ha = Drug A and Drug B do not have the same effect. In short:
H0: Drug A = Drug B
Ha: Drug A ≠ Drug B


Steps for Hypothesis Testing
A) Describe the given data.
B) State the assumptions (an assumption is an unexamined belief).
C) State the null and alternative hypotheses.
D) State the level of significance.
E) Choose the test statistic (z-test, t-test, ANOVA, χ²).
F) Compute the test statistic.

G) Look up the tabulated test statistic corresponding to the significance level and degrees of freedom and compare it with the calculated test statistic (or compare the p-value with the significance level). If the calculated test statistic > the tabulated test statistic, we reject the null hypothesis; otherwise we do not reject (accept) the null hypothesis.
H) Decision: Reject or accept the null hypothesis.
I) Conclusion: Conclude in the language of the accepted hypothesis.

Chapter Eight: Testing the significance of the difference between two and three sample means

Testing the significance of the difference between two sample means: When we want to determine whether the difference between two group means is significant (large enough) or insignificant (due only to chance), we use a Z-test or a t-test. Here are the decision criteria for choosing between the Z-test and t-tests.


Z-test (normal test)
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Tabulated z values
Significance level (α)   Two-tailed, 1-(α/2)   One-tailed (>), 1-α   One-tailed (<), 1-α
10% (or 0.1)             1.64                  1.28                  -1.28
5% (or 0.05)             1.96                  1.64                  -1.64
1% (or 0.01)             2.58                  2.33                  -2.33

Example: The mean birth weight of babies born in a large community over several years was 2470 grams with a standard deviation of 230 grams. Following implementation of an ANC program, the mean birth weight obtained from a sample of 40 babies was 2560 grams with a standard deviation of 250 grams. Does the ANC program have any impact on the birth weight of newborn babies?

Solution
Data: Given μ = 2470 g, $\bar{x}$ = 2560 g, σ = 230 g, s = 250 g, n = 40
Assumptions: a) birth weight in the baby population is normally distributed; b) the sample was selected at random
Hypothesis: H0: μ = 2470 g (the mean birth weight of the population does not change after ANC). Ha: μ ≠ 2470 g (the mean birth weight of the population changes after ANC).
Level of significance (α): 5% (0.05)
Choose test statistic: since σ is known, we do a Z-test

Compute the test statistic: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{2560 - 2470}{230 / \sqrt{40}} = \frac{90}{36.37} = 2.47$
Compare the calculated z to the tabulated z: the tabulated z at the 5% level of significance (two-tailed) is 1.96
Decision: we reject the null hypothesis since the calculated z (2.47) > the tabulated z (1.96)
Conclusion: the mean birth weight of babies has changed (increased) after implementation of the ANC program.
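
As a cross-check of the birth-weight example, the z statistic and its two-sided p-value can be computed with the standard library alone. This is a sketch; the function names `one_sample_z` and `two_sided_p` are illustrative assumptions.

```python
import math

def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """z = (sample mean - population mean) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

z = one_sample_z(sample_mean=2560, pop_mean=2470, pop_sd=230, n=40)
p = two_sided_p(z)

print(round(z, 2), round(p, 3))
print("reject H0" if abs(z) > 1.96 else "do not reject H0")
```

Both routes in the slides agree here: the calculated z exceeds 1.96, and the p-value falls below 0.05.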

Example-2: The hemoglobin (Hb) level of children was measured in 143 girls and 127 boys. Here are the results. The girls have a higher Hb level than the boys on average, so the question is whether the observed difference is significant or not.
        Girls   Boys
Mean    11.2    11.0
SD      1.4     1.3
n       143     127

Solution
Data: Given $\bar{x}_1$ = 11.2, $\bar{x}_2$ = 11.0, s1 = 1.4, s2 = 1.3, n1 = 143, n2 = 127
Assumptions: a) Hb level in the population is normally distributed; b) the sample was selected at random
Hypothesis: H0: μ1 = μ2 (any observed difference is due to chance alone). Ha: μ1 ≠ μ2 (the mean Hb levels of girls and boys differ significantly).
Level of significance (α): 5% (0.05)
Choose test statistic: since n > 30, we do a Z-test

Compute the test statistic: $z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{11.2 - 11.0}{\sqrt{\frac{1.4^2}{143} + \frac{1.3^2}{127}}} = \frac{0.2}{0.1644} = 1.22$
Compare the calculated z to the tabulated z: the tabulated z at the 5% level of significance is 1.96
Decision: we accept (do not reject) the null hypothesis since the calculated z (1.22) is < the tabulated z (1.96)
Conclusion: the mean Hb levels of girls and boys are not significantly different.
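
The two-sample calculation can be sketched as follows. The function name `two_sample_z` is an assumption for illustration. Note that the variances (SD squared), not the SDs themselves, go under the square root; with the figures in the table this gives a standard error of about 0.164 and z of about 1.22, which is still below 1.96.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent sample means.
    Returns (z, standard error of the difference)."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se, se

# Girls: mean 11.2, SD 1.4, n 143; boys: mean 11.0, SD 1.3, n 127
z, se = two_sample_z(11.2, 1.4, 143, 11.0, 1.3, 127)

print(round(se, 4), round(z, 2))
print("reject H0" if abs(z) > 1.96 else "do not reject H0")
```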

t Test 184

The t test is a test for comparing the mean of one sample with a hypothesized value, as well as the means in two-sample situations.
Types of t test
a) One sample t test
b) Independent sample t test
c) Paired sample t test

One sample t test: The one sample t test is used to test whether a population mean is significantly different from some hypothesized value.
$t = \frac{\bar{x} - m}{s / \sqrt{n}}$
where $\bar{x}$ is the sample mean, m is the hypothesized value, s is the sample SD and n is the sample size.

Example: A professor of statistics wants to know whether his introductory statistics class has a good grasp of basic math. Six students were chosen at random from the class and given a math proficiency test. The professor wants the class to be able to score above 70 on the test. The six students get scores of 63, 93, 75, 68, 83 and 92, with an SD of 13.17. Can the professor have 95% confidence that the mean score for the class on the test would be above 70?

Since the population standard deviation is not known, we use the t test.
Solution
H0: μ ≤ 70; Ha: μ > 70
$\bar{x}$ = (63 + 93 + 75 + 68 + 83 + 92)/6 = 474/6 = 79
m = 70, s = 13.17
$t = \frac{\bar{x} - m}{s / \sqrt{n}}$

Solution: $t = \frac{79 - 70}{13.17 / \sqrt{6}} = \frac{9}{5.38} = 1.67$; df = n-1 = 6-1 = 5. Note that we are testing only whether the mean score of the students is greater than 70, so we are dealing with a one-tailed t-test.

The tabulated t value at the 5% significance level (one-tailed) with df = 5 is 2.015. The calculated t (1.67) is less than the tabulated t (calculated t < tabulated t 0.05,5), so the null hypothesis is accepted (not rejected): the professor cannot be 95% confident that the class mean is above 70.
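
A short sketch of the one sample t test using the six raw scores is given below. One caveat: recomputing the sample SD from the listed scores gives about 12.44 rather than the quoted 13.17, so t comes out near 1.77 instead of 1.67. Either way it stays below the tabulated 2.015, so the decision is unchanged. The critical value is copied from the slide, not computed.

```python
import math
import statistics

scores = [63, 93, 75, 68, 83, 92]
hypothesized_mean = 70

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample SD (n - 1 in the denominator)
t = (mean - hypothesized_mean) / (sd / math.sqrt(n))
df = n - 1

t_critical = 2.015  # one-tailed, alpha = 0.05, df = 5 (value quoted in the slides)
print(mean, round(sd, 2), round(t, 2), df)
print("reject H0" if t > t_critical else "do not reject H0")
```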

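Independent sample t-test: The independent sample t-test is used to compare the means of two independent groups. Usually the independent (grouping) variable is qualitative with two categories and the dependent variable is quantitative and continuous, such as the height of males and females or the blood pressure of two groups. Example: testing whether male income and female income are different or not.
$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$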

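Ex: Here is the blood pressure of male and female patients. The question is whether the blood pressure of the two groups differs.
                   Male   Female
n                  25     25
Mean ($\bar{x}$)   155    160
S                  10     8
Solution
H0: μ(male) = μ(female); Ha: μ(male) ≠ μ(female)
$s_p^2 = \frac{24 \times 10^2 + 24 \times 8^2}{48} = 82$
$t = \frac{155 - 160}{\sqrt{82 \left(\frac{1}{25} + \frac{1}{25}\right)}} = \frac{-5}{2.56} = -1.95$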

df = n1 + n2 - 2 = 25 + 25 - 2 = 48. At the 5% significance level, the tabulated t = 2.021 (the nearest table row, df = 40). Ignoring the sign, t calculated < t tabulated, so the null hypothesis is accepted. We can conclude that the two means (the mean male blood pressure and the mean female blood pressure) are not significantly different.
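
The blood-pressure comparison can be reproduced from the summary figures alone. The sketch below assumes the pooled-variance form of the independent t-test, which matches df = n1 + n2 - 2; the function name `pooled_t` is illustrative.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic with a pooled variance estimate.
    Returns (t, degrees of freedom)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

# Male: mean 155, SD 10, n 25; female: mean 160, SD 8, n 25
t, df = pooled_t(155, 10, 25, 160, 8, 25)

print(round(t, 2), df)
print("reject H0" if abs(t) > 2.021 else "do not reject H0")
```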

Paired sample t test: The paired sample t test is used to test the mean difference of two dependent observations, such as blood pressure before exercise and blood pressure after exercise for a group of individuals. In the independent t test we were interested in between-group differences, but in the paired t test we are interested in within-group differences.
$t = \frac{\bar{d}}{s_d / \sqrt{n}}$, where $\bar{d} = \frac{\sum d}{n}$ is the mean difference between the pairs (e.g. before and after), $s_d$ is the SD of the differences and n is the number of pairs.

Example: Here is the temperature of 8 individuals before and after the treatment.
Patient   Before (X)   After (Y)
1         25.8         24.7
2         26.7         25.8
3         27.3         26.3
4         26.1         25.2
5         26.4         25.5
6         27.4         26.6
7         27.1         26.0
8         26.2         25.0

Solution: Let us first calculate d and d².
Patient   Before (X)   After (Y)   d = x-y   d²
1         25.8         24.7        1.1       1.21
2         26.7         25.8        0.9       0.81
3         27.3         26.3        1.0       1.00
4         26.1         25.2        0.9       0.81
5         26.4         25.5        0.9       0.81
6         27.4         26.6        0.8       0.64
7         27.1         26.0        1.1       1.21
8         26.2         25.0        1.2       1.44
                                   ∑d = 7.9  ∑d² = 7.93

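$\bar{d} = \frac{\sum d}{n} = \frac{7.9}{8} = 0.9875$
$s_d^2$ (variance of d) $= \frac{\sum d^2 - (\sum d)^2 / n}{n - 1} = \frac{7.93 - 7.80}{7} = 0.0184$, so $s_d = \sqrt{0.0184} = 0.136$
$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{0.9875}{0.136 / \sqrt{8}} \approx 20.6$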

The tabulated t value with df 8-1=7 at 5% significance level is 2.365, so the calculated t>tabulated t with 7df at 5% significance level. Decision: Null hypothesis is rejected and alternative hypothesis is accepted. We conclude that the temperature of the individuals before and after treatment is not the same.
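
The paired calculation can be checked directly from the before and after readings. This is a sketch using only the standard library, with the critical value 2.365 taken from the slide. From these eight pairs the differences have a mean of about 0.99 and an SD of about 0.14, which gives a t statistic of roughly 20.6.

```python
import math
import statistics

before = [25.8, 26.7, 27.3, 26.1, 26.4, 27.4, 27.1, 26.2]
after = [24.7, 25.8, 26.3, 25.2, 25.5, 26.6, 26.0, 25.0]

# Within-patient differences (before minus after)
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)  # sample SD of the differences
t = mean_d / (sd_d / math.sqrt(n))

t_critical = 2.365  # two-tailed, alpha = 0.05, df = 7 (value quoted in the slides)
print(round(mean_d, 4), round(sd_d, 4), round(t, 1))
print("reject H0" if abs(t) > t_critical else "do not reject H0")
```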

Analysis of Variance (ANOVA or F test) 199

Analysis of Variance (ANOVA or F test): Analysis of variance is a statistical method of analyzing data with the objective of comparing three or more group means. It replaces the t-test, which compares two group means only. Analysis of variance is sometimes called the F test, after R. A. Fisher, the British statistician who developed it.

One way ANOVA: used when we have one continuous dependent variable and one categorical independent variable with more than two categories, to compare the means of these groups. Example: if we want to know whether people residing in three different areas (rural, urban and semi-urban) earn different incomes.

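How to calculate One-Way ANOVA
1) $F = MSS_{BG} / MSS_{WG}$
2) $SS_T = \sum x^2 - \frac{(\sum x)^2}{N}$, or $SS_T = SS_{BG} + SS_{WG}$
3) $SS_{BG} = \sum_j \frac{(\sum x_j)^2}{n_j} - \frac{(\sum x)^2}{N}$
4) $SS_{WG} = SS_T - SS_{BG}$

5) $MSS_{BG} = \frac{SS_{BG}}{k - 1}$
6) $MSS_{WG} = \frac{SS_{WG}}{N - k}$
7) $F = \frac{MSS_{BG}}{MSS_{WG}}$
where k is the number of groups, $n_j$ is the size of group j and N is the total number of observations.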


Example: Three different treatments are given to 3 groups of patients with anemia. The increase in Hb% level was noted after one month and is given in Table 2.0. We are interested in finding whether the difference in improvement among the 3 groups is significant or not.


Group A   Group B   Group C
x1        x2        x3
3         3         3
1         2         4
2         2         5
0         3         4
1         1         2
2         3         2
2         2         4

Solution
Group A   Group B   Group C   Group A   Group B   Group C
x1        x2        x3        x1²       x2²       x3²
3         3         3         9         9         9
1         2         4         1         4         16
2         2         5         4         4         25
0         3         4         0         9         16
1         1         2         1         1         4
2         3         2         4         9         4
2         2         4         4         4         16
∑x1 = 11  ∑x2 = 16  ∑x3 = 24  ∑x1² = 23 ∑x2² = 40 ∑x3² = 90
∑x² = 23 + 40 + 90 = 153
∑x = 11 + 16 + 24 = 51

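2) $SS_T = \sum x^2 - \frac{(\sum x)^2}{N} = 153 - \frac{51^2}{21} = 153 - 123.86 = 29.14$
3) $SS_{BG} = \frac{11^2}{7} + \frac{16^2}{7} + \frac{24^2}{7} - \frac{51^2}{21} = 136.14 - 123.86 = 12.28$
4) $SS_{WG} = SS_T - SS_{BG} = 29.14 - 12.28 = 16.86$
5) $MSS_{BG} = \frac{SS_{BG}}{k - 1} = \frac{12.28}{2} = 6.14$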

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F
Between groups        k-1 = 3-1 = 2        12.28            6.14           6.53
Within groups         N-k = 21-3 = 18      16.86            0.94
6) $MSS_{WG} = \frac{SS_{WG}}{N - k} = \frac{16.86}{18} = 0.94$
7) $F = \frac{MSS_{BG}}{MSS_{WG}} = \frac{6.14}{0.94} = 6.53$

Interpretation: The tabulated F value at df 2,18 is 3.55 at the 5% level of significance. Our calculated F value is 6.53, which is greater than the tabulated F value (F calculated > F tabulated, 6.53 > 3.55). Thus the null hypothesis is rejected. Hence we conclude that the mean increase in Hb% differs significantly for at least one of the groups.
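
The sums of squares and the F ratio can be reproduced with a short function. This is a sketch under one stated assumption: the fourth value in Group A is taken as 0, which is inferred from the group totals (∑x1 = 11 and ∑x1² = 23 with seven observations per group). The function name `one_way_anova` is illustrative. Carrying full precision gives F of about 6.56; the 6.53 in the table comes from rounding the intermediate values.

```python
def one_way_anova(groups):
    """Return (SS_total, SS_between, SS_within, F) for a list of groups."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    k = len(groups)
    correction = sum(all_values) ** 2 / n_total  # (sum of x)^2 / N
    ss_total = sum(x * x for x in all_values) - correction
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_between
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ss_total, ss_between, ss_within, ms_between / ms_within

group_a = [3, 1, 2, 0, 1, 2, 2]  # the 0 is an assumption inferred from the totals
group_b = [3, 2, 2, 3, 1, 3, 2]
group_c = [3, 4, 5, 4, 2, 2, 4]

ss_total, ss_between, ss_within, f_ratio = one_way_anova([group_a, group_b, group_c])
print(round(ss_total, 2), round(ss_between, 2), round(ss_within, 2), round(f_ratio, 2))
print("reject H0" if f_ratio > 3.55 else "do not reject H0")  # 3.55 = tabulated F(2, 18)
```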

Chapter Nine: Association, Correlation and Prediction

Chi-square Test

A chi-square (χ²) test is used to test for a statistical association between two categorical variables, each with two or more categories (usually two).


df = (r-1)(c-1), where r = number of rows and c = number of columns.
Example: Suppose a researcher wants to test whether people's knowledge is associated with service utilization. He conducted a sample survey of 100 individuals, of whom 78 had a high level of knowledge.

Of the 78 who had high knowledge, 50 were service users, whereas of the 22 who had a low knowledge level, 10 used the service. Do these data provide evidence of an association between knowledge level and service utilization?


2. Assumption: the sample was drawn randomly, the observations are independent, and the expected count in each cell is adequate (conventionally at least 5).
3. Hypothesis: H0: There is no association between "knowledge level" and "service utilization". Ha: There is an association between "knowledge level" and "service utilization".
4. Level of significance: α = 5% (0.05)


7. Compute the degrees of freedom (df): df = (r-1)(c-1) = (2-1)(2-1) = 1
8. Tabulated value of χ²: with df = 1 and 5% level of significance = 3.84
9. Compare the computed value with the tabulated value: calculated χ² (2.481) < tabulated χ² (3.84)
10. Decision: H0 is accepted
11. Conclusion: the data do not provide evidence of an association between knowledge level and service utilization
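
The chi-square calculation for this example can be sketched as below, using the usual statistic χ² = ∑(O - E)²/E with expected counts E = row total × column total / grand total. The non-user counts (28 and 12) are derived from the figures in the example (78 - 50 and 22 - 10), and the function name `chi_square` is illustrative. Carrying full precision gives about 2.49, close to the 2.481 quoted in step 9, and the decision is the same.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom
    for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Rows: high knowledge, low knowledge; columns: service users, non-users
observed = [[50, 28],
            [10, 12]]

chi2, df = chi_square(observed)
print(round(chi2, 2), df)
print("reject H0" if chi2 > 3.84 else "do not reject H0")
```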

Correlation analysis 223

When one quantitative variable changes with the change of another quantitative variable, they are said to be correlated. The variable that changes the other variable is called the independent variable (IV) and the variable that is changed is called the dependent variable (DV). The DV is represented by Y and the IV is represented by X.

Example: Income and age are both quantitative. They are correlated because when age changes, income changes as well. Therefore age is X (IV) while income is Y (DV). When the change occurs at a constant rate it is called linear correlation. The correlation between one DV and one IV is called simple correlation, e.g. the correlation between income and age.

The correlation between one DV and two or more IVs is called multiple correlation, e.g. the correlation between income, age and family size.
Correlation Coefficient (r): To calculate the correlation between variables, we use a measure called the correlation coefficient (r).


Characteristics of relationship: The correlation coefficient (r) indicates both the strength and the direction of the relationship.
Strength (magnitude) of the relationship (ignoring the sign of r):
0 = no correlation
≤ 0.3 = weak correlation
0.4-0.6 = moderate correlation
0.7-1 = strong correlation

When the correlation coefficient is one (either + or -) it indicates a perfect correlation. As r approaches 1 (either + or -), the strength of the relationship increases.

Direction of relationship: the relationship can be positive, negative or absent (no correlation).
Positive correlation: the two variables move in the same direction (increase or decrease together), e.g. gestational period and birth weight. This is when r is positive.
Negative correlation: the two variables move in opposite directions (when one increases the other decreases), e.g. age and eyesight. This is when r is negative.

No correlation: a change in one variable is not accompanied by a change in the other variable, e.g. age and sex. This is when r = 0.
Example: Suppose 4 people were selected as a sample to determine the correlation between weight and height.

Weight in pounds (Y)   Height in inches (X)   Y²            X²            XY
240                    73                     57600         5329          17520
210                    70                     44100         4900          14700
180                    69                     32400         4761          12420
160                    68                     25600         4624          10880
∑Y = 790               ∑X = 280               ∑Y² = 159700  ∑X² = 19614   ∑XY = 55520


Interpretation: There is a very strong positive correlation between the weight and height of the respondents.

Coefficient of Determination (r²): The squared value of r is called the coefficient of determination. The coefficient of determination (r²) measures the amount of variability in Y (DV) that is explained by X (IV). It is usually shown as a percentage.

Example: for the above example the correlation coefficient (r) is 0.97, thus the coefficient of determination (r²) is 0.97 × 0.97 = 0.94, or 94%.
Interpretation: 94% of the variability in weight (DV) is explained by height (IV). This means the remaining 6% of the variability in weight is attributable to variables other than height.

Correlation Significance Test: To test the significance of the correlation value we use the following formula to find the calculated t-value:
$t = r \sqrt{\frac{n - 2}{1 - r^2}} = 0.97 \times \sqrt{\frac{2}{0.06}} = 0.97 \times 5.77 = 5.6$ (calculated t-value)

Then, assuming a significance level of 0.05, we find the degrees of freedom, which for this test is n-2, and read the tabulated t-value from the t-table at the junction of the significance level and the degrees of freedom. The tabulated t-value for a two-tailed test at the 0.05 significance level with 2 degrees of freedom is 4.303.

Since the calculated t-value of 5.6 is > the tabulated t-value of 4.303, the null hypothesis is rejected. Therefore we can conclude that there is a significant, very strong positive correlation between the height and weight of our participants.
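
The correlation coefficient, the coefficient of determination and the t statistic for the height and weight data can be reproduced as follows. This is a sketch using the computational formula for Pearson's r; the function name `pearson_r` is illustrative. The t statistic is compared against a t-table with n - 2 degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from raw sums."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

height = [73, 70, 69, 68]      # X (IV), inches
weight = [240, 210, 180, 160]  # Y (DV), pounds

n = len(height)
r = pearson_r(height, weight)
r_squared = r ** 2
t = r * math.sqrt((n - 2) / (1 - r_squared))  # df = n - 2

print(round(r, 2), round(r_squared, 2), round(t, 1))
```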

Regression Analysis: A statistical procedure used to find relationships among a set of variables. In regression analysis there is a dependent variable, which is the one you are trying to explain, and one or more independent variables that are related to it.

REGRESSION TYPES
1) Linear regression = quantitative DV
   simple (1 DV and 1 IV)
   multiple (multiple IVs and 1 DV)
2) Logistic regression = qualitative DV
   A) Binary = DV with 2 categories
      simple (1 DV and 1 IV)
      multiple (multiple IVs and 1 DV)
   B) Multinomial = DV with > 2 categories
   C) Ordinal = DV which is ordinal

Linear Regression: Linear regression is used when the dependent variable is continuous and assumes a linear relationship with the independent variables. It aims to find the best-fitting line that represents the relationship between the dependent variable and one or more independent variables.

For example, a study might use linear regression to determine the relationship between smoking behavior (independent variable) and lung function (dependent variable) among a sample of individuals.

Logistic Regression: Logistic regression is used when the dependent variable is categorical or binary. It models the probability of an event occurring or the likelihood of an outcome belonging to a particular category. The dependent variable is usually binary (e.g., yes/no, success/failure), but it can also be multinomial (more than two categories) or ordinal (ordered categories).

Why is regression analysis superior to chi-square and correlation?
1. Prediction capability: Regression analysis allows prediction, estimating the value of the dependent variable from the values of the independent variables.

2. Handling both categorical and numerical variables
3. Control of confounding variables: Regression analysis enables researchers to control for the effects of confounding variables by including them as independent variables in the model.

Confounding variables are factors that are associated with both the independent variable(s) and the dependent variable in a study. Age is frequently a confounding variable in health studies. Example: when studying the association between a specific medication and heart disease risk, age must be considered as a confounding variable because older individuals are more likely to have both a higher heart disease risk and higher medication usage.

Regression equation: Y = Beta0 + Beta1*X, where Y = dependent variable and X = independent variable.
Beta 0 (constant/intercept) = the value of Y when X is zero. It shows how much the DV is if the IV is 0.
Beta 0 formula = Y-bar - Beta1 * X-bar

Beta 1 (regression coefficient/slope): It measures the amount of change in the DV (Y) for a one-unit change in the IV (X). It represents the relationship between IV and DV.
Beta1 = [∑xy - (∑x * ∑y)/n] / [∑x² - (∑x)²/n]

Example 1. The height and weight of 4 individuals are presented in the following table. Let us predict how much the weight (DV) of an individual could be if his height (IV) is 80 inches.

Weight in pounds (y)   Height in inches (x)   y²            x²            xy
240                    73                     57600         5329          17520
210                    70                     44100         4900          14700
180                    69                     32400         4761          12420
160                    68                     25600         4624          10880
∑y = 790               ∑x = 280               ∑y² = 159700  ∑x² = 19614   ∑xy = 55520

Beta 0 = Y-bar - Beta1 * X-bar = 197.5 - 15.7 * 70 = -901.5.
Interpretation of Beta 0: if height is 0 the predicted weight is -901.5, a value that cannot exist. Because a height of 0 lies far outside the observed data, the intercept has no practical meaning here.

Beta1 = [∑xy - (∑x * ∑y)/n] / [∑x² - (∑x)²/n]
Numerator: 55520 - (280 * 790)/4 = 55520 - 55300 = 220
Denominator: 19614 - (280²/4) = 19614 - 19600 = 14
Beta1 = 220/14 = 15.7
Interpretation of Beta 1: for every one-unit (inch) change in height there will be a 15.7-unit (pound) change in weight.

Regression equation: Y = Beta0 + Beta1*X = -901.5 + 15.7 * 80 = 354.5
Interpretation of the regression result: based on the distribution of these data, if height is 80 inches the predicted weight is about 354 pounds. Note that 80 inches lies outside the observed heights (68 to 73 inches), so this prediction is an extrapolation.
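
The regression coefficients and the prediction can be reproduced with the same sums. This is a sketch; the function name `fit_line` is illustrative. Without rounding Beta1 to 15.7 first, the intercept comes out as -902.5 and the prediction as about 354.6 pounds, which agrees with the slide result to the nearest pound or so.

```python
def fit_line(x, y):
    """Least-squares intercept (beta0) and slope (beta1) for y = beta0 + beta1 * x."""
    n = len(x)
    numerator = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    denominator = sum(a * a for a in x) - sum(x) ** 2 / n
    beta1 = numerator / denominator
    beta0 = sum(y) / n - beta1 * sum(x) / n
    return beta0, beta1

height = [73, 70, 69, 68]      # x (IV), inches
weight = [240, 210, 180, 160]  # y (DV), pounds

beta0, beta1 = fit_line(height, weight)
predicted_weight = beta0 + beta1 * 80  # extrapolation beyond the observed heights

print(round(beta0, 1), round(beta1, 2), round(predicted_weight, 1))
```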

Chapter Ten: Estimation
Estimation is a procedure for finding the value of a parameter based on the value of a statistic. Various techniques are available for different situations; we shall limit our discussion to two types of estimation:
Point estimation
Interval estimation

Point Estimation: Point estimation occurs when we estimate that the unknown parameter is equal to the calculated statistic, e.g. μ = $\bar{x}$, p = $\hat{p}$ or σ = s. Remember that a statistic is a sample-based summary measure (e.g. $\bar{x}$) and a parameter is a population-based summary measure (e.g. μ).

Interval estimation: Interval estimation occurs when we estimate that the parameter lies within an interval. This interval is called the confidence interval. The likelihood that the confidence interval includes the parameter is called the confidence level. For example, a 95% confidence level means that about 95% of the intervals constructed this way from repeated samples would include the parameter.

Estimation of a single population mean (μ)
Example-1: The mean reading speed of a random sample of 81 Modern University students is 325 words per minute. Find the mean reading speed of all Modern University students (μ) if it is known that the standard deviation for all Modern University students is 45 words per minute.

Solution
Point estimation: μ = $\bar{x}$. As the mean reading speed of the sample is 325 words per minute, the estimated mean reading speed of all Modern University students is also 325 words per minute.
Interval estimation for μ: μ = $\bar{x}$ ± Z*SE($\bar{x}$), with Z = 1.96
SE($\bar{x}$) = σ/√n = 45/√81 = 5, so 1.96 * 5 = 9.8
325 ± 9.8 = 315.2 to 334.8 words/minute
This means that if 100 samples were selected from the university students, the intervals from about 95 of them would include the population mean.
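
The interval for the reading-speed example can be computed with a small helper. This is a sketch; the function name `mean_ci` is illustrative, and z = 1.96 corresponds to the 95% confidence level.

```python
import math

def mean_ci(sample_mean, sd, n, z=1.96):
    """Confidence interval for a population mean: mean ± z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_ci(sample_mean=325, sd=45, n=81)
print(round(low, 1), round(high, 1))
```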

Estimation of population mean differences (μ1 - μ2)
Example-2: A random sample of 50 non-smokers has a mean lifetime of 76 years with a standard deviation of 8 years, and a random sample of 65 smokers has a mean lifetime of 68 years with a standard deviation of 9 years.
A) What is the point estimate for the difference of the population means?
B) Find a 95% C.I. for the difference in mean lifetime between non-smokers and smokers.

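Solution
Point estimation of μ1 - μ2: μ1 - μ2 = $\bar{x}_1 - \bar{x}_2$. As the mean difference in lifetime in the sample is 76 - 68 = 8 years, the estimated mean difference in the population is also 8 years.
Interval estimation of μ1 - μ2: μ1 - μ2 = ($\bar{x}_1 - \bar{x}_2$) ± 1.96*SE($\bar{x}_1 - \bar{x}_2$)
SE($\bar{x}_1 - \bar{x}_2$) = $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{8^2}{50} + \frac{9^2}{65}} = \sqrt{1.28 + 1.25} = 1.59$
1.96 * 1.59 = 3.1, so 8 ± 3.1 = 4.9 to 11.1 years
So the population mean lifetime difference between the two groups will lie in the range from about 5 to 11 years.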
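
The interval for the difference in mean lifetime can be checked as follows. This is a sketch; the function name `mean_diff_ci` is illustrative. With the figures given, the standard error is about 1.59 and the 95% interval runs from roughly 4.9 to 11.1 years.

```python
import math

def mean_diff_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Confidence interval for the difference between two population means.
    Returns (lower limit, upper limit, standard error)."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se, se

# Non-smokers: mean 76, SD 8, n 50; smokers: mean 68, SD 9, n 65
low, high, se = mean_diff_ci(76, 8, 50, 68, 9, 65)
print(round(se, 2), round(low, 1), round(high, 1))
```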

Estimation of a single population proportion (p)
Example: An epidemiologist is worried about the ever-increasing trend of malaria in a certain locality and wants to estimate the proportion of persons infected during the peak malaria transmission period. If he takes a random sample of 150 persons in that locality during the peak transmission period and finds that 60 of them are positive for malaria, find a) the point estimate of p and b) the 95% CI for p.

Solution
Point estimation of p: p = $\hat{p}$ = 60/150 = 40%. That is, the estimated proportion of malaria-positive people in the population is 40%.
Interval estimation of p: p = $\hat{p}$ ± 1.96*SE($\hat{p}$)
SE($\hat{p}$) = $\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.4 \times 0.6}{150}} = 0.04$
1.96 * 0.04 = 0.078, i.e. 7.8%
40% ± 7.8% = 32.2% to 47.8%
So the proportion of malaria-positive individuals in the population will lie between 32.2% and 47.8%.

Estimation of population proportion differences (p1 - p2)
Example: Two groups each consist of 100 patients who have leukemia. A new drug is given to the first group but not to the second (the control group). It is found that 75 people in the first group have remission for 2 years, but only 60 in the second group. Find 95% confidence limits for the difference in the proportion of all patients with leukemia who have remission for 2 years.

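Solution
Point estimation of p1 - p2: p1 - p2 = $\hat{p}_1 - \hat{p}_2$ = 75% - 60% = 15%. That is, the proportion difference between the two groups is 15%.
Interval estimation of p1 - p2: p1 - p2 = ($\hat{p}_1 - \hat{p}_2$) ± 1.96*SE($\hat{p}_1 - \hat{p}_2$)
SE($\hat{p}_1 - \hat{p}_2$) = $\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = \sqrt{\frac{0.75 \times 0.25}{100} + \frac{0.60 \times 0.40}{100}} = 0.065$, i.e. 6.5%
1.96 * 6.5% = 12.7%
So 15% ± 12.7% = 2.3% to 27.7%
So the population proportion difference will lie somewhere between 2.3% and 27.7%.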
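
Both proportion intervals (the malaria example and the leukemia example) can be reproduced with two small helpers. This is a sketch; the function names are illustrative, and z = 1.96 corresponds to the 95% level. Carrying full precision, the leukemia interval comes out at roughly 2.2% to 27.8%, which matches the worked figures up to rounding of the standard error.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Confidence interval for a single population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def proportion_diff_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for the difference between two population proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Malaria: 60 positive out of 150
malaria_low, malaria_high = proportion_ci(60, 150)
# Leukemia: 75 of 100 in remission on the new drug, 60 of 100 in the control group
diff_low, diff_high = proportion_diff_ci(75, 100, 60, 100)

print(round(100 * malaria_low, 1), round(100 * malaria_high, 1))
print(round(100 * diff_low, 1), round(100 * diff_high, 1))
```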

Questions ?