This presentation covers both elementary and advanced biostatistics, including descriptive and inferential analysis.
Size: 8.44 MB
Language: en
Added: Sep 29, 2024
Slides: 178 pages
Slide Content
Abdiweli Mohamed Abdi Biostatistics
Table of contents
Section I
Chapter 1: Introduction to Biostatistics
Chapter 2: Measures of location
Chapter 3: Measures of dispersion
Chapter 4: Collection and organization of data
Chapter 5: Visualization and presentation of data
Chapter 6: Probability and Normal distribution of data
Section II
Chapter 7: Hypothesis and significance testing
Chapter 8: Comparing the significance of two sample and three sample means (z-test, t-test and ANOVA)
Chapter 9: Association, correlation and regression
Chapter 10: Estimation
Chapter One: Introduction to Biostatistics
Statistics is a branch of mathematics used for the collection, analysis and interpretation of data.
Biostatistics is a branch of statistics used for the collection, analysis and interpretation of biological data.
Thursday, September 26, 2024
Statistics
Use in health issues: Biostatistics
Use in agricultural sector: Agri-statistics
Use in business administration: Business statistics
Use in industrial sector: Industrial statistics
Use in insurance: Actuarial statistics
Use in economic sector: Economic statistics
Types of Biostatistics
Descriptive:
  Measures of location: measures of central tendency, measures of other location
  Measures of dispersion: range, variance, standard deviation, coefficient of variation
Inferential:
  Estimation: point estimation, interval estimation
  Hypothesis testing: z-test, t-test, ANOVA, $\chi^2$ test
  Correlation
  Regression
Descriptive Statistics
Describes the characteristics of data from a sample. Ex: mean, standard deviation, frequency, and percentage.
Inferential Statistics
Uses sample results to make statements about the population. Ex: the prevalence of malaria among a sample of 150 pregnant women is 40%. Can we estimate the prevalence of malaria in the population?
Why do medicine and health science students need to learn biostatistics?
To be able to: (1) conduct research, (2) identify health problems, (3) monitor and evaluate health programs.
Variables
A variable is any characteristic that differs between individuals, times or places.
Examples of variables: (1) No. of patients (2) Height (3) Sex (4) Educational level
Types of Statistical Variables
1. Quantitative/Numerical Variable: a characteristic that can be measured in numbers. Examples: (i) Family size (ii) No. of patients (iii) Weight (iv) Height (v) Age
Types of Quantitative Variable
(a) Discrete variables: quantitative variables that take no decimal values, so there are gaps between the possible numbers. Examples: family size, number of patients, number of students, parity, gravidity.
(b) Continuous variables: quantitative variables that can take decimal values, so there are no gaps between the possible numbers. Examples: height, weight, income, blood sugar level, creatinine level.
2. Qualitative/Categorical Variable: a characteristic whose values are divided into categories, not numbers. Examples: blood type, nationality, students' grades, educational level, etc.
Scales of Measurement
Nominal scale: names only; no order or rank is involved. E.g. sex, blood type, institutional departments, nationality.
Ordinal scale: names with an order or rank. E.g. educational level, military rank, students' grades.
Interval scale: differences between two numbers are meaningful, but 0 does not imply absence of the characteristic. E.g. temperature and pH.
Ratio scale: 0 implies absence of the characteristic, so both the interval and the ratio between two numbers are meaningful. E.g. height, weight, age, income.
Variables: types and scales of measurement
Qualitative variables: nominal or ordinal scale
Quantitative variables: interval or ratio scale
Population
A population is the largest collection of objects (elements or individuals) about which we want to draw conclusions. Populations may be finite or infinite.
Example: if we want to study the socio-demographic characteristics of students in a class, then our population consists of all the students in that class.
Population size (N): the number of elements in the population is called the population size and is denoted by N. Ex: 100 students in a class.
Sample: a sample is the part of a population from which we collect the data. Ex: 30 students out of the 100 students in a class.
Population: described by a parameter
Sample: described by a statistic
Common statistical symbols
Title                           Symbol
Sample mean                     $\bar{x}$
Population mean                 $\mu$
Sample standard deviation       $s$
Population standard deviation   $\sigma$
Sample variance                 $s^2$
Population variance             $\sigma^2$
Summation                       $\sum$
Correlation coefficient         $r$
Coefficient of determination    $r^2$
Degree of freedom               df
Chi-square value                $\chi^2$
Sample proportion               $p$
Population proportion           $\Pi$
Null hypothesis                 $H_0$
Alternative hypothesis          $H_1$ or $H_A$
Sample size                     $n$
Type I error                    $\alpha$ error
Type II error                   $\beta$ error
Power of the test               $1 - \beta$
Chapter Two: Measures of Central Tendency and Measures of Other Location
A measure of central tendency is a single value around the center of the data that is used to represent the entire data set. In short, a measure of central tendency conveys a single piece of information about the entire data set.
Measures of central tendency are not calculated from qualitative/categorical data.
Measures of central tendency include: mean (average), median, mode.
Mean: the average. Median: the middle number, or the average of the two middle numbers. Mode: the number that occurs most often.
Mean
The mean is the average of the data set. There are four types of mean: arithmetic mean, harmonic mean, geometric mean, weighted mean.
Arithmetic mean
The arithmetic mean is the most familiar measure of central tendency; it is what is usually meant by "average" or "mean". It is denoted by the symbol $\bar{x}$ (read as "x-bar").
Arithmetic mean formula: the sum of all observations divided by the total number of observations.
$\bar{x} = \frac{\sum x}{n}$, where $\sum x$ = sum of all observations and $n$ = total number of observations.
Example-1
Suppose the pulse rates of 10 individuals were recorded as: 69, 70, 71, 71, 72, 72, 72, 75, 76, 74. Find the mean.
Solution: $\bar{x} = \frac{\sum x}{n} = \frac{722}{10} = 72.2$ beats/minute
Example-2
The ages of 12 selected school and university students were 19, 18, 14, 13, 22, 25, 13, 22, 12, 18, 14, 16. What is the mean age of the selected students?
Solution: $\bar{x} = \frac{\sum x}{n} = \frac{206}{12} \approx 17.17$ years
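The two examples above can be checked with a few lines of Python. This is only a minimal sketch of the formula "sum of all observations divided by the number of observations"; the function and variable names are illustrative and not part of the slides.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

# Pulse rates (beats/minute) from Example-1
pulse_rates = [69, 70, 71, 71, 72, 72, 72, 75, 76, 74]
# Ages (years) from Example-2
ages = [19, 18, 14, 13, 22, 25, 13, 22, 12, 18, 14, 16]

print(sum(pulse_rates), arithmetic_mean(pulse_rates))   # 722 72.2
print(sum(ages), round(arithmetic_mean(ages), 2))       # 206 17.17
```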
Advantages of mean
a) Easy to compute. b) Takes all data values into account. c) Reliable. d) Can be calculated even if some values are zero or negative. e) The data do not need to be arranged in order.
Disadvantages of mean
a) Highly affected by extreme values. b) Cannot be calculated for qualitative/categorical data.
Median
In an ordered array, the median is the "middle" number. If n is odd, the median is the middle number. If n is even, the median is the average of the 2 middle numbers. The median is not affected by extreme values.
Procedure to find the median for raw data
i. Arrange the data in order.
ii. Find the middle value. For an odd number of values, the middle item is item $(n+1)/2$. For an even number of values, the 1st middle item is item $n/2$, the 2nd middle item is item $n/2 + 1$, and the median is the average of the 1st and 2nd middle values.
Example: Data: 4 3 7 4 6
1. Arranged in ascending order: 3 4 4 6 7
2. Since n = 5 is odd, the middle item is $(n+1)/2 = (5+1)/2 = 3$, the 3rd item. The value of the 3rd item is 4, so Median = 4.
Example: x: 4 3 7 4 6 9
1. Arranged in ascending order: x: 3 4 4 6 7 9
2. Since n = 6 is even, the 1st middle item is $6/2 = 3$, the 3rd item, and the 2nd middle item is $6/2 + 1 = 4$, the 4th item. The values of the 3rd and 4th items are 4 and 6, so Median = (4+6)/2 = 5.
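The median procedure (order the data, then pick the middle item or average the two middle items) can be sketched as follows. The function name is illustrative; Python lists are 0-based, so the 1-based item numbers from the slides are shifted by one.

```python
def median(values):
    """Median of raw data using the odd/even middle-item rule."""
    ordered = sorted(values)              # step i: arrange in order
    n = len(ordered)
    if n % 2 == 1:
        # odd n: item (n+1)/2, converted to a 0-based index
        return ordered[(n + 1) // 2 - 1]
    # even n: average of items n/2 and n/2 + 1 (1-based)
    first_middle = ordered[n // 2 - 1]
    second_middle = ordered[n // 2]
    return (first_middle + second_middle) / 2

print(median([4, 3, 7, 4, 6]))      # 4
print(median([4, 3, 7, 4, 6, 9]))   # 5.0
```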
Advantages of median
A) Easy to compute. B) Not influenced by extreme values.
Disadvantages of median
Difficult to rank a large number of data values.
Mode
A measure of central tendency: the value that occurs most often. It is not affected by extreme values. There may be no mode, and there may be several modes.
Example: calculate the mode of the data set 2, 3, 4, 3, 4, 5, 4.
Solution: 4 occurs three times, more often than any other value, so the mode is 4.
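A small sketch of finding the mode by counting how often each value occurs. It returns a list because there may be several modes, and it treats a data set in which every value occurs exactly once as having no mode; that convention, like the function name, is an assumption made for this illustration. The input is assumed to be non-empty.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often (empty list if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 4, 3, 4, 5, 4]))   # [4]
print(modes([1, 1, 2, 2, 3]))         # [1, 2]  (several modes)
print(modes([1, 2, 3]))               # []      (no mode)
```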
Advantages of Mode
A) Easy to locate and understand. B) Not influenced by extreme values. C) Is an actual value of the data.
Disadvantages of Mode
a) There is not always just one mode. b) It does not depend on all observations of the data set.
Measures of Other Location
Percentiles
Percentiles are positional measures that indicate what percent of the data set has a value less than a specified value when the data is divided into a hundred parts. Percentiles are not the same as percentages.
Position of $P_r = \frac{r(n+1)}{100}$, where r is the given percentile and n is the sample size.
Deciles
Deciles are another positional measure; they indicate how much of the data set has a value less than a specified value when the data is divided into ten parts.
Position of $D_r = \frac{r(n+1)}{10}$, where r is the given decile and n is the sample size.
Quartiles
Quartiles are another positional measure; they indicate how much of the data set has a value less than a specified value when the data is divided into four parts.
Position of $Q_r = \frac{r(n+1)}{4}$, where r is the given quartile (r=1 for Q1, r=2 for Q2 and r=3 for Q3) and n is the sample size.
Example
Calculate the 70th percentile, 6th decile and Q3 of the following age data: 28, 17, 12, 25, 26, 19, 13, 27, 21, 16
Percentiles
Given n = 10 and r = 70 (70th percentile).
First arrange the data in ascending order: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Position of $P_{70} = \frac{r(n+1)}{100} = \frac{70(10+1)}{100} = 7.7$, the 7.7th item.
The 7.7th item lies between the 7th value (25) and the 8th value (26). To find the exact value for a fractional position we use:
value = decimal part * (upper value - selected value) + selected value
$P_{70} = 0.7 \times (26 - 25) + 25 = 25.7$
$P_{70} = 25.7$ means that 70% of the values lie below 25.7 and 30% of the values lie above 25.7.
Deciles
Data ordered: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Question: find the 6th decile. Given n = 10 and r = 6.
Solution: position of $D_6 = \frac{r(n+1)}{10} = \frac{6(10+1)}{10} = 6.6$, the 6.6th item.
The 6.6th item lies between the 6th value (21) and the 7th value (25).
$D_6 = 0.6 \times (25 - 21) + 21 = 23.4$
Thus the 6th decile is 23.4. This means that 6 deciles (60%) of the data lie below 23.4.
Quartiles
Data ordered: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Question: find the 3rd quartile. Given n = 10 and r = 3.
Solution: position of $Q_3 = \frac{r(n+1)}{4} = \frac{3(10+1)}{4} = 8.25$, the 8.25th item.
The 8.25th item lies between the 8th value (26) and the 9th value (27).
$Q_3 = 0.25 \times (27 - 26) + 26 = 26.25$
Thus $Q_3 = 26.25$. This means that 3 quartiles (75%) of the data lie below 26.25.
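The percentile, decile and quartile calculations share one pattern: compute the position r(n+1)/parts, then interpolate between the two neighbouring ordered values. The sketch below follows that pattern; the function name and the choice to return the smallest or largest value when the position falls outside the data are assumptions made for illustration, not something stated in the slides.

```python
def positional_measure(values, r, parts):
    """Value at position r*(n+1)/parts with linear interpolation.

    parts = 100 for percentiles, 10 for deciles, 4 for quartiles.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = r * (n + 1) / parts          # 1-based position, may be fractional
    whole = int(position)                   # the "selected" item number
    decimal = position - whole              # the decimal part
    if whole < 1:                           # assumption: clamp to the smallest value
        return ordered[0]
    if whole >= n:                          # assumption: clamp to the largest value
        return ordered[-1]
    selected = ordered[whole - 1]
    upper = ordered[whole]
    return decimal * (upper - selected) + selected

ages = [28, 17, 12, 25, 26, 19, 13, 27, 21, 16]
print(round(positional_measure(ages, 70, 100), 2))   # 25.7  (70th percentile)
print(round(positional_measure(ages, 6, 10), 2))     # 23.4  (6th decile)
print(round(positional_measure(ages, 3, 4), 2))      # 26.25 (Q3)
```

Statistical packages use several slightly different position rules, so their results may differ a little from this (n+1)-based method.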
Chapter Three: Measures of dispersion
Measures of dispersion, or measures of variation, measure the variability that a set of observations exhibits. They measure how far the values spread out from each other. The variation is small when the values are close together, and there is no dispersion (variation) if all the values are the same.
There are several measures of dispersion, including: range, variance, standard deviation, coefficient of variation.
The range
The range is the difference between the largest value (maximum) and the smallest value (minimum): Range (R) = Max - Min
Example: find the range of the sample values 26, 25, 35, 27, 29.
Solution: Max = 35, Min = 25, Range = 35 - 25 = 10
Notes: The unit of the range is the same as the unit of the data. The range is a poor measure of dispersion because it takes into account only two values (Max and Min).
The Variance
The variance is one of the most important measures of dispersion. It uses the mean as its point of reference.
The sample variance is denoted by $S^2$: $S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The population variance is denoted by $\sigma^2$: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
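Example
We want to compute the sample variance of the following sampled health care workers' weekly income values: 10, 21, 33, 53, 54
Solution: n = 5
$\bar{x} = \frac{\sum x}{n} = \frac{10+21+33+53+54}{5} = \frac{171}{5} = 34.2$ USD/week
$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{585.64 + 174.24 + 1.44 + 353.44 + 392.04}{4} = \frac{1506.8}{4} = 376.7$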
Standard Deviation
The standard deviation is another measure of dispersion. It is the square root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $S = \sqrt{S^2}$
Example
We want to compute the sample standard deviation of the following sampled health care workers' weekly income values: 10, 21, 33, 53, 54
Solution: n = 5, $S^2 = 376.7$, $S = \sqrt{S^2} = \sqrt{376.7} \approx 19.41$
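As a check on the income example, the sample variance (dividing by n - 1) and the standard deviation can be computed step by step. This is a minimal sketch; the function name is illustrative.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

incomes = [10, 21, 33, 53, 54]           # USD/week
variance = sample_variance(incomes)
std_dev = sqrt(variance)                 # standard deviation = square root of variance

print(round(variance, 1))   # 376.7
print(round(std_dev, 2))    # 19.41
```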
Coefficient of variation
The variance and standard deviation are useful as measures of variation of the values of a single variable for a single population. If we want to compare the variation of two variables, we cannot use the variance or the standard deviation because: (1) the variables might have different units; (2) the variables might have different means.
We need a measure of relative variation that does not depend on either the units or on how large the values are. This measure is the coefficient of variation (C.V.):
$C.V. = \frac{S}{\bar{x}} \times 100$
Example
Compare the variability of the weights of two groups:
Group       Mean    SD       C.V.
1st group   66 kg   4.5 kg   6.8%
2nd group   36 kg   4.5 kg   12.5%
$C.V_1 = \frac{S_1}{\bar{x}_1} \times 100 = \frac{4.5}{66} \times 100 = 6.8\%$
$C.V_2 = \frac{S_2}{\bar{x}_2} \times 100 = \frac{4.5}{36} \times 100 = 12.5\%$
Since $C.V_2 > C.V_1$, the relative variability of the 2nd group is larger than the relative variability of the 1st group.
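The comparison of the two groups reduces to one line of arithmetic per group. A minimal sketch, with an illustrative function name:

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_group1 = coefficient_of_variation(mean=66, sd=4.5)
cv_group2 = coefficient_of_variation(mean=36, sd=4.5)

print(round(cv_group1, 1))     # 6.8
print(round(cv_group2, 1))     # 12.5
print(cv_group2 > cv_group1)   # True: the 2nd group is relatively more variable
```

Both groups have the same standard deviation, yet the group with the smaller mean shows the larger relative variability, which is exactly what the C.V. is designed to reveal.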
Exercise 1
A student was asked for the results of the 5 subjects he/she took last semester, and the data were presented as follows: 80, 71, 63, 53, 54
Calculate: 1] Range 2] Variance 3] Standard deviation
Exercise 2
Let us compare the exam results of 2 groups.
The 1st group: mean exam result = 75, standard deviation = 7.5
The 2nd group: mean exam result = 80, standard deviation = 9
Calculate and compare the variability of results in the 2 groups.
Chapter 4: Collection and Organization of data
Data: raw, unorganized facts that need to be processed. When data is processed to make it useful, it is called information.
Types of data
Primary data: data collected firsthand by the researcher.
Primary data collection methods: interviews; observations; focus group discussions; blood, body fluid, urine and feces samples; imaging (X-ray, US, CT, MRI).
Common primary data collection tools: 1. Questionnaires 2. Google Forms 3. KoboToolbox
Secondary data: data that has been collected by someone else or by an institution.
Secondary data sources: journals, books, magazines, newspapers, libraries, websites, medical records.
Organizing data in an array (ordered array)
A first step in organizing data is the preparation of an ordered array. An ordered array is a listing of the values of the data in order of magnitude, from the smallest value to the largest value.
Ex: the following data on the age of 6 individuals is arranged in an array: 55 46 58 54 52 69
Ascending form: 46 52 54 55 58 69
Descending form: 69 58 55 54 52 46
Frequency Distribution
The most convenient method of organizing data is to construct a frequency distribution. A frequency distribution is the organization of raw data in table form, using classes and frequencies.
Grouped Frequency Distributions
When the range of the data is large, the data must be grouped into classes.
Class boundary: class boundaries close the gap between consecutive classes. The adjustment is (lower limit of a class - upper limit of the previous class) / 2; it is subtracted from the lower limit and added to the upper limit of each class.
The difference between the two boundaries of a class gives the class width. The class width is also called the class size.
Finding class width: Class width = Upper boundary - Lower boundary
Calculating class midpoint or mark: Class midpoint or mark = $\frac{\text{Lower limit} + \text{Upper limit}}{2}$
Example: the following table gives the weekly earnings of 100 employees of a large company. The first column lists the classes, which represent the (quantitative) variable: weekly income.
Weekly income in USD   Number of employees (Freq)
801-1000               9
1001-1200              22
1201-1400              39
1401-1600              15
1601-1800              9
1801-2000              6
Calculate the class boundaries, class widths, and class midpoints for the above data.
Solution:
Adjustment = (lower limit of a class - upper limit of the previous class) / 2 = (1001 - 1000) / 2 = 1 / 2 = 0.5
Lower boundary of the first class = 801 - 0.5 = 800.5
Upper boundary of the first class = 1000 + 0.5 = 1000.5
Width of the first class = 1000.5 - 800.5 = 200
Midpoint of the first class = $\frac{801 + 1000}{2} = 900.5$
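The same calculation can be repeated for every class of the weekly income table. The sketch below assumes the gap between consecutive class limits is 1 USD (as between 1000 and 1001); the function name and the `gap` parameter are illustrative.

```python
def class_summary(lower_limit, upper_limit, gap=1):
    """Boundaries, width and midpoint of one class of a grouped table."""
    adjustment = gap / 2                       # (1001 - 1000) / 2 = 0.5
    lower_boundary = lower_limit - adjustment
    upper_boundary = upper_limit + adjustment
    width = upper_boundary - lower_boundary
    midpoint = (lower_limit + upper_limit) / 2
    return lower_boundary, upper_boundary, width, midpoint

income_classes = [(801, 1000), (1001, 1200), (1201, 1400),
                  (1401, 1600), (1601, 1800), (1801, 2000)]

for lower, upper in income_classes:
    print(f"{lower}-{upper}:", class_summary(lower, upper))
# first line printed: 801-1000: (800.5, 1000.5, 200.0, 900.5)
```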
Constructing Frequency Distribution Tables
Important steps for constructing a frequency distribution table for continuous data:
1. The number of classes depends on the range of the data. Range = largest value - smallest value
2. Number of classes: the number of classes should not be too large or too small. As a general rule, the number of classes should be around $\sqrt{n}$, where n is the number of data values observed.
4. Number of columns: usually there will be two columns in a frequency table: class intervals and frequency.
Example: the following data represents the number of patients admitted by a hospital in 30 days. Construct a frequency distribution table.
Solution: in this data, the minimum value is 5 and the maximum value is 29.
Number of classes = $\sqrt{n} = \sqrt{30} \approx 5$
Range = largest value - smallest value = 29 - 5 = 24
Class width = Range / Number of classes = 24 / 5 = 4.8 ≈ 5
Patients admitted   Frequency
5-9                 3
10-14               6
15-19               8
20-24               8
25-29               5
Total frequency     30
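The class limits in the table can be derived from just three numbers: n = 30, the minimum 5 and the maximum 29. The sketch below assumes whole-number data, rounds the number of classes to the nearest integer and rounds the class width up; these rounding choices and the function name are assumptions that happen to reproduce the table above, not rules stated in the slides.

```python
from math import ceil, sqrt

def build_classes(n, smallest, largest):
    """Class limits for whole-number data using the sqrt(n) rule."""
    num_classes = round(sqrt(n))              # sqrt(30) = 5.48 -> 5 classes
    data_range = largest - smallest           # 29 - 5 = 24
    width = ceil(data_range / num_classes)    # 24 / 5 = 4.8 -> 5
    classes = []
    lower = smallest
    for _ in range(num_classes):
        classes.append((lower, lower + width - 1))
        lower += width
    return classes

classes = build_classes(n=30, smallest=5, largest=29)
print(classes)   # [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]
```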
Example: calculate the class boundaries, relative frequencies and percentages for the table in the previous example.
Cumulative Frequency Distribution
A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.
Example: calculate the cumulative frequencies and cumulative percentages for the table in the previous example.
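One way to work both examples is to start from the class frequencies of the patients-admitted table (3, 6, 8, 8, 5) and derive the other columns from them. This is a sketch with illustrative variable names, not the worked solution from the slides.

```python
from itertools import accumulate

class_labels = ["5-9", "10-14", "15-19", "20-24", "25-29"]
frequencies = [3, 6, 8, 8, 5]
total = sum(frequencies)                                   # 30

relative = [f / total for f in frequencies]                # relative frequency
percentages = [100 * r for r in relative]                  # percentage
cumulative = list(accumulate(frequencies))                 # running total
cumulative_pct = [100 * c / total for c in cumulative]     # cumulative percentage

print("Class  Freq  Rel.freq  %      Cum.freq  Cum.%")
for row in zip(class_labels, frequencies, relative,
               percentages, cumulative, cumulative_pct):
    label, f, r, p, c, cp = row
    print(f"{label:<6} {f:<5} {r:<9.3f} {p:<6.1f} {c:<9} {cp:.1f}")
```

The last cumulative frequency always equals the total frequency, and the last cumulative percentage is always 100.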
Ungrouped frequency distribution of numerical data
Ungrouped data is data that has not been organized into groups. It is also called raw data. Ungrouped data can be either numerical or categorical.
Creating a numerical ungrouped frequency distribution table
Step 1- arrange the data in an ascending array.
Step 2- count the frequency of each value.
Step 3- create a table.
Step 4- insert the data values in the table.
Example: blood pressure readings of 8 individuals: 120, 130, 130, 125, 140, 140, 140, 122. Create a frequency distribution table for this data.
Step 1- arrange the data in an ascending array: 120, 122, 125, 130, 130, 140, 140, 140.
Step 2- count the frequency of each value: 120 (1), 122 (1), 125 (1), 130 (2), 140 (3).
Step 3- create a table.
Step 4- insert the data values in the table.
Blood pressure reading   Frequency
120                      1
122                      1
125                      1
130                      2
140                      3
Total frequency          8
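The four steps map directly onto a few lines of Python: sort, count, then print the table. A minimal sketch using the blood pressure readings from the example; the variable names are illustrative.

```python
from collections import Counter

readings = [120, 130, 130, 125, 140, 140, 140, 122]

ordered = sorted(readings)             # Step 1: ascending array
frequency = Counter(ordered)           # Step 2: count each value
table = sorted(frequency.items())      # Steps 3-4: rows of (value, frequency)

for value, count in table:
    print(value, count)
print("Total frequency", sum(frequency.values()))   # Total frequency 8
```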
Creating a categorical frequency distribution table
Step 1- count the frequency of each category.
Step 2- create a table.
Step 3- insert the data values in the table.
Example of ungrouped categorical data: the blood types of 20 individuals.
Blood types: A, B, O, AB, O, A, B, A, O, B, AB, A, O, B, B, A, O, AB, B, A
Step 1- count the frequency of each category: A = 6 individuals, B = 6 individuals, O = 5 individuals, AB = 3 individuals.
Step 2- create a table.
Step 3- insert the data in the table.
Blood type        Frequency
A                 6
B                 6
O                 5
AB                3
Total frequency   20
Relative Frequency and Percentage Distributions
A relative frequency distribution shows what fractional part of the total frequency belongs to each category. The relative frequency of a category is obtained by dividing the frequency of that category by the sum of all frequencies:
Relative frequency of a category = $\frac{\text{Frequency of that category}}{\text{Sum of all frequencies}}$
The percentage for a category is obtained by multiplying the relative frequency of that category by 100. A percentage distribution lists the percentages for all categories.
Calculating percentage: Percentage = (Relative frequency) × 100
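To illustrate relative frequency and percentage, the sketch below counts the 20 blood types listed in the earlier categorical example directly from the raw list and then divides each count by the total. Variable names are illustrative.

```python
from collections import Counter

blood_types = ["A", "B", "O", "AB", "O", "A", "B", "A", "O", "B",
               "AB", "A", "O", "B", "B", "A", "O", "AB", "B", "A"]

counts = Counter(blood_types)                 # frequency of each category
total = sum(counts.values())                  # sum of all frequencies

relative = {t: c / total for t, c in counts.items()}
percentage = {t: 100 * r for t, r in relative.items()}

for blood_type in ["A", "B", "O", "AB"]:
    print(blood_type, counts[blood_type],
          round(relative[blood_type], 2), round(percentage[blood_type], 1))
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick check on any frequency table.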
Example: determine the relative frequency and percentage distributions for this data.
Chapter Five: Visualization and presentation of data
Techniques of data presentation
Data can be presented in: (1) tabular form, (2) graphical form.
Tabular data presentation
A table contains data in rows and columns.
Types of tables: univariate table, bivariate table, multivariate table.
Univariate Table-2: Age
Age     Frequency   Percentage
21-26   6           30
27-32   6           30
33-38   2           10
39-44   3           15
45-50   3           15
Total   20          100
Bivariate Table-1: Sex and Age
Age     Male   Female   Total
21-26   1      5        6
27-32   3      3        6
33-38   0      2        2
39-44   3      0        3
45-50   1      2        3
Total   8      12       20
Multivariate Table-3: Age, sex and residence
Age     Male            Female          Total
        Urban   Rural   Urban   Rural
21-26   1       2       5       1       9
27-32   3       2       3       2       10
33-38   0       1       2       1       4
39-44   3       2       0       2       7
45-50   1       3       2       1       7
Total   8       10      12      7       37
Graphical presentation of data
Tabulation is an important systematic presentation of data, but patterns in the data are often more easily revealed by diagrams or graphs.
Types of graphical presentation
Data type      Type of graph
Qualitative    Univariate: simple bar, component bar, pie chart, multiple pie chart
Quantitative   Histogram, line graph/chart
Simple bar
A simple bar chart is used for presenting univariate qualitative data. Bar charts have a horizontal axis, called the X-axis, and a vertical axis, called the Y-axis. Categories are placed on the X-axis and percentage or frequency on the Y-axis.
Component bar
To draw a component bar, divide 100% into as many components as there are categories of the variable you want to draw.
Pie chart
A pie chart is a circular statistical graph that divides the data into slices to illustrate the numerical proportion of each category.
Multiple bar chart
A multiple bar chart is a type of bar chart that is used for bivariate qualitative data. Using this data, construct a multiple bar chart.
Sex      Diabetes   No diabetes
Male     3          5
Female   8          4
Total    11         9
Graphs for quantitative variables
Graphs used to present quantitative univariate variables include: histogram, line graph/line chart.
Histogram
The histogram is the most common graph for quantitative variables. It is similar to a bar chart except that there are no gaps between its bars.
Chapter Six: Probability and Normal distribution of data
Probability is the likelihood of occurrence of an event and is measured by the proportion of times the event occurs. An event is denoted by "E", the number of times the event occurs by "n", and the number of all possible outcomes by "N":
$P(E) = \frac{n}{N}$
EXAMPLE 1: A coin is tossed. What is the probability of getting a head?
A coin has two outcomes, head and tail, so the total number of outcomes (N) is 2. There is only one head, so n = 1.
$P(\text{Head}) = \frac{n}{N} = \frac{1}{2} = 0.5$
The probability of getting a head when a coin is tossed is 0.5 or 1/2.
EXAMPLE 2: The OPD attendance of a hospital is shown here.
Disease        Frequency
Diabetes       80
Hypertension   40
Total          120
What is the probability that a randomly selected individual has diabetes?
What is the probability that a randomly selected individual has hypertension?
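Both questions can be answered with the ratio n/N, where n is the frequency of the disease and N is the total attendance. The sketch below uses exact fractions so the results can be read either as fractions or as decimals; the names are illustrative.

```python
from fractions import Fraction

opd_attendance = {"Diabetes": 80, "Hypertension": 40}
total = sum(opd_attendance.values())            # N = 120

def probability(disease):
    """P(E) = n / N for a randomly selected OPD attendee."""
    return Fraction(opd_attendance[disease], total)

p_diabetes = probability("Diabetes")
p_hypertension = probability("Hypertension")

print(p_diabetes, round(float(p_diabetes), 3))           # 2/3 0.667
print(p_hypertension, round(float(p_hypertension), 3))   # 1/3 0.333
```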
Characteristics of Events
Events possess certain characteristics. They can be: mutually exclusive events, mutually non-exclusive events, independent events, dependent events.
Mutually exclusive events
The events of a trial are called mutually exclusive if only one of them can occur in any single trial. This means that the events cannot occur simultaneously: if one event occurs, the other cannot. Example: if a coin is tossed, for any toss (trial) there is only one event (either head or tail).
Mutually non-exclusive events
Events that can occur simultaneously are called mutually non-exclusive. For example, an individual can have only diabetes, only hypertension, or both diabetes and hypertension at the same time.
Example: suppose that in OPD attendance there are two categories, people with diabetes and people with hypertension. However, there are some people who have both diabetes and hypertension. Thus events like diabetes and hypertension are considered mutually non-exclusive events.
Independent events
If A and B are two events of a particular trial and the outcome of event A does not affect and is not affected by the outcome of event B, then A and B are called independent events. For example, if you toss two coins, the outcome of the first toss (head or tail) does not affect and is not affected by the outcome of the second toss.
Dependent events
If the outcome of event A influences the outcome of event B, or B affects A, then event A and event B are considered dependent events. Examples: smoking and lung cancer; driving a car and getting in a traffic accident; robbing a bank and going to jail.
Properties of probability
Probability is expressed as a proportion, so it takes any value from 0 to 1. It can also be shown as a percentage, from 0 to 100%.
A probability of 1 means that the event is certain to occur (e.g. the probability of dying).
A probability of 0 means that the event is certain not to occur (e.g. the probability of not dying).
A probability of 0.5 means that the event is equally likely to occur or not to occur.
The higher the probability value, the higher the chance of occurrence; the smaller the probability value, the lower the chance of occurrence.
The sum of the probabilities of all possible events must be equal to 1 or 100%.
Types of probability
According to the time of occurrence of events, probability is categorized as:
A priori probability: calculated before the occurrence of the event by logically examining existing knowledge. It usually deals with independent events. For example, the probability of getting a head or a tail is 1/2 or 0.5.
A posteriori probability: calculated after the occurrence of the event, that is, it is based on the frequency of occurrence. For example: the number of hypertensive patients in a sample of 100 patients.
Rules of probability
There are two basic rules in probability: the addition rule and the multiplication rule.
Addition Rule
This rule applies to both mutually exclusive and mutually non-exclusive events of a single random variable. The rule is characterized by the term "or" (sometimes ∪, meaning union) between the two events, e.g. P(A or B), also written P(A ∪ B).
For mutually exclusive events: P(A or B) = P(A) + P(B)
For mutually non-exclusive events: P(A or B) = P(A) + P(B) - P(A and B)
Example 1 (mutually exclusive events)
A single 6-sided die is rolled. What is the probability of rolling a 2 or a 5?
Solution: since 2 and 5 are mutually exclusive, P(2 and 5) = 0.
P(2) = 1/6, P(5) = 1/6
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3 = 0.333
Example 2 (mutually exclusive and mutually non-exclusive events)
Suppose patients attending a hospital OPD are categorized as in the following table.
Disease                           No. of patients
Eye disease                       5
Respiratory disease               15
Only diabetes                     90
Only heart disease                30
Both diabetes and heart disease   10
Total                             150
If a person is drawn at random:
a. What is the probability that he/she will have eye disease or respiratory disease?
b. What is the probability that he/she will have diabetes or heart disease?
Solution
a. Eye disease or respiratory disease (mutually exclusive here)
Patients with eye disease = 5
Patients with respiratory disease = 15
Total patients = 150
P(eye disease or respiratory disease) = 5/150 + 15/150 = 20/150 = 0.13
b. Diabetes or heart disease (mutually non-exclusive here)
Patients with diabetes = 90 + 10 = 100
Patients with heart disease = 30 + 10 = 40
Total patients = 150
P(diabetes or heart disease) = P(diabetes) + P(heart disease) - P(diabetes and heart disease)
P(diabetes or heart disease) = 100/150 + 40/150 - 10/150 = 130/150 = 0.87
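The addition rule can be written as a single function in which the overlap P(A and B) defaults to zero for mutually exclusive events. The sketch below reworks the die example and both parts of the OPD example with exact fractions; the function and variable names are illustrative.

```python
from fractions import Fraction

def p_either(p_a, p_b, p_both=Fraction(0)):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B).

    For mutually exclusive events P(A and B) is 0, the default.
    """
    return p_a + p_b - p_both

# Rolling a 2 or a 5 with one die (mutually exclusive)
p_2_or_5 = p_either(Fraction(1, 6), Fraction(1, 6))

total = 150
# a. Eye disease or respiratory disease (mutually exclusive)
p_eye_or_resp = p_either(Fraction(5, total), Fraction(15, total))
# b. Diabetes or heart disease (mutually non-exclusive: 10 patients have both)
p_diab_or_heart = p_either(Fraction(90 + 10, total),
                           Fraction(30 + 10, total),
                           Fraction(10, total))

print(p_2_or_5, round(float(p_2_or_5), 3))                 # 1/3 0.333
print(p_eye_or_resp, round(float(p_eye_or_resp), 2))       # 2/15 0.13
print(p_diab_or_heart, round(float(p_diab_or_heart), 2))   # 13/15 0.87
```

Without subtracting the overlap in part b, the 10 patients with both diseases would be counted twice.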
Normal Distribution of data
In the normal distribution, observations are clustered around the mean. About half of the observations lie above the mean and half below the mean, and the observations are symmetrically distributed on each side of the mean.
Characteristics of the Normal Curve/Distribution
The normal curve is symmetrical and bell shaped.
The curve is highest at the centre and decreases towards zero symmetrically on each side.
Mean, median and mode are all equal.
Mean ± 1SD limits include about 68.3% of all observations.
Mean ± 2SD limits include about 95% of all observations.
Mean ± 3SD limits include about 99.7% of all observations.
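The percentages within 1, 2 and 3 standard deviations of the mean can be verified from the normal distribution's cumulative probabilities. This sketch uses Python's standard-library `statistics.NormalDist` (Python 3.8 or later) with a standard normal curve (mean 0, SD 1); the helper name is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of observations between mean - k*SD and mean + k*SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Mean ± {k}SD: {100 * share_within(k):.1f}%")
# Mean ± 1SD: 68.3%
# Mean ± 2SD: 95.4%
# Mean ± 3SD: 99.7%
```

The exact figure for ± 2SD is slightly above 95%; exactly 95% of observations fall within about 1.96 SD of the mean.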
[Figure: Normal curve]
Skewed Distributions
Distributions that are not symmetric and have a long tail in one direction are called skewed distributions. In a skewed distribution, most values are closer to one end, with relatively few values in the other direction.
Positively Skewed Distributions
If the tail of the distribution extends to the right (positive side), the distribution is called a positively skewed or right-skewed distribution. In right-skewed distributions, the majority of the values lie in the left part of the distribution.
Negatively Skewed Distributions
If the tail of the distribution extends to the left (negative side), the distribution is called a negatively skewed or left-skewed distribution. In left-skewed distributions, the majority of the values lie in the right part of the distribution.
[Figure: Left and right skewed examples]
Section II: Inferential Biostatistics
Inferential Biostatistics
Descriptive statistics remains local to the sample, describing its central tendency and variability, while inferential statistics focuses on making statements about the population.
Statistic vs. Parameter
Statistic (sample value): mean ($\bar{x}$), variance ($s^2$), standard deviation ($s$), proportion ($p$)
Parameter (population value): mean ($\mu$), variance ($\sigma^2$), standard deviation ($\sigma$), proportion ($\Pi$)
Chapter Seven: Hypothesis and significance testing
A test of significance determines whether a result is statistically significant or whether it could have occurred by chance.
Hypothesis
A hypothesis is the researcher's assumed answer about the relationship between two variables or the significance of a test result. There are two statistical hypotheses: the null hypothesis and the alternative hypothesis.
Null hypothesis
It states that there is no real difference between the statistic and the parameter, say sample mean = population mean. Any observed difference is just due to chance. The null hypothesis is denoted by the symbol $H_0$.
Alternative hypothesis
It states that there is a real difference between the statistic and the parameter, say sample mean ≠ population mean. The alternative hypothesis is denoted by the symbol $H_1$ or $H_a$.
$H_0: \mu_1 = \mu_2$
$H_a: \mu_1 \neq \mu_2$
When the null hypothesis is rejected, the alternative hypothesis is accepted.
P-Value
The p-value indicates the amount of support possessed by the null hypothesis. The p-value lies between 0% and 100%: as it approaches 0, the support for $H_0$ becomes weaker and weaker, while as it approaches 100%, the support becomes stronger and stronger.
Level of significance
In order to decide whether the support is strong or weak, we need some cut-off value or level. This cut-off value is known as the level of significance, denoted by α.
Internationally accepted levels of significance: 10% (or 0.1), 5% (or 0.05), 1% (or 0.01). The most commonly used is 5% (or 0.05).
The zone of null hypothesis acceptance
1) If the calculated value is less than the tabulated value, the null hypothesis is accepted and the alternative hypothesis is rejected. (Calculation based)
2) If the p-value is greater than or equal to the chosen significance level (p-value ≥ 0.05), the null hypothesis is accepted and the alternative hypothesis is rejected. (Computer based)
The zone of null hypothesis rejection
1) If the calculated value is greater than the tabulated value, the null hypothesis is rejected and the alternative hypothesis is accepted. (Calculation based)
2) If the p-value is less than the chosen significance level (p-value < 0.05, the most commonly used level), the null hypothesis is rejected and the alternative hypothesis is accepted. (Computer based)
One-Tailed and Two-Tailed Tests The null hypothesis can be tested using either a one-tailed or a two-tailed test.
One-Tailed Test A test in which the alternative hypothesis favors only one direction is called a one-tailed test. Example: suppose a study compares two drugs, Drug A and Drug B.
The null hypothesis ($H_0$) is that Drug A is not more effective than Drug B, and the alternative hypothesis ($H_a$) is that Drug A is more effective than Drug B. In short: $H_0$: Drug A ≤ Drug B; $H_a$: Drug A > Drug B.
Two-Tailed Test In a two-tailed test, deviations in both directions are considered. In the previous example comparing the effectiveness of Drug A and Drug B, the two-tailed hypotheses are: $H_0$ = Drug A and Drug B have the same effect; $H_a$ = Drug A and Drug B do not have the same effect. In short: $H_0$: Drug A = Drug B; $H_a$: Drug A ≠ Drug B.
Steps for Hypothesis Testing
A) Describe the given data.
B) State the assumptions (an assumption is a belief that is taken as given rather than examined).
C) State the null and alternative hypotheses.
D) State the level of significance.
E) Choose the test statistic (z-test, t-test, ANOVA, χ²).
F) Compute the test statistic.
G) Look up the tabulated test statistic corresponding to the significance level and degrees of freedom, and compare it with the calculated test statistic (or compare the p-value with the significance level). If the calculated test statistic > the tabulated test statistic, we reject the null hypothesis; otherwise we do not reject (accept) the null hypothesis.
H) Decision: reject or accept the null hypothesis.
I) Conclusion: state the conclusion in the language of the accepted hypothesis.
Chapter Eight Testing the significance of the difference between two and three sample means
Testing the significance of the difference between two sample means When we want to determine whether the difference between two group means is significant (large enough) or insignificant (due only to chance), we use a z-test or a t-test. In the worked examples that follow, the z-test is chosen when the population standard deviation σ is known or the sample is large (n > 30), and the t-test when σ is unknown and the sample is small.
Z-test (normal test)
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Tabulated z values
Significance level (α)   Two-tailed, 1-(α/2)   One-tailed (>), 1-α   One-tailed (<), 1-α
10% (or 0.1)             1.64                  1.28                  -1.28
5% (or 0.05)             1.96                  1.64                  -1.64
1% (or 0.01)             2.58                  2.33                  -2.33
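The tabulated z values can be reproduced from the standard normal distribution. The sketch below is illustrative only: the helper name `critical_z` is an assumption, not something from the slides, and it relies on Python's built-in `statistics.NormalDist`.

```python
from statistics import NormalDist


def critical_z(alpha: float, two_tailed: bool = True) -> float:
    """Positive critical z value for significance level alpha."""
    # A two-tailed test splits alpha equally between both tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


for alpha in (0.10, 0.05, 0.01):
    two = critical_z(alpha)
    one = critical_z(alpha, two_tailed=False)
    print(f"alpha={alpha:.2f}  two-tailed={two:.2f}  one-tailed=+/-{one:.2f}")
```

For a left-tailed test the same one-tailed value is used with a negative sign.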
Example The mean birth weight of babies born in a large community over several years was 2470 g, with a standard deviation of 230 g. Following implementation of an ANC program, the mean birth weight obtained from a sample of 40 babies was 2560 g, with a standard deviation of 250 g. Does the ANC program have any impact on the birth weight of newborn babies?
Solution
Data: $\mu = 2470$ g, $\bar{x} = 2560$ g, $\sigma = 230$ g, $s = 250$ g, $n = 40$
Assumptions: a) birth weight in the baby population is normally distributed; b) the sample was selected at random.
Hypotheses: $H_0: \mu = 2470$ g (the mean birth weight of the population does not change after ANC). $H_a: \mu \neq 2470$ g (the mean birth weight of the population changes after ANC).
Level of significance (α): 5% (0.05)
Test statistic: since σ is known, we use the z-test.
Compute the test statistic: $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \dfrac{2560 - 2470}{230/\sqrt{40}} = \dfrac{90}{36.37} = 2.47$
Compare the calculated z with the tabulated z: the tabulated two-tailed z at the 5% level of significance is 1.96.
Decision: we reject the null hypothesis since the calculated z (2.47) > the tabulated z (1.96).
Conclusion: the mean birth weight of babies has increased after implementation of the ANC program.
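The birth-weight calculation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_sample_z` is hypothetical, and the two-tailed p-value is an extra that the slide does not compute.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Return (z, two-tailed p-value) for a one-sample z-test."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Values from the birth-weight example: x-bar = 2560, mu = 2470, sigma = 230, n = 40
z, p = one_sample_z(2560, 2470, 230, 40)
print(f"z = {z:.2f}, p = {p:.4f}")
print("Reject H0" if abs(z) > 1.96 else "Do not reject H0")
```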
Example-2 The hemoglobin (Hb) level of children was measured in 143 girls and 127 boys, with known population SD. The results are shown below. Girls have a higher Hb level than boys on average, so the question is whether the observed difference is significant or not.
        Girls   Boys
Mean    11.2    11.0
SD      1.4     1.3
n       143     127
Solution
Data: $\bar{x}_1 = 11.2$, $\bar{x}_2 = 11.0$, $s_1 = 1.4$, $s_2 = 1.3$, $n_1 = 143$, $n_2 = 127$
Assumptions: a) Hb level in the population is normally distributed; b) the sample was selected at random.
Hypotheses: $H_0: \mu_1 = \mu_2$ (any observed difference is due to chance alone). $H_a: \mu_1 \neq \mu_2$ (the mean Hb levels of girls and boys differ significantly).
Level of significance (α): 5% (0.05)
Test statistic: since n > 30, we use the z-test.
Compute the test statistic: $z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{11.2 - 11.0}{\sqrt{\dfrac{1.96}{143} + \dfrac{1.69}{127}}} = \dfrac{0.2}{0.1644} = 1.22$
Compare the calculated z with the tabulated z: the tabulated two-tailed z at the 5% level of significance is 1.96.
Decision: we do not reject (accept) the null hypothesis since the calculated z (1.22) < the tabulated z (1.96).
Conclusion: the mean Hb levels of girls and boys are not significantly different.
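A short Python sketch of the two-sample z calculation for the Hb example follows. The function name `two_sample_z` is an assumption for illustration. Note that the standard error uses the variances ($s^2/n$); putting the standard deviations themselves over n would give a different, incorrect value.

```python
from math import sqrt


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent sample means."""
    # Standard error of the difference: square root of the summed variances of the means.
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se


# Girls: mean 11.2, SD 1.4, n 143; boys: mean 11.0, SD 1.3, n 127
z = two_sample_z(11.2, 1.4, 143, 11.0, 1.3, 127)
print(f"z = {z:.2f}")
print("Reject H0" if abs(z) > 1.96 else "Do not reject H0")
```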
t Test
The t test is used to compare a sample mean with a hypothesized value as well as to compare the means of two samples. Types of t test: a) One sample t test b) Independent sample t test c) Paired sample t test
One sample t test The one sample t test is used to test whether a population mean is significantly different from some hypothesized value.
$t = \dfrac{\bar{x} - m}{s/\sqrt{n}}$, where $\bar{x}$ is the sample mean, $m$ is the hypothesized value, $s$ is the sample SD and $n$ is the sample size.
Example: A professor of statistics wants to know whether his introductory statistics class has a good grasp of basic math. Six students were chosen at random from the class and given a math proficiency test. The professor wants the class to be able to score above 70 on the test. The six students get scores of 63, 93, 75, 68, 83, and 92, with an SD of 13.17. Can the professor have 95% confidence that the mean score for the class on the test would be above 70?
Solution Since the population standard deviation is not known, we use the t test.
$H_0: \mu \le 70$, $H_a: \mu > 70$
$\bar{x} = (63 + 93 + 75 + 68 + 83 + 92)/6 = 474/6 = 79$, $m = 70$, $s = 13.17$, $n = 6$
$t = \dfrac{\bar{x} - m}{s/\sqrt{n}} = \dfrac{79 - 70}{13.17/\sqrt{6}} = \dfrac{9}{5.38} = 1.67$
df = n-1 = 6-1 = 5
Note that we are testing only whether the mean score of the students is greater than 70, so we are dealing with a one-tailed t-test.
The tabulated one-tailed t value at the 5% significance level with df = 5 is 2.015. The calculated t (1.67) is less than the tabulated t (calculated t < tabulated $t_{0.05,5}$), so the null hypothesis is not rejected (accepted): the professor cannot be 95% confident that the class mean is above 70.
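The one-sample t calculation can be sketched as below. The variable names are illustrative, the SD is the value stated in the example (13.17) rather than one recomputed from the six scores, and the critical value 2.015 is simply copied from the t table because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean

scores = [63, 93, 75, 68, 83, 92]
hypothesized_mean = 70
stated_sd = 13.17      # SD as stated in the example, not recomputed here
tabulated_t = 2.015    # one-tailed, alpha = 0.05, df = 5 (taken from the t table)

n = len(scores)
t = (mean(scores) - hypothesized_mean) / (stated_sd / sqrt(n))
df = n - 1
reject_h0 = t > tabulated_t   # one-tailed test in the "greater than" direction

print(f"mean = {mean(scores)}, t = {t:.2f}, df = {df}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```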
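Independent sample t-test The independent sample t-test is used to compare the means of two independent groups. It usually involves a qualitative independent (grouping) variable with two categories and a quantitative continuous dependent variable, such as the height of males and females or the blood pressure of two groups. Example: testing whether male income and female income are different or not.
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, where $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Ex: Here is the blood pressure of male and female patients. The question is whether the blood pressure of the two groups differs.
            Male   Female
n           25     25
$\bar{x}$   155    160
S           10     8
Solution $H_0: \mu_1 = \mu_2$, $H_a: \mu_1 \neq \mu_2$
$s_p^2 = \dfrac{24(10^2) + 24(8^2)}{48} = 82$, so $t = \dfrac{155 - 160}{\sqrt{82\left(\dfrac{1}{25} + \dfrac{1}{25}\right)}} = \dfrac{-5}{2.56} = -1.95$
df = $n_1 + n_2 - 2$ = 25 + 25 - 2 = 48. At the 5% significance level the tabulated two-tailed t is about 2.01 (tables that jump from df = 40 to df = 60 give 2.021 at df = 40). Ignoring the sign, t calculated (1.95) < t tabulated, so the null hypothesis is not rejected (accepted). We can conclude that the two means (the mean male blood pressure and the mean female blood pressure) are not significantly different.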
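The blood-pressure comparison can be reproduced from the summary statistics alone. The sketch below assumes the pooled-variance form of the independent t-test, which is the one that goes with df = n1 + n2 - 2; the function name `pooled_t` is hypothetical and the critical value is the one quoted on the slide.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and degrees of freedom from summary data."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Male: mean 155, SD 10, n 25; female: mean 160, SD 8, n 25
t, df = pooled_t(155, 10, 25, 160, 8, 25)
tabulated_t = 2.021   # two-tailed 5% value quoted on the slide

print(f"t = {t:.2f}, df = {df}")
print("Reject H0" if abs(t) > tabulated_t else "Do not reject H0")
```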
Paired sample t test The paired sample t test is used to test the mean difference of two dependent observations, such as blood pressure before exercise and blood pressure after exercise for the same group of individuals. In the independent t test we were interested in between-group differences, but in the paired t test we are interested in within-group differences.
$t = \dfrac{\bar{d}}{s_d/\sqrt{n}}$, where $\bar{d}$ is the mean of the differences between the pairs (e.g. before and after), $s_d$ is the standard deviation of those differences and $n$ is the number of pairs.
Example Here is the temperature of 8 individuals before and after the treatment.
Patient   Before (X)   After (Y)
1         25.8         24.7
2         26.7         25.8
3         27.3         26.3
4         26.1         25.2
5         26.4         25.5
6         27.4         26.6
7         27.1         26.0
8         26.2         25.0
Solution Let us first calculate $d$ and $d^2$.
Patient   Before (X)   After (Y)   d = x - y   $d^2$
1         25.8         24.7        1.1         1.21
2         26.7         25.8        0.9         0.81
3         27.3         26.3        1.0         1.00
4         26.1         25.2        0.9         0.81
5         26.4         25.5        0.9         0.81
6         27.4         26.6        0.8         0.64
7         27.1         26.0        1.1         1.21
8         26.2         25.0        1.2         1.44
Total                              ∑d = 7.9    ∑$d^2$ = 7.93
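$\bar{d} = \sum d / n = 7.9/8 = 0.9875$
Variance of d: $s_d^2 = \dfrac{\sum d^2 - (\sum d)^2/n}{n - 1} = \dfrac{7.93 - 7.80}{7} = 0.0184$, so $s_d = \sqrt{0.0184} = 0.1356$
$t = \dfrac{\bar{d}}{s_d/\sqrt{n}} = \dfrac{0.9875}{0.1356/\sqrt{8}} \approx 20.6$
The tabulated two-tailed t value with df = 8-1 = 7 at the 5% significance level is 2.365, so the calculated t > the tabulated t. Decision: the null hypothesis is rejected and the alternative hypothesis is accepted. We conclude that the temperature of the individuals before and after treatment is not the same.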
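The paired calculation can be verified directly from the before and after readings. This is a minimal sketch with illustrative variable names; `statistics.stdev` gives the sample SD (dividing by n - 1), and the critical value 2.365 is taken from the t table.

```python
from math import sqrt
from statistics import mean, stdev

before = [25.8, 26.7, 27.3, 26.1, 26.4, 27.4, 27.1, 26.2]
after = [24.7, 25.8, 26.3, 25.2, 25.5, 26.6, 26.0, 25.0]

# Work on the within-patient differences, not on the two columns separately.
d = [x - y for x, y in zip(before, after)]
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                 # sample SD of the differences (n - 1 in the denominator)
t = d_bar / (s_d / sqrt(n))
df = n - 1
tabulated_t = 2.365            # two-tailed, alpha = 0.05, df = 7 (from the t table)

print(f"d_bar = {d_bar:.4f}, s_d = {s_d:.4f}, t = {t:.1f}, df = {df}")
print("Reject H0" if abs(t) > tabulated_t else "Do not reject H0")
```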
Analysis of Variance (ANOVA or F test)
Analysis of variance is a statistical method for analyzing data with the objective of comparing three or more group means. It extends the t-test, which compares only two group means. Analysis of variance is sometimes called the F test, after R. A. Fisher, the British statistician who developed it.
One-way ANOVA: used when we have one continuous dependent variable and one categorical independent variable with more than two categories, to compare the means of these groups. Example: if we want to know whether people residing in three different areas (rural, urban and semi-urban) earn different incomes.
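How to calculate One-Way ANOVA (K = number of groups, $n_j$ = size of group j, n = total number of observations)
1) $F = MSS_{BG} / MSS_{WG}$
2) $SS_T = \sum x^2 - \dfrac{(\sum x)^2}{n}$, or $SS_T = SS_{BG} + SS_{WG}$
3) $SS_{BG} = \sum_j \dfrac{(\sum x_j)^2}{n_j} - \dfrac{(\sum x)^2}{n}$
4) $SS_{WG} = SS_T - SS_{BG}$
5) $MSS_{BG} = \dfrac{SS_{BG}}{K - 1}$
6) $MSS_{WG} = \dfrac{SS_{WG}}{n - K}$
7) F test: $F = \dfrac{MSS_{BG}}{MSS_{WG}}$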
Example Three different treatments are given to 3 groups of patients with anemia. The increase in Hb% level was noted after one month and is given in the table below. We are interested in finding whether the difference in improvement among the 3 groups is significant or not.
Group A   Group B   Group C
$x_1$     $x_2$     $x_3$
3         3         3
1         2         4
2         2         5
0         3         4
1         1         2
2         3         2
2         2         4
Solution
Group A   Group B   Group C   Group A   Group B   Group C
$x_1$     $x_2$     $x_3$     $x_1^2$   $x_2^2$   $x_3^2$
3         3         3         9         9         9
1         2         4         1         4         16
2         2         5         4         4         25
0         3         4         0         9         16
1         1         2         1         1         4
2         3         2         4         9         4
2         2         4         4         4         16
∑ = 11    ∑ = 16    ∑ = 24    ∑ = 23    ∑ = 40    ∑ = 90
$\sum x^2 = 23 + 40 + 90 = 153$, $\sum x = 11 + 16 + 24 = 51$
2) $SS_T = \sum x^2 - \dfrac{(\sum x)^2}{n} = 153 - \dfrac{51^2}{21} = 153 - 123.86 = 29.14$
3) $SS_{BG} = \dfrac{11^2 + 16^2 + 24^2}{7} - \dfrac{51^2}{21} = 136.14 - 123.86 = 12.28$
4) $SS_{WG} = SS_T - SS_{BG} = 29.14 - 12.28 = 16.86$
5) $MSS_{BG} = \dfrac{12.28}{2} = 6.14$
6) $MSS_{WG} = \dfrac{16.86}{18} = 0.94$
7) $F = \dfrac{6.14}{0.94} = 6.53$
Source of variation   Degrees of freedom    Sum of squares   Mean squares   F
Between groups        K-1 = 3-1 = 2         12.28            6.14           6.53
Within groups         n-K = 21-3 = 18       16.86            0.94
Interpretation The tabulated F value at df (2, 18) is 3.55 at the 5% level of significance. Our calculated F value is 6.53, which is greater than the tabulated F value (F calculated > F tabulated: 6.53 > 3.55). Thus the null hypothesis is rejected. Hence we conclude that the mean increase in Hb% of at least one group differs significantly from the others.
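The sums of squares and the F ratio can be recomputed in a few lines. This is a sketch under one stated assumption: the fourth Group A value, which is blank in the table, is taken to be 0, because that is the only value consistent with the group total of 11, the sum of squares of 23 and n = 21. The variable names are illustrative. Since the code keeps full precision, F comes out near 6.56 rather than the 6.53 obtained from rounded intermediate values.

```python
groups = {
    "A": [3, 1, 2, 0, 1, 2, 2],   # the 0 is an assumption for the blank table cell
    "B": [3, 2, 2, 3, 1, 3, 2],
    "C": [3, 4, 5, 4, 2, 2, 4],
}

all_values = [x for g in groups.values() for x in g]
n_total = len(all_values)
k = len(groups)
correction = sum(all_values) ** 2 / n_total          # (sum of x)^2 / n

ss_total = sum(x * x for x in all_values) - correction
ss_between = sum(sum(g) ** 2 / len(g) for g in groups.values()) - correction
ss_within = ss_total - ss_between

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_ratio = ms_between / ms_within
tabulated_f = 3.55                                   # alpha = 0.05, df = (2, 18), from the F table

print(f"SS_T = {ss_total:.2f}, SS_BG = {ss_between:.2f}, SS_WG = {ss_within:.2f}")
print(f"F = {f_ratio:.2f}")
print("Reject H0" if f_ratio > tabulated_f else "Do not reject H0")
```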
Chapter Nine Association, Correlation and Predictions
Chi-square Test
A chi-square ($\chi^2$) test is used to test for a statistical association between two categorical variables, each with two or more categories (usually two).
df = (r-1)(c-1), where r = number of rows and c = number of columns.
Example Suppose a researcher wants to test whether people's knowledge level is associated with service utilization. He conducted a sample survey of 100 individuals, of whom 78 had a high level of knowledge. Of these 78 who had good knowledge, 50 were service users, whereas of the 22 who had a low knowledge level, 10 used the service. Do these data provide evidence of an association between knowledge level and service utilization?
1. Data:
Knowledge level   Service user   Non-user   Total
High              50             28         78
Low               10             12         22
Total             60             40         100
2. Assumption: the observations are independent and the sample was drawn randomly.
3. Hypothesis: $H_0$: There is no association between "knowledge level" and "service utilization". $H_a$: There is an association between "knowledge level" and "service utilization".
4. Level of significance: α = 5% (0.05)
7. Compute the degrees of freedom (df): df = (r-1)(c-1) = (2-1)(2-1) = 1
8. Tabulated value of $\chi^2$: with df = 1 and 5% level of significance = 3.84
9. Compare the computed value with the tabulated value: calculated $\chi^2$ (2.486) < tabulated $\chi^2$ (3.84)
10. Decision: $H_0$ is not rejected (accepted).
11. Conclusion: the data do not provide evidence of an association between knowledge level and service utilization.
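The calculated chi-square value comes from comparing observed counts with the counts expected under no association, $\chi^2 = \sum (O - E)^2 / E$ with $E$ = (row total × column total) / grand total. The sketch below applies this standard Pearson formula to the 2×2 table implied by the example (50 and 28 for high knowledge, 10 and 12 for low knowledge); the variable names are illustrative. It comes to roughly 2.49 for this table, and small differences from a hand calculation are rounding.

```python
observed = [
    [50, 28],   # high knowledge: service users, non-users
    [10, 12],   # low knowledge:  service users, non-users
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under H0 (no association between the two variables).
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
tabulated_chi2 = 3.84   # alpha = 0.05, df = 1, from the chi-square table

print(f"chi-square = {chi2:.3f}, df = {df}")
print("Reject H0" if chi2 > tabulated_chi2 else "Do not reject H0")
```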
Correlation analysis
When one quantitative variable changes as another quantitative variable changes, the two are said to be correlated. The variable that drives the change is called the independent variable (IV) and the variable that is changed is called the dependent variable (DV). The DV is represented by Y and the IV is represented by X.
Example: Income and age are both quantitative. They are correlated because when age changes, income changes as well. Therefore age is X (the IV) while income is Y (the DV). When the change occurs at a fixed rate it is called linear correlation. The correlation between one DV and one IV is called simple correlation, e.g. the correlation between income and age.
The correlation between one DV and two or more IVs is called multiple correlation, e.g. the correlation between income, age and family size. Correlation Coefficient (r) To calculate the correlation between variables, we use a measure called the correlation coefficient (r).
Characteristics of relationship The correlation coefficient (r) indicates both the strength and the direction of the relationship. Strength (magnitude) of the relationship: a correlation coefficient of zero indicates no correlation; ≤ 0.3 = weak correlation; 0.4-0.6 = moderate correlation; 0.7-1 = strong correlation.
When the correlation coefficient is one (either + or -) it indicates a perfect correlation. As r approaches 1 (either + or -), the strength of the relationship increases.
Direction of relationship: the relationship can be positive, negative or absent. Positive correlation is when the two variables move in the same direction (increase or decrease together), e.g. gestational period and birth weight. This is when r is positive. Negative correlation is when the two variables move in opposite directions (when one increases the other decreases), e.g. age and eyesight. This is when r is negative.
No correlation is when a change in one variable is not accompanied by a change in the other variable, e.g. age and sex. This is when r = 0. Example: Suppose 4 persons were selected as a sample to determine the correlation between weight and height.
Interpretation There is a very strong positive correlation between the weight and height of the respondents.
Coefficient of Determination ($r^2$) The square of r is called the coefficient of determination. The coefficient of determination ($r^2$) measures the proportion of the variability in Y (DV) that is explained by X (IV). It is usually expressed as a percentage.
Example: for the above example the correlation coefficient (r) is 0.97, thus the coefficient of determination ($r^2$) is 0.97 × 0.97 = 0.94, or 94%. Interpretation: 94% of the variability in weight (DV) is explained by height (IV). This means the remaining 6% of the variability in weight is attributable to variables other than height.
Correlation Significance Test To test the significance of the correlation value we use the following formula to find the calculated t-value:
$t = r\sqrt{\dfrac{n - 2}{1 - r^2}} = 0.97 \times \sqrt{\dfrac{2}{0.06}} = 0.97 \times 5.77 = 5.6$ (calculated t-value)
Then, assuming a significance level of 0.05, we take the degrees of freedom, which here are n - 2 = 2, and look in the t-table for the value at the intersection of the significance level and the degrees of freedom. The tabulated t-value for a two-tailed test at the 0.05 significance level with 2 degrees of freedom is 4.303.
Since the calculated t-value of 5.6 is greater than the tabulated t-value of 4.303, the null hypothesis is rejected. Therefore we can conclude that there is a significant, very strong positive correlation between the height and weight of our participants.
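The correlation coefficient, coefficient of determination and t-value can be computed together. The sketch below assumes that the four height and weight pairs are the ones tabulated in the regression example (they reproduce r = 0.97), and it uses the standard Pearson formula with df = n - 2 for the significance test. Variable names are illustrative, and unrounded arithmetic gives t close to 5.63.

```python
from math import sqrt

weight = [240, 210, 180, 160]   # y, pounds (assumed to be the regression-example data)
height = [73, 70, 69, 68]       # x, inches
n = len(weight)

sum_x, sum_y = sum(height), sum(weight)
s_xy = sum(x * y for x, y in zip(height, weight)) - sum_x * sum_y / n
s_xx = sum(x * x for x in height) - sum_x**2 / n
s_yy = sum(y * y for y in weight) - sum_y**2 / n

r = s_xy / sqrt(s_xx * s_yy)                 # Pearson correlation coefficient
r_squared = r**2                             # coefficient of determination
t = r * sqrt((n - 2) / (1 - r_squared))      # test statistic for H0: no correlation
df = n - 2

print(f"r = {r:.2f}, r^2 = {r_squared:.2%}, t = {t:.2f}, df = {df}")
```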
Regression Analysis: A statistical procedure used to find relationships among a set of variables. In regression analysis there is a dependent variable, which is the one you are trying to explain, and one or more independent variables that are related to it.
REGRESSION TYPES
1) Linear regression = quantitative DV
   simple (1 DV and 1 IV)
   multiple (multiple IVs and 1 DV)
2) Logistic regression = qualitative DV
   A) Binary = DV with 2 categories
      simple (1 DV and 1 IV)
      multiple (multiple IVs and 1 DV)
   B) Multinomial = DV with > 2 categories
   C) Ordinal = DV which is ordinal
Linear Regression: Linear regression is used when the dependent variable is continuous and is assumed to have a linear relationship with the independent variables. It aims to find the best-fitting line that represents the relationship between the dependent variable and one or more independent variables.
For example, a study might use linear regression to determine the relationship between smoking behavior (independent variable) and lung function (dependent variable) among a sample of individuals.
Logistic Regression: Logistic regression is used when the dependent variable is categorical. It models the probability of an event occurring or the likelihood of an outcome belonging to a particular category. The dependent variable is usually binary (e.g. yes/no, success/failure), but it can also be multinomial (more than two categories) or ordinal (ordered categories).
Why is regression analysis superior compared with chi-square and correlation?
1. Prediction capability: regression analysis allows for prediction, estimating the value of the dependent variable from the values of the independent variables.
2. Handling both categorical and numerical variables.
3. Control of confounding variables: regression analysis enables researchers to control for the effects of confounding variables by including them as independent variables in the model.
Confounding variables are factors that are associated with both the independent variable(s) and the dependent variable in a study. Age is frequently a confounding variable in health studies. Ex: when studying the association between a specific medication and heart disease risk, age must be considered as a confounding variable because older individuals are more likely to have both a higher heart disease risk and higher medication usage.
Regression equation: $Y = \beta_0 + \beta_1 X$, where Y = dependent variable and X = independent variable.
Beta 0 (constant / intercept) = the value of Y when X is zero. It shows how much the DV is if the IV is 0. Formula: $\beta_0 = \bar{Y} - \beta_1 \bar{X}$
Beta 1 (regression coefficient / slope) measures the amount of change in the DV (Y) for a one-unit change in the IV (X). It represents the relationship between the IV and the DV.
$\beta_1 = \dfrac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$
Example 1. The height and weight of 4 individuals are presented in the following table. Let us predict what the weight (DV) of an individual would be if his height (IV) is 80 inches.
Weight in pounds (y)   Height in inches (x)   $y^2$            $x^2$           $xy$
240                    73                     57600            5329            17520
210                    70                     44100            4900            14700
180                    69                     32400            4761            12420
160                    68                     25600            4624            10880
∑y = 790               ∑x = 280               ∑$y^2$ = 159700   ∑$x^2$ = 19614   ∑xy = 55520
$\beta_1 = \dfrac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \dfrac{55520 - \dfrac{280 \times 790}{4}}{19614 - \dfrac{280^2}{4}} = \dfrac{220}{14} = 15.7$
Interpretation of Beta 1: for every one-unit (inch) change in height there is a 15.7-unit (pound) change in weight.
$\beta_0 = \bar{Y} - \beta_1 \bar{X} = 197.5 - 15.7 \times 70 = -901.5$
Interpretation of Beta 0: if height is 0, the predicted weight would be -901.5 pounds, a value that cannot exist. A height of 0 lies far outside the observed data, so the constant has no practical meaning here.
Regression equation: $Y = \beta_0 + \beta_1 X = -901.5 + 15.7 \times 80 = 354.5$. Interpretation of the regression result: based on the distribution of these data, if height is 80 inches the predicted weight is about 354.5 pounds.
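The slope, intercept and prediction can be reproduced with the least-squares formulas for Beta 1 and Beta 0. This is a minimal sketch with illustrative variable names. Because it does not round the slope to 15.7, the intercept comes out as -902.5 and the prediction as about 354.6 instead of the hand-calculated -901.5 and 354.5.

```python
weight = [240, 210, 180, 160]   # y, pounds
height = [73, 70, 69, 68]       # x, inches
n = len(weight)

sum_x, sum_y = sum(height), sum(weight)
sum_xy = sum(x * y for x, y in zip(height, weight))
sum_x2 = sum(x * x for x in height)

# Least-squares slope and intercept for simple linear regression.
beta1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x**2 / n)
beta0 = sum_y / n - beta1 * sum_x / n


def predict_weight(height_in_inches: float) -> float:
    """Predicted weight in pounds from the fitted line."""
    return beta0 + beta1 * height_in_inches


print(f"beta1 = {beta1:.2f}, beta0 = {beta0:.1f}")
print(f"predicted weight at 80 inches = {predict_weight(80):.1f} pounds")
```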
Chapter Ten Estimation Estimation is a procedure for finding the value of a parameter based on the value of a statistic. Various techniques are available for different situations; here we limit the discussion to two types of estimation: point estimation and interval estimation.
Point Estimation Point estimation occurs when we estimate that the unknown parameter is equal to the calculated statistic, e.g. $\mu = \bar{x}$, $p = \hat{p}$ or $\sigma = s$. Remember that a statistic is a sample-based summary measure (e.g. $\bar{x}$) and a parameter is a population-based summary measure (e.g. $\mu$).
Interval estimation Interval estimation occurs when we estimate that the parameter lies within an interval. This interval is called the confidence interval. The likelihood that the confidence interval includes the parameter is called the confidence level. For example, a 95% confidence level means there is a 95% likelihood (chance) that the specified interval includes the parameter.
Estimation of a single population mean (μ) Example-1: The mean reading speed of a random sample of 81 Modern University students is 325 words per minute. Estimate the mean reading speed of all Modern University students (μ) if it is known that the standard deviation for all Modern University students is 45 words per minute.
Solution
Point estimation: $\mu = \bar{x}$. As the mean reading speed of the sample is 325 words per minute, the estimated mean reading speed of all Modern University students is also 325 words per minute.
Interval estimation for μ: $\mu = \bar{x} \pm Z \times SE(\bar{x})$, with Z = 1.96 and $SE(\bar{x}) = \sigma/\sqrt{n} = 45/\sqrt{81} = 5$, so 1.96 × 5 = 9.8 and 325 ± 9.8 = 315.2 to 334.8 words/minute.
This means that if 100 samples were selected from the university students, the intervals computed from about 95 of them would include the population mean.
Estimation of population mean differences ($\mu_1 - \mu_2$) Example-2: A random sample of 50 non-smokers has a mean life of 76 years with a standard deviation of 8 years, and a random sample of 65 smokers has a mean life of 68 years with a standard deviation of 9 years. A) What is the point estimate for the difference of the population means? B) Find a 95% C.I. for the difference in mean lifetime of non-smokers and smokers.
Solution
Point estimation of $\mu_1 - \mu_2$: $\mu_1 - \mu_2 = \bar{x}_1 - \bar{x}_2$. As the mean difference in life in the sample is 76 - 68 = 8 years, the estimated mean difference in the population is also 8 years.
Interval estimation of $\mu_1 - \mu_2$: $\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm 1.96 \times SE(\bar{x}_1 - \bar{x}_2)$, with $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} = \sqrt{\dfrac{8^2}{50} + \dfrac{9^2}{65}} = 1.59$, so 1.96 × 1.59 = 3.1 and 8 ± 3.1 = 4.9 to 11.1 years.
So the population mean life difference between the two groups is estimated to lie in the range from 4.9 to 11.1 years.
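Both interval estimates for means follow the same pattern: point estimate ± 1.96 × standard error. The sketch below wraps that pattern in two small helper functions whose names are assumptions for illustration, and applies them to the reading-speed and lifetime examples.

```python
from math import sqrt

Z_95 = 1.96   # two-tailed z value for a 95% confidence level


def ci_mean(x_bar, sigma, n, z=Z_95):
    """Confidence interval for a single population mean with known sigma."""
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


def ci_mean_difference(mean1, sd1, n1, mean2, sd2, n2, z=Z_95):
    """Large-sample confidence interval for the difference of two means."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se


low, high = ci_mean(325, 45, 81)                             # reading speed
d_low, d_high = ci_mean_difference(76, 8, 50, 68, 9, 65)     # non-smokers vs smokers

print(f"mean reading speed: {low:.1f} to {high:.1f} words/minute")
print(f"difference in mean life: {d_low:.1f} to {d_high:.1f} years")
```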
Estimation of a single population proportion ($p$) Example: An epidemiologist is worried about the ever-increasing trend of malaria in a certain locality and wants to estimate the proportion of persons infected in the peak malaria transmission period. If he takes a random sample of 150 persons in that locality during the peak transmission period and finds that 60 of them are positive for malaria, find a) the point estimate for $p$ and b) the 95% CI for $p$.
Solution
Point estimation of $p$: $p = \hat{p} = 60/150 = 40\%$. That is, the estimated proportion of malaria-positive people in the population is 40%.
Interval estimation of $p$: $p = \hat{p} \pm 1.96 \times SE(\hat{p})$, with $SE(\hat{p}) = \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\dfrac{0.4 \times 0.6}{150}} = 0.04$, so 1.96 × 0.04 = 0.078, or 7.8%, and 40% ± 7.8% = 32.2% to 47.8%.
So the proportion of malaria-positive individuals in the population is estimated to lie between 32.2% and 47.8%.
Estimation of population proportion differences ($p_1 - p_2$) Example: Two groups each consist of 100 patients who have leukemia. A new drug is given to the first group but not to the second (the control group). It is found that in the first group 75 people have remission for 2 years, but only 60 in the second group. Find 95% confidence limits for the difference in the proportion of all patients with leukemia who have remission for 2 years.
Solution
Point estimation of $p_1 - p_2$: $p_1 - p_2 = \hat{p}_1 - \hat{p}_2 = 75\% - 60\% = 15\%$. That is, the estimated proportion difference between the two groups is 15%.
Interval estimation of $p_1 - p_2$: $p_1 - p_2 = (\hat{p}_1 - \hat{p}_2) \pm 1.96 \times SE(\hat{p}_1 - \hat{p}_2)$, with $SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = \sqrt{\dfrac{0.75 \times 0.25}{100} + \dfrac{0.60 \times 0.40}{100}} = 0.065$, or 6.5%, so 1.96 × 6.5% = 12.7% and 15% ± 12.7% = 2.3% to 27.7%.
So the population proportion difference is estimated to lie somewhere between 2.3% and 27.7%.
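The two proportion intervals can be checked the same way. The helper names below are hypothetical, and the code keeps full precision, so the last digit may differ slightly from hand calculations that round the standard error first (for example about 2.2% to 27.8% for the leukemia example).

```python
from math import sqrt

Z_95 = 1.96   # two-tailed z value for a 95% confidence level


def ci_proportion(successes, n, z=Z_95):
    """Confidence interval for a single population proportion."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def ci_proportion_difference(x1, n1, x2, n2, z=Z_95):
    """Confidence interval for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


p_low, p_high = ci_proportion(60, 150)                        # malaria example
d_low, d_high = ci_proportion_difference(75, 100, 60, 100)    # leukemia example

print(f"malaria prevalence: {p_low:.1%} to {p_high:.1%}")
print(f"remission difference: {d_low:.1%} to {d_high:.1%}")
```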