03. Summarizing data biostatic - Copy.pptx

timhirteshetu6125 20 views 62 slides Mar 06, 2025


CHAPTER THREE Summarizing Data

Introduction
The basic problem of statistics can be stated as follows: consider a sample of data $x_1, \ldots, x_n$, where $x_1$ corresponds to the first sample point and $x_n$ corresponds to the $n$th sample point.
Notation: $\sum$ is read as "sigma" (the Greek capital letter for S) and means "the sum of".
Suppose $n$ values of a variable are denoted as $x_1, x_2, x_3, \ldots, x_n$.

Cont'd..
$\sum x_i = x_1 + x_2 + x_3 + \cdots + x_n$
$\sum x_i^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$
$(\sum x_i)^2 = (x_1 + x_2 + x_3 + \cdots + x_n)^2$
where the subscript $i$ ranges from 1 up to $n$.
Example: Let $x_1 = 2$, $x_2 = 5$, $x_3 = 1$, $x_4 = 4$, $x_5 = 10$, $x_6 = -5$, $x_7 = 8$. Since there are 7 observations, $i$ ranges from 1 up to 7.

Introduction…
i) $\sum x_i = 2 + 5 + 1 + 4 + 10 - 5 + 8 = 25$
ii) $(\sum x_i)^2 = (25)^2 = 625$
iii) $\sum x_i^2 = 4 + 25 + 1 + 16 + 100 + 25 + 64 = 235$
Example 2: For the data 21, 12, 15, 12, 15, 13, 10, 11, 8, 7, 6, 4, compute a) $\sum x_i$ b) $(\sum x_i)^2$ c) $\sum x_i^2$
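The three sums in the worked example are easy to confuse, so a minimal Python sketch may help. The variable names (`x`, `sum_x`, `square_of_sum`, `sum_of_squares`) are illustrative and not part of the slides.

```python
# Observations x_1 ... x_7 from the worked example.
x = [2, 5, 1, 4, 10, -5, 8]

sum_x = sum(x)                            # sum of the values
square_of_sum = sum_x ** 2                # square of the sum
sum_of_squares = sum(v ** 2 for v in x)   # sum of the squared values

print(sum_x, square_of_sum, sum_of_squares)  # 25 625 235
```

Note that the square of the sum (625) and the sum of squares (235) are different quantities.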

Summarizing Data
Two methods are commonly used:
- Measuring Central Tendency (MCT)
- Measuring Variability/Dispersion

I. Measuring Central Tendency (MCT)
- The tendency of statistical data to concentrate at certain values is called "central tendency"
- The various methods of determining the actual value at which the data tend to concentrate are called measures of central tendency, or averages
- The most important objective of calculating an MCT is to determine a single figure that may be used to represent a whole series involving magnitudes of the variable
- Since an MCT represents the entire data set, it facilitates comparison within one group or between groups of data

Characteristics of a good MCT
- It should be based on all observations
- It should not be affected by extreme values
- It should have a definite value
- It should not require complicated computation
- It should be capable of further algebraic treatment
- It should be close to the location where the majority of the observations lie

Commonly used MCT
1. The Arithmetic Mean or simple Mean
2. Median
3. Mode
4. Geometric Mean
5. The Harmonic Mean (HM)
Average: a figure that best represents the location of the distribution

1. The Arithmetic Mean or Mean
- Is the sum of all observations divided by the number of observations (the sum of the values divided by the number of cases)
- Is called an average
- Is usually abbreviated to "mean"
- Is the most familiar measure of central tendency

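A. Mean for Ungrouped Data:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Mean for Ungrouped Data…
Example: We use the following data set of 10 numbers to illustrate the computations: 19, 21, 20, 20, 34, 22, 24, 27, 27, 27
Then, mean $= \frac{19 + 21 + \cdots + 27}{10} = \frac{241}{10} = 24.1$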
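The same computation as a short Python sketch. The function name `arithmetic_mean` is an illustrative choice, not something defined in the slides.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

# The 10 numbers from the example above.
data = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
print(arithmetic_mean(data))  # 24.1
```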

B. Mean for Grouped Data
Assume all values in an interval are located at the midpoint of the interval. The formula is given as:
$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{n}$
Where:
- $k$ is the number of class intervals
- $m_i$ is the midpoint of the $i$th class interval
- $f_i$ is the frequency of the $i$th class interval
- $n$ is the total number of observations
NB: Each value within the interval is represented by the midpoint of the true class interval
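As a sketch of the grouped-mean formula, the code below applies it to the leisure-time frequency table that appears later in this chapter (80 college students, classes 10-14 up to 35-39 hours). The function name `grouped_mean` and the pair-of-limits representation are assumptions made for illustration.

```python
def grouped_mean(class_limits, frequencies):
    """Mean for grouped data: sum of (midpoint * frequency) divided by n."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Leisure-time table: class limits in hours and their frequencies.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
freqs = [8, 28, 27, 12, 4, 1]

mean_hours = grouped_mean(classes, freqs)
print(round(mean_hours, 2))  # 20.69
```

The midpoint of the stated class limits (for example 12 for 10-14) is the same as the midpoint of the true class boundaries (9.5 to 14.5), so either can be used here.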

Mean…
- The arithmetic mean is a very natural measure of central location
- However, one of its principal limitations is that it is overly sensitive to extreme values

Characteristics of Mean
- The value of the arithmetic mean is determined by every item in the series
- It is greatly affected by extreme values
- The sum of the deviations about it is zero
- The sum of the squares of deviations from the arithmetic mean is less than that computed from any other point

Advantages & Disadvantages of mean
Advantages
1) It is based on all values given in the distribution
2) It is most easily understood
3) It is most amenable to algebraic treatment
Disadvantages
1) It may be greatly affected by extreme items, and its usefulness as a "summary of the whole" may be considerably reduced
2) When the distribution has open-end classes, its computation must be based on an assumption, and therefore may not be valid

2. Median
- Is the value which divides the data into two equal halves, with half of the values being lower than the median and half higher than the median
- The median represents the middle of the ordered sample data
- When the sample size is odd, the median is the middle value
- When the sample size is even, the median is the midpoint (mean) of the two middle values

Median…
When n is the number of observations in a dataset, the median is calculated as follows:
- Sort the values into ascending order
- If you have an odd number of observations, the median is the middle observation
- If you have an even number of observations, the median is the arithmetic mean of the two middle observations

Median…
If the number of observations is odd: Median $= \left(\frac{n+1}{2}\right)$th observation.
If the number of observations is even: the median is the average of the two middle observations, that is, of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th observations.

Median…
Example 1: Compute the median for {1, 2, 3, 4, 5}.
The numbers are already sorted, so it is easy to see that the median is 3 (two numbers are less than 3 and two are bigger).
Example 2: Compute the median for {1, 2, 3, 4, 5, 6}.
The median is 3.5, since that is the midpoint between 3 and 4, computed as (3 + 4)/2. Note that three numbers are less than 3.5 and three are bigger, as the definition of the median requires.

Median…
Exercises
1. Compute the median of the following sample data.
a) 12, 11, 54, 55, 23, 15, 22, 18, 10
b) 11, 8, 6, 9, 20, 18, 13, 14
2. Consider the following data, which consist of white blood counts taken on admission of all patients entering a small hospital on a given day. Compute the median white blood count ($\times 10^3$): 7, 35, 5, 9, 8, 3, 10, 12, 8
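The sort-then-pick-the-middle rule can be sketched in Python as follows. The function name `median` is illustrative; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the ((n+1)/2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle values

print(median([1, 2, 3, 4, 5]))     # 3
print(median([1, 2, 3, 4, 5, 6]))  # 3.5
```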

Median for Grouped data
$\text{Median} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$
Where:
- $L_m$ = lower true class boundary of the median class
- $F_c$ = cumulative frequency of the class interval just above (preceding) the median class
- $f_m$ = absolute frequency of the median class
- $W$ = class width (class width of the median class)
- $n$ = total number of observations
The median class is the class that contains the $(n/2)$th observation.

Median…
Example 3: Consider the following grouped data on the amount of time (in hours) that 80 college students devoted to leisure activities during a typical school week.
Time     Frequency   Cumulative freq.
10-14    8           8
15-19    28          36
20-24    27          63
25-29    12          75
30-34    4           79
35-39    1           80
Total    80

Characteristics of median
1) It is an average of position
2) It is affected by the number of items rather than by extreme values
Advantages
- It is easily calculated and is not much affected by extreme values
- It is more typical of the series
- It may be located even when the data are incomplete, e.g., when the class intervals are irregular and the final classes have open ends

Characteristics of median…
Disadvantages
- The median is not as well suited to algebraic treatment as the arithmetic, geometric and harmonic means
- It is not as generally familiar as the arithmetic mean
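The slide does not show a worked solution for this table, so the following Python sketch applies the grouped-median formula to it. It assumes that times were recorded to the nearest hour, so the true lower class boundaries are 9.5, 14.5, 19.5 and so on; the function name `grouped_median` is illustrative.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median = L_m + ((n/2 - F_c) / f_m) * W for a frequency table."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= half:
            # This is the median class: it contains the (n/2)th observation.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive count")

# Assumed true lower boundaries for the classes 10-14, 15-19, ..., 35-39.
boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5]
freqs = [8, 28, 27, 12, 4, 1]

median_hours = grouped_median(boundaries, freqs, width=5)
print(round(median_hours, 2))  # 20.24
```

Here n/2 = 40, the median class is 20-24 (cumulative frequency reaches 63), F_c = 36 and f_m = 27, so the median is 19.5 + (4/27) × 5, about 20.24 hours.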

3. Mode
- Is the value which occurs most frequently
- The mode may not exist, and even if it does, it may not be unique
- It is the least useful (and least used) of the three measures of central tendency
- When the distribution has only one value with the highest frequency, it is called uni-modal
- If it has two values with equal and highest frequency, it is called bi-modal
- Similarly, it is possible to have a multi-modal distribution
Example: {1, 2, 2, 3, 3, 4, 4, 4, 5}. The mode is 4, so the data are uni-modal.
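A small Python sketch that returns every mode, so that uni-modal, bi-modal and "no mode" cases can be told apart. Treating a data set in which no value repeats as having no mode is a convention assumed here, and the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the list of most frequent values (empty if no value repeats)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []  # assumed convention: every value occurs once, so no mode exists
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3, 3, 4, 4, 4, 5]))  # [4]     uni-modal
print(modes([1, 1, 2, 2, 3]))              # [1, 2]  bi-modal
print(modes([1, 2, 3]))                    # []      no mode
```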

Mode for grouped data
- Usually refers to the modal class interval
- The modal class is the interval with the highest frequency
$\text{Mode} = L + W \times \frac{D_1}{D_1 + D_2}$
Where:
- $L$ = lower class limit of the modal class
- $D_1$ = excess of the modal frequency over the frequency of the next lower class
- $D_2$ = excess of the modal frequency over the frequency of the next higher class
- $W$ = size of the modal class interval

Mode for grouped data…
Example 1: Calculate the mode of the given data.
CL   5-15   15-25   25-35   35-45   45-55   55-65   65-75
F    8      12      17      29      31      5       3
- The modal class is 45-55, with a frequency of 31
- The lower class limit of the modal class is 45
- $D_1 = 31 - 29 = 2$
- $D_2 = 31 - 5 = 26$
- $W = 10$
$\text{Mode} = 45 + 10 \times \frac{31 - 29}{(31 - 29) + (31 - 5)} = 45 + 10 \times \frac{2}{28} \approx 45.7$

Mode for grouped data…
Example 2: Calculate the mode of the given data.
CL   0-10   10-20   20-40   40-60   60-80   80-100
F    10     15      25      30      14      6
- The modal class is ____, with a frequency of ___
- The lower class limit of the modal class is ___
- $D_1 =$
- $D_2 =$
- $W =$
- Mode =

Characteristics of Mode
- It is not affected by extreme values
- It is the most typical value of the distribution
Advantages
- Since it is the most typical value, it is the most descriptive average
- Since the mode is usually an "actual value", it indicates the precise value of an important part of the series
Disadvantages
- It is not capable of mathematical treatment
- In a small number of items the mode may not exist
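The grouped-mode formula can be checked with a short Python sketch. The function name `grouped_mode` is illustrative, and treating a missing neighbouring class (when the modal class is first or last) as having frequency 0 is an assumption made here, not something the slides state.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Mode = L + W * D1 / (D1 + D2) for the class with the highest frequency."""
    i = frequencies.index(max(frequencies))     # position of the modal class
    modal_f = frequencies[i]
    lower_f = frequencies[i - 1] if i > 0 else 0                      # next lower class
    higher_f = frequencies[i + 1] if i < len(frequencies) - 1 else 0  # next higher class
    d1 = modal_f - lower_f
    d2 = modal_f - higher_f
    if d1 + d2 == 0:
        raise ValueError("neighbouring classes tie with the modal class")
    return lower_limits[i] + width * d1 / (d1 + d2)

# Classes 5-15, 15-25, ..., 65-75 from the example, width 10.
limits = [5, 15, 25, 35, 45, 55, 65]
freqs = [8, 12, 17, 29, 31, 5, 3]

mode_value = grouped_mode(limits, freqs, width=10)
print(round(mode_value, 1))  # 45.7
```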

II. Measures of Variation/Dispersion
- While measures of central tendency are used to estimate the "central" value of a data set, measures of dispersion describe the spread of the data, or its variation around a central value
- Two distinct samples may have the same mean or median but completely different levels of variability, or vice versa
Set 1: 30, 40, 40, 50, 60, 60, 70 (Mean = 50)
Set 2: 48, 49, 49, 50, 50, 51, 53 (Mean = 50)

Measures of Variation/Dispersion…
The objective of measuring this scatter or dispersion is to obtain a single summary figure which adequately shows whether the distribution is compact or spread out.
Some of the commonly used measures of dispersion (variation) are:
1. Range (R)
2. Interquartile range (IQR)
3. Variance ($S^2$)
4. Standard deviation (SD)
5. Coefficient of variation (CV)

1. Range
- The difference between the highest and the smallest observation in the data
- It is the crudest measure of dispersion
- It is a measure of absolute dispersion and cannot be usefully employed for comparing the variability of two distributions expressed in different units
Range $= X_{max} - X_{min}$
Where $X_{max}$ = highest (maximum) value in the given distribution and $X_{min}$ = lowest (minimum) value in the given distribution

Characteristics of Range
- Since it is based upon two extreme cases in the entire distribution, the range may be considerably changed if either of the extreme cases happens to drop out, while the removal of any other case would not affect it at all
- It wastes information, for it takes no account of the rest of the data
- The extreme values may be unreliable; that is, they are the most likely to be faulty
- It is not suitable for the mathematical treatment required in deriving the techniques of statistical inference
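A minimal Python sketch of the range, applied to the two data sets introduced earlier in this section (Set 1 and Set 2, both with mean 50). The function name `data_range` is illustrative.

```python
def data_range(values):
    """Range = highest observation minus lowest observation."""
    return max(values) - min(values)

set_1 = [30, 40, 40, 50, 60, 60, 70]
set_2 = [48, 49, 49, 50, 50, 51, 53]

print(data_range(set_1))  # 40
print(data_range(set_2))  # 5
```

Both sets have the same mean, but the range shows that Set 1 is far more spread out than Set 2.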

2. Quantiles
- Are another approach that addresses some of the shortcomings of the range
- There are three types:
i. Quartiles: divide a given set of data into four equal parts
ii. Deciles: divide a given set of data into ten equal parts
iii. Percentiles: divide a given set of data into a hundred equal parts

A. Quartiles
- Quartiles divide a given ordered data set into four equal parts
- There are three quartiles: $Q_1$, $Q_2$ and $Q_3$
- About ¼ of the data falls on or below the first quartile $Q_1$
- About ½ of the data falls on or below the second quartile $Q_2$ (equivalent to the median)
- About ¾ of the data falls on or below the third quartile $Q_3$

Quartiles…
In order to identify the quartiles of a given dataset:
- Sort the values in increasing order
- Identify the quartiles accordingly:
$Q_1 = \left[\frac{n+1}{4}\right]$th observation
$Q_2 = \left[\frac{2(n+1)}{4}\right]$th observation
$Q_3 = \left[\frac{3(n+1)}{4}\right]$th observation
The inter-quartile range is the difference between the third and the first quartiles: $IQR = Q_3 - Q_1$

A. First Quartile
- Is called $Q_1$
- Is the lowest quartile
- It marks off the lowest 25% of the given data
- Its meaning is that 25% of the observations are below $Q_1$ and 75% of the observations are above $Q_1$
- It is calculated as: $Q_1 = \left[\frac{1(n+1)}{4}\right]$th observation $= 0.25(n+1)$th observation

B. Second Quartile
- Is called $Q_2$
- Is the middle quartile
- It marks off the lowest 50% of the given data
- Its meaning is that 50% of the observations are below $Q_2$ and 50% are above $Q_2$
- Is also called the median
- It is calculated as: $Q_2 = \left[\frac{2(n+1)}{4}\right]$th observation $= 0.5(n+1)$th observation

C. Third Quartile
- Is called $Q_3$
- Is the upper (highest) quartile
- It marks off the lowest 75% of the given data
- Its meaning is that 75% of the observations are below $Q_3$ and 25% are above $Q_3$
- It is calculated as: $Q_3 = \left[\frac{3(n+1)}{4}\right]$th observation $= 0.75(n+1)$th observation

Examples:
1. Let's assume the following dataset presents the ages of 8 factory workers: {18, 21, 23, 24, 24, 32, 42, 59}. Identify the first and the third quartiles.
Solution: First make sure that the data are sorted in increasing order.
$Q_1$ is the $\{0.25(n+1)\}$th observation
= $\{0.25(8+1)\}$th observation
= $\{0.25(9)\}$th observation
= $\{2.25\}$th observation

Examples…
That is, $Q_1$ lies a quarter of the distance between 21 and 23. This can be interpolated as: $21 + (23 - 21) \times 0.25 = 21.5$
The interpretation is that one fourth of the observations are below or equal to the value 21.5.
$Q_3$ is the $\{0.75(n+1)\}$th observation
= $\{6.75\}$th observation
= $32 + (42 - 32) \times 0.75 = 39.5$
The interpretation is that three fourths of the observations are below or equal to the value 39.5.

Examples…
2. Calculate $Q_1$, $Q_2$, $Q_3$ and the IQR, and give an interpretation, for the following dataset: 18, 29, 14, 42, 31, 23, 44, 32, 54
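The (n+1) position rule with linear interpolation, as used in this example, might be sketched in Python like this. The function name `quantile_by_position` is illustrative, and clamping positions that fall before the first or after the last observation is an assumption added so the function never indexes outside the data.

```python
def quantile_by_position(values, fraction):
    """Value at position fraction*(n+1) in the sorted data, interpolating linearly."""
    ordered = sorted(values)
    n = len(ordered)
    position = fraction * (n + 1)   # 1-based position, e.g. 2.25
    whole = int(position)           # whole part, e.g. 2
    part = position - whole         # fractional part, e.g. 0.25
    if whole < 1:
        return ordered[0]           # assumed: clamp to the smallest observation
    if whole >= n:
        return ordered[-1]          # assumed: clamp to the largest observation
    below, above = ordered[whole - 1], ordered[whole]
    return below + part * (above - below)

ages = [18, 21, 23, 24, 24, 32, 42, 59]
q1 = quantile_by_position(ages, 0.25)
q3 = quantile_by_position(ages, 0.75)
print(q1, q3, q3 - q1)  # 21.5 39.5 18.0
```

Statistical packages use several slightly different quartile conventions, so their results may differ a little from this rule.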

2. Percentiles (Reading assignment)
- Divide the given set of observations into 100 equal parts
- Each group represents 1% of the data set
- There are 99 percentiles, termed $P_1$ through $P_{99}$
- The 25th percentile is the first quartile ($P_{25} = Q_1$)
- The 50th percentile is the median ($P_{50}$ = Median)
- The 75th percentile is the third quartile ($P_{75} = Q_3$)
The interpretation of percentiles is as follows:
- 1% of the data falls on or below $P_1$
- 2% of the data falls on or below $P_2$

Percentiles…
The $p$th percentile is defined as:
i. the $(K+1)$th observation, if $np/100$ is not an integer, where $K$ is the largest integer below $np/100$
ii. the average of the $(np/100)$th and $(np/100 + 1)$th observations, $\frac{(np/100)\text{th obs.} + (np/100 + 1)\text{th obs.}}{2}$, if $np/100$ is an integer

Examples:
1. Calculate $P_{25}$, $P_{50}$, $P_{75}$, $P_{80}$ and $P_{70}$, and give an interpretation, for the following dataset: 18, 29, 14, 42, 31, 23, 44, 32, 54
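A Python sketch of this two-case definition, illustrated on the ages of the 8 factory workers from the quartile example so that the exercise data are left untouched. The function name `percentile` is illustrative, and restricting `p` to whole numbers from 1 to 99 is an assumption that keeps the integer test exact.

```python
def percentile(values, p):
    """p-th percentile by the np/100 rule; p is a whole number from 1 to 99."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)   # k = whole part of np/100
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th observations.
        return (ordered[k - 1] + ordered[k]) / 2
    # np/100 is not an integer: take the (k+1)-th observation (index k, 0-based).
    return ordered[k]

ages = [18, 21, 23, 24, 24, 32, 42, 59]
print(percentile(ages, 25))  # 22.0
print(percentile(ages, 50))  # 24.0
print(percentile(ages, 80))  # 42
```

For these data this rule gives 22 for the 25th percentile, while the (n+1)/4 position rule gives 21.5 for the first quartile; the two conventions agree only approximately on small samples.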

2. Variance and standard deviation
- Measure how far, on average, the observations deviate from the mean
- The variance is defined as the sum of the squared deviations of each observation from the mean, divided by the total number of observations minus 1
- The variance is expressed in squared units and is therefore not an appropriate measure of dispersion when we wish to express this concept in terms of the original units
- To obtain a measure of dispersion in the original units, we take the square root of the variance (the standard deviation)

Variance and standard deviation…
- The standard deviation is the positive square root of the variance
- The standard deviation is the most commonly used measure of dispersion
- The standard deviation is, roughly, the average deviation from the mean (expressed in the original units)
- The standard deviation is a measure of absolute dispersion

Variance and standard deviation…
The formulas for the sample and population variance are given as follows:
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $\mu$ is the population mean and $N$ is the population size
Occasionally, the abbreviations SD for standard deviation and Var ($S^2$) for variance are used.

Variance and standard deviation…
The standard deviation for grouped data is calculated as:
$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{n - 1}}$
Where $S$ = standard deviation, $m_i$ = class mark, $\bar{x}$ = mean, $f_i$ = frequency, $n$ = number of observations

Why squared?
Why square the differences between the data values and the mean?
- Gives positive values
- Gives more weight to larger differences
- Has desirable statistical properties
Why n - 1 for the sample variance?
- Dividing by n underestimates the population variance
- Dividing by n - 1 gives an unbiased estimate of the population variance
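As a sketch of the grouped-data calculation, the code below uses the leisure-time frequency table from the median example earlier in this chapter. It assumes the sample form of the formula, with n - 1 in the denominator, and the function name `grouped_sd` is illustrative.

```python
import math

def grouped_sd(class_limits, frequencies):
    """Standard deviation for grouped data, using class marks and n - 1."""
    marks = [(lower + upper) / 2 for lower, upper in class_limits]  # class marks m_i
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(marks, frequencies)) / n
    squared_dev = sum(f * (m - mean) ** 2 for m, f in zip(marks, frequencies))
    return math.sqrt(squared_dev / (n - 1))  # assumed sample form (n - 1)

classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
freqs = [8, 28, 27, 12, 4, 1]

sd_hours = grouped_sd(classes, freqs)
print(round(sd_hours, 2))  # 5.38
```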
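The claim about n - 1 can be checked exactly on a tiny case. The sketch below takes a hypothetical population {1, 2, 3} (not from the slides), lists every possible sample of size 2 drawn with replacement, and averages the two versions of the sample variance over all of them.

```python
from itertools import product

population = [1, 2, 3]  # hypothetical toy population, chosen only for illustration
mu = sum(population) / len(population)
population_variance = sum((x - mu) ** 2 for x in population) / len(population)

def variance_with_divisor(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / divisor

# All 9 ordered samples of size n = 2, drawn with replacement.
samples = list(product(population, repeat=2))
avg_dividing_by_n = sum(variance_with_divisor(s, 2) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance_with_divisor(s, 1) for s in samples) / len(samples)

print(round(population_variance, 4))        # 0.6667
print(round(avg_dividing_by_n, 4))          # 0.3333
print(round(avg_dividing_by_n_minus_1, 4))  # 0.6667
```

Averaged over all samples, dividing by n gives a value below the population variance, while dividing by n - 1 recovers it.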

Variance and standard deviation…
Example: Find the standard deviation of the numbers 12, 6, 7, 3, 15, 10, 18, 5.
Solution:
$\bar{x} = \frac{12 + 6 + 7 + 3 + 15 + 10 + 18 + 5}{8} = 9.5$
The variance is $s^2 = \frac{(12 - 9.5)^2 + \cdots + (5 - 9.5)^2}{8 - 1} = \frac{190}{7} \approx 27.14$
The standard deviation is $s = \sqrt{27.14} \approx 5.21$

Variance and standard deviation…
Advantages:
- They accommodate further mathematical applications (SD)
- They are calculated from all of the observations
Disadvantages:
- They must always be understood in the context of the mean of the data
- Thus it is difficult to compare the standard deviation or variance of two datasets measured in two different units

Example
1. Consider the data on the weight of 10 newborn children at Zewiditu hospital within a month: 2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Calculate a) Range (1.27) b) Variance (0.190) c) Standard deviation (0.44)
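Computing this example directly from the definition of the sample variance gives a sum of squared deviations of 190, a variance of 190/7 (about 27.14) and a standard deviation of about 5.21. A minimal Python sketch, with the illustrative function name `sample_variance`:

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

data = [12, 6, 7, 3, 15, 10, 18, 5]
s2 = sample_variance(data)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))  # 27.14 5.21
```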

3. Coefficient of variation (CV)
- Is a measure of relative variation/dispersion
- Is used to compare the variation of distributions with different units, relative to their means
- It is also sometimes called the coefficient of dispersion
- It is a good way to compare measures of dispersion between different samples whose values don't necessarily have the same magnitude (or, for that matter, the same units!)

Coefficient of variation…
- The standard formulation of the CV is the ratio of the standard deviation to the mean of a given data set: $CV = \frac{S}{\bar{x}} \times 100\%$
- The coefficient of variation is a dimensionless number
- So when comparing data sets with different units, one should use the CV instead of the SD
- The CV is useful in comparing the variability of several different samples, each with a different arithmetic mean, as higher variability is expected when the mean increases
- The CV is also important for comparing the reproducibility of variables

Coefficient of variation…
Example 1: One patient's blood pressure, measured daily over several weeks, averaged 182 with a standard deviation of 12.6, while that of another patient averaged 124 with a standard deviation of 9.4. Which patient's blood pressure is relatively more variable?

Given $s_1 = 12.6$, $s_2 = 9.4$, $\bar{x}_1 = 182$, $\bar{x}_2 = 124$:
$CV_1 = \frac{12.6}{182} \times 100 \approx 6.9\%$
$CV_2 = \frac{9.4}{124} \times 100 \approx 7.6\%$
The blood pressure of the second patient is relatively more variable.

Example 2
Suppose two samples of male individuals yield the following results. A comparison of the standard deviations might lead one to conclude that the two samples possess equal variability.
                     Sample 1     Sample 2
Age                  25 years     11 years
Mean weight          145 pounds   80 pounds
Standard deviation   10 pounds    10 pounds
We wish to know which is more variable: the weights of the 25-year-olds or the weights of the 11-year-olds.

If we compute the coefficients of variation, however, we have for the 25-year-olds
$CV = \frac{10}{145} \times 100 = 6.9$
and for the 11-year-olds
$CV = \frac{10}{80} \times 100 = 12.5$
If we compare these results, we get quite a different impression: relative to their mean, the weights of the 11-year-olds are more variable.
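Both coefficient-of-variation examples (the two blood-pressure patients and the two weight samples) can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, expressed as a percentage."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return sd / mean * 100

# Example 1: blood pressure of two patients.
cv_patient_1 = coefficient_of_variation(12.6, 182)
cv_patient_2 = coefficient_of_variation(9.4, 124)
print(round(cv_patient_1, 1), round(cv_patient_2, 1))  # 6.9 7.6

# Example 2: weights of 25-year-olds and 11-year-olds.
cv_age_25 = coefficient_of_variation(10, 145)
cv_age_11 = coefficient_of_variation(10, 80)
print(round(cv_age_25, 1), round(cv_age_11, 1))        # 6.9 12.5
```

In each pair the group with the larger CV is relatively more variable, even where the standard deviations are equal or ordered the other way.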

Example 1. The following table shows the number of hours 45 hospital patients slept following administration of a certain anesthetic medication (10 pts):
7 10 12 4 8 7 3 8 5 12 11 3 8 1 1
13 10 4 4 5 5 8 7 7 3 2 3 8 13 1
7 17 3 4 5 5 3 1 17 10 4 7 7 11 8

After grouping the above data into a frequency distribution table, compute the following:
- Mean
- Median
- Mode
- Variance
- Standard deviation
- Coefficient of variation

Thank You!!!!!!