fundamentals of Biostatistics lecture note.pdf

University of Gondar
College of Medicine and Health science
Department of Epidemiology and Biostatistics
Biostatistics
By: Tilahun Y
February 2022

Chapter one: Introduction to Biostatistics
Basic statistical concepts
Classification of statistics
Types of variables
Application and limitation of statistics
Chapter two: Method of data collection and
presentation
Methods of data collection
Methods of data organization and
presentation
Data organization and presentation using
SPSS software

Chapter three: Summary measures
Measures of central tendency
Measures of dispersion
Measures of shape
Chapter four: Probability and probability
distributions
Definition of basic terms in probability
Set theory and probability
Types of probability
Random variable and probability distribution
Common probability distributions


Chapter five: Sampling methods and
sampling Distribution
Common terms used in sampling
Sampling methods
Errors in sampling
Sampling distribution
Chapter six: Statistical inference
Sampling distribution and its properties
Estimation
Hypothesis testing
Paired and independent sample t-tests
Concept of analysis of variance
Basic assumptions and application of
ANOVA
Chapter seven: Correlation and linear regression
Analysis of Correlation
Simple Linear regression model
Multiple linear regression
Model diagnostic tests
Chapter Eight: Categorical data analysis
Concept of categorical data analysis
Chi-square test for Categorical data analysis
Binary logistic regression

Poisson regression
Negative Binomial regression
Zero-inflated Poisson regression
Chapter ten: Survival Data Analysis
Basic concepts of survival analysis
Non parametric methods
Cox proportional hazard model
Extended Cox proportional hazard models

Concepts of non-parametric method
Types of non-parametric test
Applications of Non-Parametric test

Introduction to Biostatistics
Objective
Describe basic statistical concepts
Classify a given statistical statement as
descriptive or inferential
Identify types of variables
Describe application of statistics in
medicine/public health

The word 'Statistics' can be defined in two
senses:
a. In the plural sense:
Statistics is the raw data themselves, like
statistics of births, statistics of deaths, statistics
of students, statistics of imports and exports, etc.
b. In the singular sense:
Statistics is the subject that deals with the
collection, organization, presentation, analysis
and interpretation of numerical data
Biostatistics: the field of statistics in which
statistical methods are applied to medical and
biological studies.

Depending on how the data are used, statistics
can be classified into the following branches:
1. Descriptive Statistics: is concerned with the
collection, organization, summarization, and
presentation of data.
2. Inferential Statistics: consists of generalizing from
samples to populations, performing estimations and
hypothesis tests, determining relationships among
variables, and making predictions.
Inference is important because statistical data usually arise from
a sample, so statistical techniques based on probability theory are
required.

2/21/2022 11

Main reason: handling variations:
Biological variation
Among individuals as well as within same
individual over time
Example: height, weight, blood pressure, eye color
...
Sample variation: Biomedical research projects
are usually carried out on small numbers of study
subjects

Essential for scientific method of investigation
Formulate hypothesis
Design study to objectively test hypothesis
Collect reliable and unbiased data
Process and evaluate data rigorously
Interpret and draw appropriate conclusions

Limitations of statistics
1. It deals only with those subjects of inquiry
that are capable of being quantitatively
measured and numerically expressed.
2. It deals with aggregates of facts, and no
importance is attached to individual items.
- It is suited only if group characteristics
are to be studied.
3. Statistical data are only approximations and
not mathematically correct.

Parameter: a numerical expression of population
measurements.
E.g. population mean (μ), population variance (σ²),
population standard deviation (σ), etc.
Statistic: a numerical expression of sample
measurements.
Example: sample mean, sample variance (s²) and
sample standard deviation (s).
Population: the largest collection of
entities/values of a random variable in which we
have an interest at a particular time.
Sample: some part/subset of the population of
interest.

Target population: a collection of items that have
something in common and about which we wish to draw
conclusions at a particular time.
Study (sampled) population: the part of the target
population that is actually accessible and legitimate for
data collection.
Sample: a subset of the study population, about
which information is actually obtained.

[Figure: nested diagram in which the sample lies within the study population, which lies within the target population]

Variable: a characteristic or attribute under
study that can assume different values for different
elements.
Some examples of variables include:
Diastolic blood pressure,
Heart rate, height,
Weight, and
Stage of bladder cancer, to list some

Random variable: a variable whose values
are determined by chance.
Data: the measurements or observations
(values) for a variable.
Data set: a collection of observations on a
variable.

            Mr Daniel   Mrs Sara     Mr Kedir
Age         23          35           27
Sex         M           F            M
Religion    Orthodox    Protestant   Muslim
Here Age, Sex and Religion are variables, each entry (e.g. 23) is a
value (data), and the many values taken together form the data set.

Depending on the characteristic of the
measurement, a variable can be either qualitative or
quantitative:
Qualitative (categorical) variable
A variable or characteristic which cannot be measured
in quantitative form but can only be identified by name
or category; that is, a variable that can be placed into distinct
categories according to some characteristic or
attribute.
For instance place of birth, ethnic group, type of
drug, stages of breast cancer (I, II, III, or IV), degree
of pain (low, moderate, severe).

Quantitative (Numerical) variable: Is one that
can be measured and expressed numerically.
They can be of two types
1. Discrete variable:
The values of a discrete variable are usually
whole numbers, such as the number of episodes
of diarrhea in the first five years of life.
Observations can only take certain numerical
values
Numerical discrete data occur when the
observations are integers that correspond to a
count of some sort.

Some common examples are:
The number of bacteria colonies on a plate,
The number of cells within a prescribed area
upon microscopic examination
The number of heart beats within a specified
time interval,
A mother's history of number of births
(parity) and pregnancies (gravidity),
The number of episodes of illness a patient
experiences during some time period, etc.

2. Continuous variable: is a measurement on a
continuous scale.
Each observation theoretically falls somewhere
along a continuum.
One is not restricted, in principle, to particular
values such as the integers of the discrete scale.
Most clinical measurements, such as:
Blood pressure,
Serum cholesterol level,
Height, weight, age etc. are on a numerical continuous
scale

Data comes in various sizes and shapes
Depending on the nature of variable, variables can
be measured in four different levels of scales.
Nominal scales of measurement
It may be thought of as the "naming" level. This level of
measurement does not put subjects in any particular
order.
There is no logical basis for saying one category is
higher or lower than another category. In research
activities a YES/NO scale is nominal.

The nominal level of measurement classifies
data into mutually exclusive (non-overlapping),
exhaustive categories in which no
order or ranking can be imposed on the data.

Ordinal scales of measurement
At this level we put subjects in order from lowest
to highest.
It is important to know that ranks do not tell us
by how much subjects differ.
Hence, an ordinal scale only lets you interpret
gross order and not the relative positional
distances.
Some examples of this scale of
measurement include:
Academic status, response to treatment (none,
slow, moderate, fast)

Interval scales of measurement
On an interval measurement scale, one unit on the
scale represents the same magnitude of the trait
or characteristic being measured across the whole
range of the scale.
Interval scales do not have a "true" zero point, however,
and therefore it is not possible to make
statements about how many times higher one
score is than another.
Example: temperature. The unit of measurement is
the degree, and the point of comparison is the
arbitrarily chosen "zero degrees," which does
not indicate a lack of heat.

The highest level of measurement is the ratio
scale.
This scale is characterized by the fact that equality
of ratios as well as equality of intervals may be
determined.
Fundamental to the ratio scale is a true zero point.
The measurement of such familiar traits as height,
weight, and length makes use of the ratio scale.

The ratio level of measurement possesses all
the characteristics of interval measurement,
and there exists a true zero. In addition, true
ratios exist between different units of measure.

Examples by level of measurement:
Nominal: gender, eye color, political affiliation,
religious affiliation, major field, nationality
Ordinal: grade (A, B, C, D and F), rating scale (poor, good,
excellent), ranking of tennis players
Interval: IQ, temperature
Ratio: height, weight, age, salary

There are two sources of data:
1. Primary sources: data measured or collected by the
investigator or the user directly from the source.
Two activities are often involved: planning and
measuring.
Planning is concerned with:
Identifying the source and elements of the data
Deciding whether to consider a sample or a census
If sampling is preferred, deciding on sample size, selection
method, etc.
Deciding the measurement procedure
Setting up the necessary organizational structure.
Measuring: there are different options: focus group,
telephone interview, mail questionnaires, etc.

2. Secondary sources
The data needed to answer a question may already
exist in the form of published reports, commercially
available data banks, or the research literature.
In this case data are obtained from already
collected sources like newspapers, magazines,
EDHS, hospital records and existing data such as:
Mortality reports
Morbidity reports
Epidemic reports
Reports of laboratory utilization (including laboratory test
results)

Methods of data collection

The most common modes of collecting data can
be summarized as:
Self-administered questionnaires
The use of documentary sources,
Observation
Interviews
Tape recording
Photography
Focus group discussion

Descriptive statistics
Descriptive statistics include tables,
graphical/chart displays and the calculation of
summary measures such as proportions
and averages.
The methods of describing variables differ
depending on the type of data (continuous or
categorical).

The data collected in a survey is called raw data.
In most cases, useful information is not
immediately evident from the mass of unsorted
data.
Collected data need to be organized in such a
way as to condense the information they contain
in a way that will show patterns of variation
clearly.

Pictorial description: It is describing statistical
data using frequency distributions and graphs
Numerical summaries: It is a method of
describing data using
-Measures of central tendency: like mean, median,
and mode
-Measures of dispersion: range, mean deviation,
variance, standard deviation, coefficient of
variation, etc.
-Other location parameters: percentiles and
quartiles
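
The numerical summaries listed above can be computed in a few lines. The following is a minimal sketch using Python's standard `statistics` module; the five diastolic blood pressure readings are hypothetical values chosen only for illustration and do not come from these notes.

```python
import statistics

# Hypothetical sample (assumption): diastolic blood pressure in mmHg.
dbp = [70, 80, 80, 90, 100]

# Measures of central tendency
mean = statistics.mean(dbp)
median = statistics.median(dbp)
mode = statistics.mode(dbp)

# Measures of dispersion
value_range = max(dbp) - min(dbp)
mean_deviation = sum(abs(x - mean) for x in dbp) / len(dbp)
variance = statistics.variance(dbp)    # sample variance (divisor n - 1)
std_dev = statistics.stdev(dbp)        # sample standard deviation
cv_percent = 100 * std_dev / mean      # coefficient of variation in %

# Other location parameters: the three quartiles
q1, q2, q3 = statistics.quantiles(dbp, n=4, method="exclusive")

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "mean deviation:", mean_deviation)
print("variance:", variance, "sd:", round(std_dev, 2), "CV %:", round(cv_percent, 2))
print("quartiles:", q1, q2, q3)
```

The sample versions of the variance and standard deviation (divisor n - 1) are used here because the values are treated as a sample rather than a whole population.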

Categorical variables are presented in the form of
Table (types of tables)
Frequency distribution
Relative frequency
Cumulative frequencies
Charts
Bar charts
Pie charts

A table is designed to stand alone from the text.
Since a table is intended to communicate
information, it should be easy to read and
understand.
When constructing a table, consider the following:
1. Include a clear and concise title answering the
questions what, when and where.
2. Clearly label the rows and columns.
3. Explain abbreviations in a footnote.
4. State clearly the unit of measurement.
5. It should be self-explanatory.
6. The title of the table should be placed at
the top.
7. If the data are not original, mention the source.

Quite often, the presentation of data in a
meaningful way is done by preparing a frequency
distribution.
A frequency distribution(FD): is the organization
of raw data in table form, using classes and
frequencies.
FD tables usually include frequency, relative frequency
(proportion), and cumulative frequency in addition to
the values/classes/categories of the variable.
Frequency is the number of observations in each
category
The relative frequency of a class is the proportion or
percentage of the data that falls in that class

Frequency distribution determines the number of
units (e.g., people) which fall into a series of
specified categories
The frequency is the count of the number of
times that a particular value or category occurs in a
data set.
The relative frequency is the frequency of the
event/value/category divided by the total
number of observations.
Frequency distributions can be grouped or
ungrouped.

An ungrouped (categorical) frequency distribution is used to
present a categorical variable in a
simplified and easily understandable way.
It is used for data that can be placed in
specific categories, such as nominal- or
ordinal-level data.
It can also be used for discrete numerical data
if the range is considerably small (<10).
This frequency table can be constructed by
listing all possible categories of the variable
and then counting the number of observations falling in each
category of the variable as a frequency.

Example: Suppose the blood types of 40 persons are
as follows:
O O A B A O A A A O B O B O O A O O A A A A AB A B A
A O O A O O A A A O A O O AB
Construct a frequency distribution for the data.
Solution:
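
One way to tally the 40 blood types is sketched below with Python's `collections.Counter`. The variable names are illustrative assumptions; the data are the 40 values listed in the example.

```python
from collections import Counter

# The 40 blood types from the example, split into individual observations.
blood_types = (
    "O O A B A O A A A O B O B O O A O O A A A A AB A B A "
    "A O O A O O A A A O A O O AB"
).split()

counts = Counter(blood_types)   # frequency of each category
n = len(blood_types)            # total number of observations

print(f"{'Blood type':<12}{'Frequency':>10}{'Relative frequency':>20}")
for blood_type in ("A", "B", "AB", "O"):
    freq = counts[blood_type]
    print(f"{blood_type:<12}{freq:>10}{freq / n:>20.2f}")
print(f"{'Total':<12}{n:>10}{1:>20.2f}")
```

Each relative frequency is the category count divided by n = 40, so the relative frequencies add up to 1.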

Table #: Marital status of women in a certain district
Marital status   Frequency
Single           180
Married          320
Others           100
Total            600

Presenting data in a grouped frequency
distribution is not as simple as in the
ungrouped case: some values must be computed
first. These values are given below:
Number of classes (K): the number of categories the table will
have.
The number of classes can be estimated using
Sturges' rule as:
K = 1 + 3.322 log10(n)
Where:
K = number of classes
n = sample size / total number of observations.

Then the width of each class, W, can be computed
as:
$$W = \frac{\text{largest value} - \text{smallest value}}{K}$$
Class limits: the smallest and largest values that can
go into a class; each class has a lower and an upper
class limit.
Lower class limit: the smallest observation that can
fall in the class.
Upper class limit: the lower class limit plus the
width of the class minus one (for data recorded in
whole numbers).

Class boundaries/true limits: are those limits
which are determined mathematically to make the
intervals of a continuous variable continuous in both
directions, so that no gap exists between classes. For
data recorded in whole numbers they are obtained as:
Lower class boundary = lower class limit - 0.5
Upper class boundary = upper class limit + 0.5
Relative frequency: is the frequency of each class
interval (fi) divided by the total frequency (n).

Class mark/mid-point (x_m) of an interval: is the value
of the interval which lies mid-way between the lower
true limit (LTL) and the upper true limit (UTL) of a
class.
It is calculated as the average of the lower and upper
class limits:
$$x_m = \frac{\text{upper class limit} + \text{lower class limit}}{2}$$

NB: The constructed grouped frequency
distribution is expected to satisfy the following:
Class intervals should be continuous (for
continuous data), non-overlapping (mutually
exclusive) and complete.
Class intervals should generally be of the same
width.
Open-ended class intervals should be avoided.
These are classes like less than 10, greater than
65, and so on.

Steps in grouping Data
1. Put data in ordered array
2. Choosing the number of class intervals
3. Sorting (tallying) of the data into these classes
4. Counting the number of observations (frequencies)
in each class, and
5. Displaying the results in the form of a chart or table
(optional)

Sometimes it is necessary to use a cumulative
frequency distribution.
It is a distribution that shows the number of data
values less than or equal to a specific value (usually
an upper boundary).
The values are found by adding the frequencies of
the classes less than or equal to the upper class
boundary of a specific class.
Cumulative Frequency can be Less than CF (LCF) or
Greater than CF (GCF).
LCF counts the number of observations below the
upper class boundary of a certain class, whereas GCF
counts those above the lower class boundary of a certain
class.

Example: Age of patients (years) (n=60) in a diabetic clinic in
Gondar University Hospital, January 2008
19 82 98 78 30 26 32 66 87 81 40 48 70 61 69 58 60 53
28 54 47 40 80 56 36 53 65 28 90 95 45 32 34 36 20 62
51 20 17 26 70 81 39 63 33 66 61 77 41 55 76 70 42 67
22 75 24 50 50 44
Based on this data set:
Construct a grouped frequency distribution.
Construct a histogram and a frequency polygon.

Solution
1. Arrange the data in increasing order:
17 19 20 20 22 24 26 26 28 28 30 32 32 33 34 36 36
39 40 40 41 42 44 45 47 48 50 50 51 53 53 54 55 56
58 60 61 61 62 63 65 66 66 67 69 70 70 70 75 76 77
78 80 81 81 82 87 90 95 98
2. R = H - S = 98 - 17 = 81
3. K = 1 + 3.322 log10(n) = 1 + 3.322 log10(60) = 6.91 ≈ 7
4. W = R/K = 81/6.91 = 11.7 ≈ 12

CLASS    CB           CM     FREQ   LCF   GCF   R.F    P.F
17-28    16.5-28.5    22.5   10     10    60    0.17   17%
29-40    28.5-40.5    34.5   10     20    50    0.17   17%
41-52    40.5-52.5    46.5   9      29    40    0.15   15%
53-64    52.5-64.5    58.5   11     40    31    0.18   18%
65-76    64.5-76.5    70.5   10     50    20    0.17   17%
77-88    76.5-88.5    82.5   7      57    10    0.12   12%
89-100   88.5-100.5   94.5   3      60    3     0.05   5%
Total                        60                 1.00   100%
CB = class boundaries, CM = class mark, LCF/GCF = less than/greater than
cumulative frequency, R.F = relative frequency, P.F = percentage frequency.
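
The grouping steps can also be checked programmatically. The sketch below is one possible implementation in Python, not the procedure used to produce the slides: it applies Sturges' rule, derives the class width, and builds class limits, boundaries, class marks and cumulative frequencies from the 60 ages as listed in the example. It assumes whole-number data and that the first class starts at the smallest observation; the `rows` structure is a hypothetical name.

```python
import math

# The 60 ages from the example, as listed in the raw data.
ages = [
    19, 82, 98, 78, 30, 26, 32, 66, 87, 81, 40, 48, 70, 61, 69, 58, 60, 53,
    28, 54, 47, 40, 80, 56, 36, 53, 65, 28, 90, 95, 45, 32, 34, 36, 20, 62,
    51, 20, 17, 26, 70, 81, 39, 63, 33, 66, 61, 77, 41, 55, 76, 70, 42, 67,
    22, 75, 24, 50, 50, 44,
]

n = len(ages)
data_range = max(ages) - min(ages)        # R = largest - smallest
k_exact = 1 + 3.322 * math.log10(n)       # Sturges' rule
k = round(k_exact)                        # number of classes K
width = round(data_range / k_exact)       # class width W = R / K

rows = []
lower = min(ages)                         # first lower class limit (assumption)
cumulative = 0
for _ in range(k):
    upper = lower + width - 1             # upper class limit for integer data
    freq = sum(lower <= a <= upper for a in ages)
    cumulative += freq
    rows.append({
        "class": f"{lower}-{upper}",
        "boundaries": (lower - 0.5, upper + 0.5),
        "midpoint": (lower + upper) / 2,
        "freq": freq,
        "lcf": cumulative,                    # less than cumulative frequency
        "gcf": n - cumulative + freq,         # greater than cumulative frequency
        "rel_freq": freq / n,
    })
    lower = upper + 1

for row in rows:
    print(row)
```

Counting directly from the raw values avoids transcription slips that can occur when the ordered array is written out by hand.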

Graphs are often easier to interpret than tables,
perhaps at the expense of detail.
A variety of graphs are used depending on the
type of data.
To present categorical/qualitative or
quantitative discrete data using a graph,
pie charts and bar charts are the appropriate
ones; if the variable is a
numerical/quantitative continuous variable,
we can use a histogram, frequency
polygon, cumulative frequency curve, box plot, etc.

Bar diagrams are used to represent and compare
the frequency distribution of discrete variables
and attributes or categorical series.
When we represent data using bar diagram, all
the bars must have equal width and the distance
between bars must be equal.
Each category of variable is represented by a bar
It can be displayed as horizontal or vertical
The categories are represented on the
baseline (x-axis) at regular intervals and the
corresponding frequencies or relative
frequencies are represented on the y-axis
(ordinate).

Distribution of pediatric patients in a hospital by
type of diagnosis
Diagnosis         Number of patients   Relative frequency (%)
Pneumonia         487                  48.7
Malaria           200                  20.0
Cardiac problem   168                  16.8
Malnutrition      80                   8.0
Others            65                   6.5

Fig.1: Distribution of pediatric patients in a hospital ward
by type of admitting diagnosis

•Data from two or more variables
•Distinct colors or shading are used to differentiate
•Legend is necessary
•Figure #: The distribution of vaccination over marital status
of children born from 673 mothers in Gondar, 2012
[Figure: bar chart; x-axis: Marital Status (Single, Married, Divorced); y-axis: Frequency (0 to 140); legend (TT3): Vaccinated, Not Vaccinated]

It presents two variables at a time
Each bar represents one level of each variable
Shading or colors are used to identify which part
comes from which variable
Legend is very important
Example: Consider data on immunization status of women
by marital status

It is a circle divided into sectors so that the
areas of the sectors are proportional to the
frequencies.
It is split into segments to show percentages
or the relative contributions of categories of
data.
It is a good method of representation if you
wish to compare a part of group with the
whole group.
The number of categories should not be too
many.

Example: Distribution of death for females in
England, 1989
Cause of death             Number of deaths   Percentage (%)
Circulatory system (C)     100,000            42.37
Neoplasm (N)               70,000             29.66
Respiratory system (R)     30,000             12.71
Injury and poisoning (I)   6,000              2.54
Digestive system (D)       10,000             4.24
Others (O)                 20,000             8.47
Total                      236,000            100
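
Because the area of each sector is proportional to its frequency, the sector angle for a category is its share of the total multiplied by 360 degrees. A minimal Python sketch for the table above follows; the dictionary and variable names are illustrative assumptions.

```python
# Number of deaths by cause, taken from the table above.
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())

# For each cause: (percentage of all deaths, sector angle in degrees).
sectors = {
    cause: (100 * count / total, 360 * count / total)
    for cause, count in deaths.items()
}

for cause, (percent, angle) in sectors.items():
    print(f"{cause:<22}{percent:>8.2f}%{angle:>10.1f} degrees")
```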

Fig.2: Distribution of death for females in England in 1989

Histogram: is the graph of the frequency
distribution of continuous measurement
variables.
It is constructed on the basis of the following
principles:
The horizontal axis is a continuous scale running
from one extreme end of the distribution to the
other.
It should be labeled with the name of the
variable and the units of measurement.

TABLE: Frequency Distribution of Ages of 189 Subjects
Class interval   Class boundary   Frequency   Cumulative frequency
30–39            29.5-39.5        11          11
40–49            39.5-49.5        46          57
50–59            49.5-59.5        70          127
60–69            59.5-69.5        45          172
70–79            69.5-79.5        16          188
80–89            79.5-89.5        1           189
Total                             189
Based on this data set we can construct the histogram.

Figure 3: Histogram of age of 189 subjects

A frequency distribution can be portrayed graphically
in yet another way by means of a frequency polygon,
which is a special kind of line graph
To draw a frequency polygon we first place a dot
above the midpoint of each class interval represented
on the horizontal axis of a graph like the one shown in
Figure 3.
The height of a given dot above the horizontal axis
corresponds to the frequency of the relevant class
interval.
Connecting the dots by straight lines produces the
frequency polygon.

Fig.4: Frequency polygon of age of
189 subjects
Fig.5: Histogram and frequency
polygon of age of 189 subjects

The distribution of the blood lead level of 88 individuals
Blood lead level   Frequency
19.5-22.5          4
22.5-25.5          12
25.5-28.5          19
28.5-31.5          21
31.5-34.5          16
34.5-37.5          13
37.5-40.5          3
[Figure: plot of this distribution; x-axis: Blood lead level, with ticks at the class boundaries 19.5 to 40.5]

Frequency polygons are superior to histograms for
comparing two or more sets of data.

The horizontal axis displays the different
categories/intervals
The vertical axis displays cumulative (relative)
frequency.
A point is placed at the true upper limit of each
interval; the height represents the cumulative relative
frequency associated with that interval. The points are
then connected by straight lines.
Like frequency polygons, cumulative frequency curve
may be used to compare sets of data.
Cumulative frequency curve can also be used to obtain
percentiles of a set of data.

Cumulative relative frequency curve for the blood lead
level of study participants
[Figure: y-axis: cumulative frequency (proportion of individuals)]
The graph begins at the lower
boundary of the first class.
The graph ends at the upper
boundary of the last class.
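
The points of this curve, and a percentile read off it, can be sketched in Python as follows. The code uses the blood lead level frequencies given earlier; reading a percentile by linear interpolation mirrors the straight lines that join the points, and the function name `percentile` is an illustrative assumption.

```python
from itertools import accumulate

# Class boundaries and frequencies of the blood lead level distribution.
boundaries = [19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5]
frequencies = [4, 12, 19, 21, 16, 13, 3]

n = sum(frequencies)

# Cumulative relative frequency at each boundary; the curve starts at 0
# at the lower boundary of the first class.
cum_rel = [0.0] + [c / n for c in accumulate(frequencies)]

def percentile(p):
    """Read the p-th percentile (0 <= p <= 100) off the curve by linear interpolation."""
    target = p / 100
    for i in range(1, len(cum_rel)):
        if target <= cum_rel[i]:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = cum_rel[i - 1], cum_rel[i]
            return x0 + (target - y0) / (y1 - y0) * (x1 - x0)
    return boundaries[-1]

for x, y in zip(boundaries, cum_rel):
    print(f"{x:>5}  {y:.3f}")
print("median (50th percentile):", round(percentile(50), 2))
```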

•A visual picture called a box (box-and-whisker) plot can
be used to convey a fair amount of information about
the distribution of a set of data.
•It is used as an exploratory data analysis tool
•The box shows the distance between the first and the
third quartiles,
•The median is marked as a line within the box and
•The end lines show the minimum and maximum values
respectively

A box plot displays the five-number summary:
The minimum entry
Q1
Q2 (median)
Q3
The maximum entry
The quartiles are the values which divide the
distribution into four parts such that there are an
equal number of observations in each part.
Q1 = [(n+1)/4]th ordered observation
Q2 = [2(n+1)/4]th ordered observation
Q3 = [3(n+1)/4]th ordered observation

Example: Use the following age data of 15 patients to
draw a box-and-whisker plot.
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
[Figure: box-and-whisker plot marking Min, Q1, Q2, Q3 and Max]
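
The five-number summary for this example can be obtained with the position rule k(n+1)/4. The Python sketch below is one possible implementation; interpolating between neighbouring observations when the position is not a whole number is an assumption (the notes do not specify it), and it is not needed for n = 15.

```python
def quartile(sorted_values, k):
    """Return the k-th quartile using the position k(n + 1)/4 (1-based).

    Assumes sorted_values is in increasing order and has at least 3 values.
    Fractional positions are resolved by linear interpolation (assumption).
    """
    position = k * (len(sorted_values) + 1) / 4
    lower_index = int(position)              # whole part of the position
    fraction = position - lower_index        # fractional part of the position
    lower = sorted_values[lower_index - 1]   # convert to 0-based indexing
    if fraction == 0:
        return lower
    upper = sorted_values[lower_index]
    return lower + fraction * (upper - lower)

# Ages of the 15 patients from the example, already in increasing order.
ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

five_number_summary = (
    min(ages),
    quartile(ages, 1),
    quartile(ages, 2),
    quartile(ages, 3),
    max(ages),
)
print(five_number_summary)  # (35, 37, 43, 48, 55)
```

With n = 15 the positions are the 4th, 8th and 12th ordered observations, so the box runs from 37 to 48 with the median line at 43, and the whiskers end at 35 and 55.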

Notice the distribution of data in each quarter
(distance between quartiles).
