This module briefly introduces students to basic statistical principles and methods for generating evidence about the general population from a sample.
Size: 1.93 MB
Language: en
Added: Oct 31, 2025
Slides: 81 pages
Slide Content
University of Gondar
College of Medicine and Health science
Department of Epidemiology and Biostatistics
Biostatistics
By: Tilahun Y
February 2022
Chapter one: Introduction to Biostatistics
Basic statistical concepts
Classification of statistics
Types of variables
Application and limitation of statistics
Chapter two: Method of data collection and
presentation
Methods of data collection
Methods of data organization and
presentation
Data organization and presentation using
SPSS software
Chapter three: Summary measures
Measures of central tendency
Measures of dispersion
Measures of shape
Chapter four: Probability and probability
distributions
Definition of basic terms in probability
Set theory and probability
Types of probability
Random variable and probability distribution
Common probability distributions
Chapter five: Sampling methods and
sampling Distribution
Common terms used in sampling
Sampling methods
Errors in sampling
Sampling distribution
Chapter six: Statistical inference
Sampling distribution and its properties
Estimation
Hypothesis testing
Paired and independent sample t-tests
Concept of analysis of variance
Basic assumptions and application of
ANOVA
Chapter seven: Correlation and linear regression
Analysis of Correlation
Simple Linear regression model
Multiple linear regression
Model diagnostic tests
Chapter Eight: Categorical data analysis
Concept of categorical data analysis
Chi-square test for Categorical data analysis
Binary logistic regression
Poisson regression
Negative Binomial regression
Zero-inflated Poisson regression
Chapter ten: Survival Data Analysis
Basic concepts of survival analysis
Non parametric methods
Cox proportional hazard model
Extended Cox proportional hazard models
Concepts of non-parametric method
Types of non-parametric test
Applications of Non-Parametric test
Introduction to Biostatistics
Objective
Describe basic statistical concepts
Classify a given statistical statement as
descriptive or inferential
Identify types of variables
Describe application of statistics in
medicine/public health
The word 'Statistics' can be defined in two
senses
a. In the plural sense:
Statistics are the raw data themselves, like
statistics of births, statistics of deaths, statistics
of students, statistics of imports and exports, etc.
b. In the singular sense:
Statistics is the subject that deals with the
collection, organization, presentation, analysis
and interpretation of numerical data
Biostatistics: the field of statistics in which
statistical methods are widely used in medical and
biological studies.
Depending on how the data can be used, statistics
can be classified into the following branches;
1. Descriptive Statistics: is concerned with the
collection, organization, summarization, and
presentation of data.
2. Inferential Statistics: consists of generalizing from
samples to populations, performing estimations and
hypothesis tests, determining relationships among
variables, and making predictions.
Inference is important because statistical data usually arise from a
sample.
Statistical techniques based on probability theory are therefore
required.
2/21/2022 11
Main reason for using statistics: handling variation
Biological variation
Among individuals as well as within the same
individual over time
Example: height, weight, blood pressure, eye color
...
Sample variation: Biomedical research projects
are usually carried out on small numbers of study
subjects
Essential for scientific method of investigation
Formulate hypothesis
Design study to objectively test hypothesis
Collect reliable and unbiased data
Process and evaluate data rigorously
Interpret and draw appropriate conclusions
Limitations of statistics
1. It deals with only those subjects of inquiry
that are capable of being quantitatively
measured and numerically expressed.
2. It deals with aggregates of facts and no
importance is attached to individual items.
– It is suited only if the group characteristics
are to be studied.
3. Statistical data are only approximations and
not mathematically correct.
Parameter: a numerical summary of population
measurements.
E.g. population mean (μ), population variance (σ²),
population standard deviation (σ), etc.
Statistic: a numerical summary of sample
measurements.
Example: sample mean, sample variance (s²) and
standard deviation (s).
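The difference between a parameter and a statistic can be illustrated with a short Python sketch. The data values below are hypothetical and chosen only for easy arithmetic. The sketch assumes the usual convention that the population variance divides by N while the sample variance divides by n − 1, which these slides do not state explicitly.

```python
import statistics

# Hypothetical measurements (not from the slides), used only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# If the values are the whole population: parameters (divide by N).
mu = statistics.mean(values)
sigma_sq = statistics.pvariance(values)
sigma = statistics.pstdev(values)

# If the values are a sample from a larger population: statistics (divide by n - 1).
x_bar = statistics.mean(values)
s_sq = statistics.variance(values)
s = statistics.stdev(values)

print("population: mean", mu, "variance", sigma_sq, "sd", sigma)
print("sample:     mean", x_bar, "variance", round(s_sq, 3), "sd", round(s, 3))
```

The mean is the same in both cases; only the variance and standard deviation change with the divisor.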
Population: Is the largest collection of
entities/values of a random variable for which we
have an interest at a particular time.
Sample: It is some part/subset of population of
interest.
Target Population: A collection of items that have
something in common for which we wish to draw
conclusions at a particular time.
Study (sampled) Population: is a part of the target
population that is actually accessible and legitimate for
data collection.
Sample: A subset of a study population, about
which information is actually obtained. It is a part of a
population.
[Figure: diagram of the target population, the study population, and the sample]
Variable: a characteristic or attribute under
study that can assume different values for different
elements.
Some examples of variables include:
Diastolic blood pressure,
Heart rate, height,
The weight and
Stage of bladder cancer to list some
Random variable: a variable whose values
are determined by chance.
Data: the measurements or observations
(values) for a variable
Data set: a collection of observations on a
variable.
            Mr Daniel   Mrs Sara     Mr Kedir
Age         23          35           27
Sex         M           F            M
Religion    Orthodox    Protestant   Muslim
Each row label (Age, Sex, Religion) is a variable, each cell is a datum (a single value), and the whole table is a data set (many values).
Depending on the characteristic of the
measurement, a variable can be either qualitative or
quantitative:
Qualitative (Categorical) variable
A variable or characteristic which cannot be measured
in quantitative form but can only be identified by name
or category; that is, a variable that can be placed into distinct
categories according to some characteristic or
attribute.
For instance place of birth, ethnic group, type of
drug, stages of breast cancer (I, II, III, or IV), degree
of pain (low, moderate, severe).
Quantitative (Numerical) variable: Is one that
can be measured and expressed numerically.
They can be of two types
1. Discrete variable:
The values of a discrete variable are usually
whole numbers, such as the number of episodes
of diarrhea in the first five years of life.
Observations can only take certain numerical
values
Numerical discrete data occur when the
observations are integers that correspond with a
count of some sort.
Some common examples are:
The number of bacteria colonies on a plate,
The number of cells within a prescribed area
upon microscopic examination
The number of heart beats within a specified
time interval,
A mother’s history of numbers of births
(parity) and pregnancies (gravidity),
The number of episodes of illness a patient
experiences during some time period, etc.
2. Continuous variable: is a measurement on a
continuous scale
Each observation theoretically falls somewhere
along a continuum.
One is not restricted, in principle, to particular
values such as the integers of the discrete scale.
Most clinical measurements, such as:
Blood pressure,
Serum cholesterol level,
Height, weight, age etc. are on a numerical continuous
scale
Data come in various sizes and shapes
Depending on the nature of the variable, variables can
be measured on four different scales (levels of measurement).
Nominal scales of measurement
It may be thought of as the "naming" level. This level of
measurement does not put subjects in any particular
order.
There is no logical basis for saying one category is
higher or lower than another category. In research
activities a YES/NO scale is nominal.
The nominal level of measurement classifies
data into mutually exclusive (non-overlapping),
exhaustive categories in which no
order or ranking can be imposed on the data
Ordinal scales of measurement
At this level we put subjects in order from lowest
to highest.
It is important to know that ranks do not tell us
by how much subjects differ.
Hence, an ordinal scale only lets you interpret
gross order and not the relative positional
distances.
Some examples of this scale of
measurement include:
Academic status, response to treatment (none,
slow, moderate, fast)
Interval scales of measurement
On an interval measurement scale, one unit on the
scale represents the same magnitude of the trait
or characteristic being measured across the whole
range of the scale.
Interval scales do not have a "true" zero point, however,
and therefore it is not possible to make
statements about how many times higher one
score is than another.
Example: temperature. The unit of measurement is
the degree, and the point of comparison is the
arbitrarily chosen “zero degrees,” which does
not indicate a lack of heat
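A small arithmetic sketch in Python can show why ratios are not meaningful on an interval scale while differences are. It assumes the temperature is recorded in degrees Celsius (the slides do not name the unit) and compares it with the Kelvin scale, which does have a true zero. The two readings are hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return celsius + 273.15

# Two hypothetical temperature readings in degrees Celsius.
cool, warm = 10.0, 20.0

# Differences are preserved when the zero point moves.
diff_celsius = warm - cool
diff_kelvin = celsius_to_kelvin(warm) - celsius_to_kelvin(cool)

# Ratios are not: "twice as hot" holds only on the arbitrary Celsius zero.
ratio_celsius = warm / cool
ratio_kelvin = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print("difference:", diff_celsius, "vs", round(diff_kelvin, 2))
print("ratio:", ratio_celsius, "vs", round(ratio_kelvin, 3))
```

The difference is 10 degrees on both scales, but the ratio drops from 2 to about 1.035 once the zero point is no longer arbitrary.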
The highest level of measurement is the ratio
scale.
This scale is characterized by the fact that equality
of ratios as well as equality of intervals may be
determined.
Fundamental to the ratio scale is a true zero point.
The measurement of such familiar traits as height,
weight, and length makes use of the ratio scale.
The ratio level of measurement possesses all
the characteristics of interval measurement,
and there exists a true zero. In addition, true
ratios exist between different units of measure.
Gender
Grade(A, B, C, D and F )
Rating scale(poor, good,
excellent)
Eye color
Political affiliation
Religious affiliation
Ranking of tennis players
Major field
Nationality
Height
Weight
Age
IQ
Temperature
Salary
There are two sources of data:
1. Primary sources: Data measured or collected by the
investigator or the user directly from the source.
Two activities are often involved: planning and
measuring.
Planning: Concerned with
Identifying the source and elements of the data
Deciding whether to consider a sample or a census
If sampling is preferred, deciding on sample size, selection
method, etc.
Deciding the measurement procedure
Setting up the necessary organizational structure.
Measuring: there are different options: Focus Group,
Telephone Interview, Mail Questionnaires etc.
2. Secondary sources
The data needed to answer a question may already
exist in the form of published reports, commercially
available data banks, or the research literature.
In this case data are obtained from already
collected sources like newspapers, magazines,
EDHS, hospital records and existing data like:
Mortality reports
Morbidity reports
Epidemic reports
Reports of laboratory utilization (including laboratory test
results)
Method of
data collection
The most common modes of collecting data can
be summarized as:
Self-administered questionnaires
The use of documentary sources,
Observation
Interviews
Tape recording
Photography
Focus group discussion
Descriptive statistics include tables,
graphical/chart displays and calculation of
summary measures such as proportions
and averages
The methods of describing variables differ
depending on the type of data (continuous or
categorical).
Descriptive statistics
The data collected in a survey is called raw data.
In most cases, useful information is not
immediately evident from the mass of unsorted
data.
Collected data need to be organized in such a
way as to condense the information they contain
in a way that will show patterns of variation
clearly.
Pictorial description: It is describing statistical
data using frequency distributions and graphs
Numerical summaries: It is a method of
describing data using
-Measures of central tendency: like mean, median,
and mode
-Measures of dispersion: range, mean deviation,
variance, standard deviation, coefficient of
variation, etc.
-Other location parameters: percentiles and
quartiles
Categorical variables are presented in the form of
Table (types of tables)
Frequency distribution
Relative frequency
Cumulative frequencies
Charts
Bar charts
Pie charts
A table is designed to stand alone from the text.
Since a table is intended to communicate
information, it should be easy to read and
understand.
When we construct a table, consider the following:
1. Include a clear and concise title answering the
questions what, when and where.
2. Clearly label the rows and columns
3. Explain abbreviations in the footnote
4. State clearly the unit of measurement
5. It should be self-explanatory
6. The title of the table should be placed at
the top
7. If data are not original, mention the source
Quite often, the presentation of data in a
meaningful way is done by preparing a frequency
distribution.
A frequency distribution (FD): is the organization
of raw data in table form, using classes and
frequencies.
FD tables usually include frequency, relative frequency
(proportion), and cumulative frequency in addition to
the values/classes/categories of the variable.
Frequency is the number of observations in each
category
The relative frequency of a class is the proportion or
percentage of the data that falls in that class
Frequency distribution determines the number of
units (e.g., people) which fall into a series of
specified categories
The frequency is the count of the number of
times that a particular value or category occurs in the
data
The relative frequency is the frequency of the
event/value/category divided by the total
number of observations
Frequency distribution can be grouped or
ungrouped
An ungrouped frequency distribution is used to present a categorical variable in a
simplified and easily understandable way
It is used for data that can be placed in
specific categories, such as nominal- or
ordinal-level data.
It can also be used for discrete numerical data
if the range is considerably small (<10)
This frequency table can be constructed by
listing all possible categories of the variable
and then counting the number of observations falling in each
category of the variable as a frequency.
Example: Suppose the blood types of 40 persons are
as follows:
O O A B A O A A A O B O B O O A O O A A A A AB A B A
A O O A O O A A A O A O O AB
Construct a frequency distribution for the data.
Solution:
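One way to obtain the counts is the following minimal Python sketch. It tallies the 40 blood types listed above with `collections.Counter` and prints the frequency and relative frequency of each category. The variable names and the order of the categories are illustrative choices, not part of the original slides.

```python
from collections import Counter

# The 40 observations from the example, split into individual labels.
blood_types = (
    "O O A B A O A A A O B O B O O A O O A A A A AB A B A "
    "A O O A O O A A A O A O O AB"
).split()

n = len(blood_types)
freq = Counter(blood_types)

# Ungrouped frequency distribution: one row per category.
print(f"{'Type':<6}{'Freq':>6}{'Rel. freq':>12}")
for blood_type in ["A", "B", "AB", "O"]:
    count = freq[blood_type]
    print(f"{blood_type:<6}{count:>6}{count / n:>12.1%}")
print(f"{'Total':<6}{n:>6}")
```

The frequencies should add up to the total number of observations (40), which is a quick check on the tally.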
Table #: Marital status of women in a certain district
Marital status   Freq
Single           180
Married          320
Others           100
Total            600
Presenting data using a grouped frequency
distribution is not as simple as presenting
ungrouped data.
In this case we need to compute some values.
These values are given below:
Number of classes (K): The number of categories the table will
have
The number of classes can be estimated using
Sturges' rule as:
$K = 1 + 3.322 \log_{10}(n)$
Where:
K = number of classes
n = sample size / total number of observations.
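As a rough sketch, Sturges' rule can be evaluated in Python as below. The function name `sturges_classes` is hypothetical, and rounding the result up to the next whole number is an assumption made here; the slides do not say how a fractional K should be rounded.

```python
import math

def sturges_classes(n):
    """Estimate the number of classes as K = 1 + 3.322 * log10(n).

    The result is rounded up to a whole number (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("n must be a positive number of observations")
    return math.ceil(1 + 3.322 * math.log10(n))

# For 60 observations: 1 + 3.322 * log10(60) is about 6.9, so K = 7.
print(sturges_classes(60))
```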
Then the width of each class, W, can be computed
as:
$W = \dfrac{\text{Largest value} - \text{Smallest value}}{K}$
Class limits: The smallest
and largest values that can go into a class; each class
has a lower and an upper class limit.
Lower class limit: Smallest observation of the
class
Upper class limit: Lower class limit plus the
width of the class minus one
Class Boundaries/True Limits: are those limits
which are determined mathematically to make the
intervals of a continuous variable continuous in both
directions, so that no gap exists between classes. They are
obtained by subtracting 0.5 from the lower class limit and adding 0.5 to the
upper class limit:
Lower class boundary = lower class limit − 0.5
Upper class boundary = upper class limit + 0.5
Relative frequency: is the frequency of each class
interval (fi) divided by the total frequency (n).
Class mark/Mid-point (X_m) of an interval: is the value
of the interval which lies mid-way between the lower
true limit (LTL) and the upper true limit (UTL) of a
class.
It is calculated as the average of the lower and upper
class limits:
$X_m = \dfrac{\text{Lower class limit} + \text{Upper class limit}}{2}$
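The class boundaries and class marks can be computed with a few lines of Python. The sketch below uses the class limits 30–39, 40–49, ..., 80–89 that appear in the age table of 189 subjects in these slides. The function names are hypothetical, and the 0.5 adjustment assumes the data are recorded to the nearest whole unit.

```python
# Class limits taken from the age table of 189 subjects in these slides.
class_limits = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89)]

def class_boundaries(lower_limit, upper_limit, half_unit=0.5):
    """Return (lower boundary, upper boundary) for data recorded in whole units."""
    return lower_limit - half_unit, upper_limit + half_unit

def class_mark(lower_limit, upper_limit):
    """Return the mid-point of the class: the average of its two limits."""
    return (lower_limit + upper_limit) / 2

for lower, upper in class_limits:
    lcb, ucb = class_boundaries(lower, upper)
    print(f"{lower}-{upper}: boundaries {lcb}-{ucb}, class mark {class_mark(lower, upper)}")
```

Averaging the two class boundaries gives the same class mark, because the 0.5 subtracted from one side is added to the other.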
NB: The constructed grouped frequency
distribution is expected to satisfy the following:
Class intervals should be continuous (for
continuous data), non-overlapping (mutually
exclusive) and complete.
Class intervals should generally be of the same
width
Open-ended class intervals should be avoided.
These are classes like less than 10, greater than
65, and so on.
Steps in grouping Data
1. Put data in ordered array
2. Choosing the number of class intervals
3. Sorting (tallying) of the data into these classes
4. Counting the number of observations (frequencies)
in each class, and
5. Displaying the results in the form of a chart or table
(optional)
Sometimes it is necessary to use a cumulative
frequency distribution.
It is a distribution that shows the number of data
values less than or equal to a specific value (usually
an upper boundary).
The values are found by adding the frequencies of
the classes less than or equal to the upper class
boundary of a specific class.
Cumulative frequency can be Less than CF (LCF) or
Greater than CF (GCF).
LCF counts the number of observations below the
upper class boundary of a certain class, whereas GCF
counts those above the lower class boundary of a certain
class.
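The two kinds of cumulative frequency can be sketched in Python as running totals taken from opposite ends of the table. The class frequencies below are those of the age table of 189 subjects given with the histogram example in these slides. The variable names are illustrative.

```python
from itertools import accumulate

# Classes and frequencies from the age table of 189 subjects in these slides.
classes = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
frequencies = [11, 46, 70, 45, 16, 1]

# LCF: running total from the first class down to each class.
lcf = list(accumulate(frequencies))

# GCF: running total from the last class up to each class.
gcf = list(accumulate(reversed(frequencies)))[::-1]

print(f"{'Class':<8}{'Freq':>6}{'LCF':>6}{'GCF':>6}")
for label, f, less, greater in zip(classes, frequencies, lcf, gcf):
    print(f"{label:<8}{f:>6}{less:>6}{greater:>6}")
```

The last LCF value and the first GCF value both equal the total number of observations.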
Example: Age of patients (years) (n=60) in a diabetic clinic in
Gondar University Hospital, January 2008
19 82 98 78 30 26 32 66 87 81 40 48 70 61 69 58 60 53
28 54 47 40 80 56 36 53 65 28 90 95 45 32 34 36 20 62
51 20 17 26 70 81 39 63 33 66 61 77 41 55 76 70 42 67
22 75 24 50 50 44
Based on this data set:
Construct a grouped frequency distribution.
Construct a histogram and a frequency polygon.
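A possible way to work through the first part of this exercise is sketched below in Python. It follows the steps described in these slides: estimate the number of classes with Sturges' rule, compute the class width, then tally the observations. Two choices are assumptions of this sketch rather than instructions from the slides: K and W are both rounded up to whole numbers, and the first class starts at the smallest observation.

```python
import math

ages = [
    19, 82, 98, 78, 30, 26, 32, 66, 87, 81, 40, 48, 70, 61, 69, 58, 60, 53,
    28, 54, 47, 40, 80, 56, 36, 53, 65, 28, 90, 95, 45, 32, 34, 36, 20, 62,
    51, 20, 17, 26, 70, 81, 39, 63, 33, 66, 61, 77, 41, 55, 76, 70, 42, 67,
    22, 75, 24, 50, 50, 44,
]

n = len(ages)
k = math.ceil(1 + 3.322 * math.log10(n))        # Sturges' rule, rounded up
width = math.ceil((max(ages) - min(ages)) / k)  # class width, rounded up

rows = []
cumulative = 0
lower = min(ages)
for _ in range(k):
    upper = lower + width - 1                    # upper class limit
    freq = sum(lower <= age <= upper for age in ages)
    cumulative += freq
    # (lower limit, upper limit, lower boundary, upper boundary, freq, LCF)
    rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    lower = upper + 1

print(f"n = {n}, K = {k}, W = {width}")
for lo, hi, lcb, ucb, freq, lcf in rows:
    print(f"{lo:>3}-{hi:<3}  {lcb:>5}-{ucb:<5}  f = {freq:>2}  rel = {freq / n:.3f}  LCF = {lcf:>2}")
```

The class boundaries in the output are the edges of the bars for the histogram, and the midpoint of each class gives the x-position of the corresponding point on the frequency polygon.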
Graphs are often easier to interpret than tables,
perhaps at the expense of detail.
A variety of graphs are used depending on the
type of data
If we want to present categorical/qualitative or
quantitative discrete data using a graph,
then the pie chart and bar chart are the appropriate
ones; however, if the variable is
a numerical/quantitative continuous variable,
then we can use a histogram, frequency
polygon, cumulative frequency curve, box plot…
Bar diagrams are used to represent and compare
the frequency distribution of discrete variables
and attributes or categorical series.
When we represent data using bar diagram, all
the bars must have equal width and the distance
between bars must be equal.
Each category of variable is represented by a bar
It can be displayed as horizontal or vertical
The categories are represented on the
baseline (x-axis) at regular intervals and the
corresponding frequencies or relative
frequencies are represented on the y-axis
(ordinate)
Table: Distribution of pediatric patients in a hospital by
type of diagnosis
Diagnosis         Number of patients   Relative frequency (%)
Pneumonia         487                  48.7
Malaria           200                  20.0
Cardiac problem   168                  16.8
Malnutrition      80                   8.0
Others            65                   6.5
Fig.1: Distribution of pediatric patients in a hospital ward
by type of admitting diagnosis
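The relative frequencies in the table above, which are the heights of the bars when the chart is drawn on a percentage scale, can be reproduced with a short Python sketch. The dictionary and variable names are illustrative.

```python
# Number of patients per diagnosis, taken from the table above.
patients = {
    "Pneumonia": 487,
    "Malaria": 200,
    "Cardiac problem": 168,
    "Malnutrition": 80,
    "Others": 65,
}

total = sum(patients.values())

# Relative frequency (%) = frequency / total number of observations * 100.
relative = {diagnosis: 100 * count / total for diagnosis, count in patients.items()}

for diagnosis, count in patients.items():
    print(f"{diagnosis:<16}{count:>5}{relative[diagnosis]:>8.1f}%")
print(f"{'Total':<16}{total:>5}")
```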
•Data from two or more variables
•Distinct colors or shading are used to differentiate
•A legend is necessary
•Figure #: The distribution of vaccination over marital status
of children born from 673 mothers in Gondar, 2012
[Figure: bar chart of frequency (y-axis, 0 to 140) by marital status (x-axis: Single, Married, Divorced); legend: Vaccinated, Not Vaccinated, TT3]
It presents two variables at a time
Each bar represents one level from each variable
Shading or colors are used to identify which part
comes from which variable
A legend is very important
Example: Consider data on immunization status of women
by marital status
A pie chart is a circle divided into sectors so that the
areas of the sectors are proportional to the
frequencies.
It is split into segments to show percentages
or the relative contributions of categories of
data.
It is a good method of representation if you
wish to compare a part of a group with the
whole group.
The number of categories should not be too
large.
Example: Distribution of deaths for females in
England, 1989
Cause of death             Number of deaths   Percentage (%)
Circulatory system (C)     100,000            42.37
Neoplasm (N)               70,000             29.66
Respiratory system (R)     30,000             12.71
Injury and poisoning (I)   6,000              2.54
Digestive system (D)       10,000             4.24
Others (O)                 20,000             8.47
Total                      236,000            100
Fig.2: distribution of death for females in England in 1989
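Because the sector areas of a pie chart are proportional to the frequencies, each sector's angle is its share of 360 degrees. The following Python sketch computes the percentages and angles for the causes of death listed above. The names used are illustrative, and the angle formula (360 × frequency / total) is the standard construction rather than something stated on the slide.

```python
# Number of deaths by cause, taken from the example table.
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())

# Share of the whole, as a percentage and as a sector angle in degrees.
percent = {cause: 100 * count / total for cause, count in deaths.items()}
angle = {cause: 360 * count / total for cause, count in deaths.items()}

for cause in deaths:
    print(f"{cause:<22}{percent[cause]:>7.2f}%{angle[cause]:>9.1f} deg")
```

The angles add up to 360 degrees, just as the percentages add up to 100.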
Histogram: is the graph of the frequency
distribution of continuous measurement
variables.
It is constructed on the basis of the following
principles:
The horizontal axis is a continuous scale running
from one extreme end of the distribution to the
other.
It should be labeled with the name of the
variable and the units of measurement.
TABLE: Frequency Distribution of Ages of 189 Subjects
Class interval   Class boundary   Frequency   Cumulative frequency
30–39            29.5–39.5        11          11
40–49            39.5–49.5        46          57
50–59            49.5–59.5        70          127
60–69            59.5–69.5        45          172
70–79            69.5–79.5        16          188
80–89            79.5–89.5        1           189
Total                             189
Based on this data set we can construct the histogram
Figure 3: Histogram of age of 189 subjects
A frequency distribution can be portrayed graphically
in yet another way by means of a frequency polygon,
which is a special kind of line graph
To draw a frequency polygon we first place a dot
above the midpoint of each class interval represented
on the horizontal axis of a graph like the one shown in
Figure 3.
The height of a given dot above the horizontal axis
corresponds to the frequency of the relevant class
interval.
Connecting the dots by straight lines produces the
frequency polygon.
Fig. 4: Frequency polygon of age of 189 subjects
Fig. 5: Histogram and frequency polygon of age of 189 subjects
The distribution of the blood lead level of 88 individuals
Blood lead level   Frequency
19.5–22.5          4
22.5–25.5          12
25.5–28.5          19
28.5–31.5          21
31.5–34.5          16
34.5–37.5          13
37.5–40.5          3
[Figure: graph of the blood lead level distribution; x-axis: blood lead level, with class boundaries 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5]
Frequency polygons are superior to histograms for
comparing two or more sets of data.
Cumulative frequency curve
The horizontal axis displays the different
categories/intervals
The vertical axis displays cumulative (relative)
frequency.
A point is placed at the true upper limit of each
interval; the height represents the cumulative relative
frequency associated with that interval. The points are
then connected by straight lines.
Like frequency polygons, cumulative frequency curves
may be used to compare sets of data.
A cumulative frequency curve can also be used to obtain
percentiles of a set of data.
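Reading a percentile off a cumulative frequency curve amounts to linear interpolation between the plotted points. The Python sketch below does this for the blood lead level data of 88 individuals given in these slides. The function name `percentile_from_curve` is hypothetical, and the sketch assumes the curve starts at zero at the lower boundary of the first class.

```python
from itertools import accumulate

# Blood lead level data from these slides: class boundaries and frequencies.
first_lower_boundary = 19.5
upper_boundaries = [22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5]
frequencies = [4, 12, 19, 21, 16, 13, 3]

n = sum(frequencies)
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / n for c in cumulative]

# Points of the curve: (boundary, cumulative relative frequency).
xs = [first_lower_boundary] + upper_boundaries
ys = [0.0] + cumulative_relative

def percentile_from_curve(p):
    """Return the value below which a proportion p (0 to 1) of the data falls."""
    for i in range(len(xs) - 1):
        if ys[i] <= p <= ys[i + 1]:
            # Linear interpolation along the straight line joining two points.
            return xs[i] + (p - ys[i]) / (ys[i + 1] - ys[i]) * (xs[i + 1] - xs[i])
    raise ValueError("p must be between 0 and 1")

print("median (50th percentile):", round(percentile_from_curve(0.50), 2))
print("25th percentile:", round(percentile_from_curve(0.25), 2))
```

On a drawn curve this corresponds to moving horizontally from the chosen proportion on the vertical axis to the curve, then straight down to the horizontal axis.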
Figure: Cumulative relative frequency curve for the blood lead
level of study participants (y-axis: cumulative frequency, proportion of individuals)
The graph begins at the lower
boundary of the first class.
The graph ends at the upper
boundary of the last class.
•A visual picture called a box (box-and-whisker) plot can
be used to convey a fair amount of information about
the distribution of a set of data.
•It is used as an exploratory data analysis tool
•The box shows the distance between the first and the
third quartiles,
•The median is marked as a line within the box and
•The end lines show the minimum and maximum values
respectively
A box plot displays the five-number summary:
The minimum entry
Q1
Q2 (median)
Q3
The maximum entry
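The quartiles are values which divide the
distribution into four parts such that there are an
equal number of observations in each part.
$Q_1 = \left[\dfrac{n+1}{4}\right]^{\text{th}}$ value
$Q_2 = \left[\dfrac{2(n+1)}{4}\right]^{\text{th}}$ value
$Q_3 = \left[\dfrac{3(n+1)}{4}\right]^{\text{th}}$ value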
Example: Use the following age data of 15 patients to
draw a box-and-whisker plot.
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
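The five-number summary needed for this box plot can be worked out with the position formulas above. The following Python sketch is one way to do it. The function name `quartile` is hypothetical, and interpolating between neighbouring values when the position is not a whole number is an assumption of this sketch (it is not needed for n = 15, where the positions are 4, 8 and 12).

```python
ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

def quartile(sorted_values, k):
    """Return the k-th quartile (k = 1, 2, 3) at position k(n + 1)/4.

    Positions are 1-based in the ordered data. A fractional position is
    handled by linear interpolation (an assumption of this sketch).
    """
    n = len(sorted_values)
    position = k * (n + 1) / 4
    whole = int(position)
    fraction = position - whole
    lower_value = sorted_values[whole - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_values[whole]
    return lower_value + fraction * (upper_value - lower_value)

ordered = sorted(ages)
five_number_summary = (
    ordered[0],
    quartile(ordered, 1),
    quartile(ordered, 2),
    quartile(ordered, 3),
    ordered[-1],
)
print(five_number_summary)
```

The box then runs from Q1 to Q3 with a line at Q2, and the end lines are drawn at the minimum and maximum.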
[Figure: box-and-whisker plot of the age data, marking Min, Q1, Q2, Q3 and Max]
Notice the distribution of data in each quarter (distance between quartiles)