Biostatistics notes for Masters in Public Health

135 slides · Dec 31, 2024

About This Presentation

Notes on biostatistics


Slide Content

INTRODUCTION TO BIOSTATISTICS Dr. Higenyi
Emmanuel (PhD)

SCOPE
Part 1 Introduction
•Definitions
•Importance of statistics
•Application of biostatistics
•Statistical notations
•Types of data
•Variables
•Sources of data
•Data presentation
•Data summarization
•Sampling
•Probability
Part 2 Basic statistical data analysis
•t-test
•z-test
•Binomial test
•Chi-square test
•Fisher exact test
•Correlation
•Simple linear regression

PART 1: DEFINITIONS
Statistics
•The study and manipulation of data, including ways to gather, review,
analyze, and draw conclusions from data.
•The two major areas of statistics are descriptive and inferential statistics.
•Statistics can be communicated at different levels ranging from non-numerical
descriptor (nominal-level) to numerical in reference to a zero-point (ratio-
level).
•Several sampling techniques can be used to compile statistical data,
including simple random, systematic, stratified, or cluster sampling.
•Statistics are present in almost every department of every company and are
an integral part of investing.

PART 1: DEFINITIONS
Biostatistics or biometry
•Branch of biological science concerned with the study and methods for
collecting, presenting, analysing and interpreting biological research
data.
•The primary aim of this branch of science is to allow researchers, health
care providers and public health administrators to make decisions
concerning a population using sample data.
•For example, the government wants to know the prevalence of a specific
health problem among residents in a given town. If there are 3 million
residents in the town it may not be realistic to test them individually and
determine whether they have the disease or are susceptible to it.

PART 1: DEFINITIONS
Biostatistics or biometry
•The realistic and cost-effective approach is to study a representative subset
of the population and apply their results to the entire group.
•Hence biostatistics makes research possible by providing tools and
techniques for collecting, analysing and interpreting biological and medical
data, allowing stakeholders to draw actionable insights about a population
from sample data.
•Biostatisticians usually get their data from a wide range of sources,
including medical records, peer-reviewed literature, claims records, vital
records, disease registries, surveillance, experiments and surveys.
•The professionals collaborate with scientists, health care providers, public
health administrators and other stakeholders.

PART 1: DEFINITIONS
Biostatistics or biometry sources of data
•Medical records: Medical records can provide researchers with data about diagnoses, lab tests and procedures common amongst a specific population, such as people above 50 years working in the police force.
•Claims data: Scientists can get data about doctor's appointments and medical bills from claims data.
•Vital records: Vital records contain information about births, deaths, causes of death and divorces.
•Peer-reviewed literature: Researchers can also pull data from articles and studies that experts in a particular field have published in peer-reviewed journals.
•Surveys: Researchers can collect primary data using surveys designed specifically for an experiment.
•Disease registries: These systems help to collect, store, analyse, retrieve and disseminate information regarding people living with specific diseases.

Part 1 Types of statistics:

PART 1: DESCRIPTIVE STATISTICS
Descriptive statistics
•Mostly focus on the central tendency, variability, and distribution of sample data.
•Central tendency means the estimate of the characteristics, a typical element of a sample or population. It includes descriptive statistics such as mean, median, and mode.
•Variability refers to a set of statistics that show how much difference there is among the elements of a sample or population along the characteristics measured. It includes metrics such as range, variance, and standard deviation.
•The distribution refers to the overall "shape" of the data, which can be depicted on a chart such as a histogram or a dot plot, and includes properties such as the probability distribution function, skewness, and kurtosis

CENTRAL TENDENCY, VARIABILITY
60, 65, 66, 68, 70, 70, 70, 70, 70, 71, 72, 75, 80, 81, 82, 83, 85, 86, 88, 90
Sum = 1502
Mean = 1502/20 = 75.1
Mode = 70
Median = 71.5
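The computations above can be reproduced with Python's standard `statistics` module; a minimal sketch using the slide's 20 scores:

```python
import statistics

# The 20 scores from the slide
scores = [60, 65, 66, 68, 70, 70, 70, 70, 70, 71,
          72, 75, 80, 81, 82, 83, 85, 86, 88, 90]

print(sum(scores))                 # sum of the scores: 1502
print(statistics.mean(scores))     # mean: 1502/20 = 75.1
print(statistics.mode(scores))     # most frequent value: 70
print(statistics.median(scores))   # average of the 10th and 11th values: 71.5
```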

CENTRAL TENDENCY, VARIABILITY
60, 65, 66, 68, 70, 70, 70, 70, 70, 71, 72, 75, 80, 81, 82, 83, 85, 86, 88, 90
Deviations from 75 (rounded mean): -15, -10, -9, -7, -5, -5, -5, -5, -5, -4, -3, 0, 5, 6, 7, 8, 10, 11, 13, 15
Squared deviations: 225, 100, 81, 49, 25, 25, 25, 25, 25, 16, 9, 0, 25, 36, 49, 64, 100, 121, 169, 225
Mean = 75.1, SS = 1394, SD = √(SS/20) ≈ 8.3

FORMULA FOR SD
SD = √[ Σ(value − mean)² / n ]
i.e. the sum of the squared deviations of each value from the mean, divided by the number of elements in the data set, then square-rooted.

SD
1. Calculate the mean
2. Subtract the mean from each element individually
3. Square the differences from the subtraction
4. Get the sum of the squared differences
5. Divide the sum of the squared differences by the number of elements
6. Get the square root of the answer after division (quotient) = SD
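The six steps can be sketched in Python, assuming (as on the slides) the population form of the SD with divisor n:

```python
import math

scores = [60, 65, 66, 68, 70, 70, 70, 70, 70, 71,
          72, 75, 80, 81, 82, 83, 85, 86, 88, 90]

mean = sum(scores) / len(scores)            # step 1: the mean (75.1)
deviations = [x - mean for x in scores]     # step 2: subtract the mean
squared = [d ** 2 for d in deviations]      # step 3: square the differences
ss = sum(squared)                           # step 4: sum of squares
variance = ss / len(scores)                 # step 5: divide by n
sd = math.sqrt(variance)                    # step 6: square root = SD
print(round(sd, 1))  # 8.3, matching the slide
```

(The slide rounds the mean to 75 before subtracting, giving SS = 1394; using the exact mean 75.1 gives SS = 1393.8. Both round to SD ≈ 8.3.)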

DESCRIPTIVE STATISTICS
Central tendency
Mean
Median
Mode
Variability
Range
Variance
SD
Shape or distribution
Skewness and kurtosis
Relative frequencies/proportions
Graphs, charts and tables

PART 1: DESCRIPTIVE STATISTICS
Descriptive statistics
•Can also describe differences between observed characteristics of the elements of a data set.
•Can help us understand the collective properties of the elements of a data sample and form the basis for testing hypotheses and making predictions using inferential statistics
•Useful in summarizing data
•Can be in the form of numbers, tables or graphs

Part 1: Descriptive statistics

PART 1: INFERENTIAL STATISTICS
Inferential statistics
•Is a tool that statisticians use to draw conclusions about the characteristics of a
population, drawn from the characteristics of a sample, and to determine how
certain they can be of the reliability of those conclusions.
•Based on the sample size and distribution, statisticians can calculate the
probability that statistics, which measure the central tendency, variability,
distribution, and relationships between characteristics within a data sample,
provide an accurate picture of the corresponding parameters of the whole
population from which the sample is drawn.
•Are used to make generalizations about large groups, such as estimating
average demand for a product by surveying a sample of consumers’ buying
habits or attempting to predict future events.

Part 1: Inferential Statistics

Part 1:

FACTORS ASSOCIATED WITH MEN’S INVOLVEMENT IN
ANTENATAL CARE VISITS IN ASMARA, ERITREA: COMMUNITY-
BASED SURVEY
The necessity for a pregnant woman to attend ANC was recognized by almost all (98.7%) of the male partners; however, only 26.6% identified the minimum frequency of ANC visits.
The percentage of partners who visited ANC services during their last pregnancy was 88.6%. The percentages of male partners who scored at or above the mean level of knowledge, attitude and involvement in ANC were 57.0%, 57.5% and 58.7%, respectively.
Religion (p = 0.006, AOR = 1.91, 95% CI 1.20–3.03), level of education (p = 0.027, AOR = 1.96, 95% CI 1.08–3.57), and level of knowledge (p < 0.001, AOR = 3.80, 95% CI 2.46–5.87) were significantly associated with male involvement in ANC.

METHODS USED
List of households with pregnant women was prepared for each administration area
and was used as sampling frame
A community-based cross-sectional survey was applied using a two-stage sampling
technique to select 605 eligible respondents in Asmara in 2019.
Data was collected using a pretested structured questionnaire.
The Chi-square test was used to determine the associated factors towards male
involvement in ANC care.
Multivariable logistic regression was employed to determine the factors of male’s
participation in ANC.
A P-value less than 0.05 was considered statistically significant.

USE-CASE INFORMATION NEEDED
Define target population
State the type of statistics you expect (descriptive, inferential, or both)
State the possible sources of data

PART 1: INFERENTIAL TESTS
Inferential tests
•Tests concerned with using selected sample data compared with population
data in a variety of ways are called inferential statistical tests.
•There are two main bodies of these tests.
•The first and most frequently used are called parametric statistical tests.
•The second are called nonparametric tests.
•For each parametric test, there may be a comparable nonparametric test,
sometimes even two or three.
•Parametric tests are tests of significance appropriate when the data
represent an interval or ratio scale of measurement

PART 1: INFERENTIAL TESTS
Parametric tests
•Tests of significance appropriate when the data represent an interval or ratio
scale of measurement and other specific assumptions have been met, specifically,
that the sample statistics relate to the population parameters, that the variance of
the sample relates to the variance of the population, that the population has
normality, and that the data are statistically independent.
Nonparametric tests
•Statistical tests used when the data represent a nominal or ordinal level scale or
when assumptions required for parametric tests cannot be met, specifically, small
sample sizes, biased samples, an inability to determine the relationship between
sample and population, and unequal variances between the sample and
population. These are a class of tests that do not hold the assumptions of normality.

PART 1: DATA TYPES
Data types
Qualitative: dichotomous, multinomial
Quantitative: discrete, continuous

ILLUSTRATION OF QUALITATIVE AND
QUANTITATIVE DATA
To assess the nutritional status and to determine potential risk factors of malnutrition
in children under 3 years of age in Nghean, Vietnam.
The study was carried out in November 2007; a total of 383 child/mother pairs were selected using a 2-stage cluster sampling methodology. A structured questionnaire was administered to mothers in their home settings.
Anthropometric measurement was defined as being underweight (weight for age),
wasting (weight for height) and stunting (height for age) on the basis of reference
data from the National Center for Health Statistics (NCHS) / World Health
Organization (WHO).

ILLUSTRATION OF QUALITATIVE AND QUANTITATIVE
DATA
Logistic regression analysis was used to take into account the hierarchical relationships between potential determinants of malnutrition.
The mean Z-score for weight-for-age was -1.51 (95% CI -1.64, -1.38), for height-for-age was -1.51 (95% CI -1.65, -1.37) and for weight-for-height was -0.63 (95% CI -0.78, -0.48). Of the children, 103 (27.7%) were underweight, 135 (36.3%) were stunted and 38 (10.2%) were wasted.
Region of residence, ethnicity, mother's occupation, household size, mother's BMI, number of children in the family, weight at birth, time of initiation of breast-feeding and duration of exclusive breast-feeding were found to be significantly related to malnutrition.
The findings of this study indicate that malnutrition is still an important problem among children under three years of age in Nghean, Vietnam. Socio-economic and environmental factors and feeding practices are significant risk factors for malnutrition among under-threes.

PART 1: COMMON STATISTICAL TERMS
Binomial test
•When a test has two alternative outcomes, either failure or success, and the probability of success is known, you may apply a binomial test.
•Use a binomial test to determine whether an observed test outcome differs from its predicted outcome.
Causation
•Causation is a direct relationship between two variables.
•Two variables have a direct relationship if a change in one’s value causes a
change in the other variable.
•In that case, one becomes the cause, and the other is the effect.
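The binomial test above can be illustrated with the standard library alone (libraries such as SciPy provide it ready-made as `scipy.stats.binomtest`). This sketch uses a hypothetical example of 9 successes in 10 trials:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_test(k, n, p=0.5):
    """Two-sided exact binomial test: sum the probabilities of all outcomes
    no more likely than the observed one (a common convention)."""
    observed = binom_pmf(k, n, p)
    return sum(binom_pmf(x, n, p) for x in range(n + 1)
               if binom_pmf(x, n, p) <= observed + 1e-12)

# Hypothetical example: 9 successes in 10 trials under an assumed p = 0.5
print(binomial_test(9, 10))  # ≈ 0.0215, so the outcome differs from chance
```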

PART 1: COMMON STATISTICAL TERMS
Confidence interval
•A confidence interval measures the level of uncertainty of a collection of
data.
•This is the range in which you anticipate your values to fall within a specific
degree of confidence if you repeat the same experiment.
Correlation coefficient
•The correlation coefficient describes the level of correlation or dependence
between two variables.
•This value is a number between -1 and +1, and if it falls beyond this limit,
there’s been a mistake in the measurement of a coefficient.

PART 1: COMMON STATISTICAL TERMS
Z-score:
•A score expressed in units of standard deviations from
the mean. It is also known as a standard score.
Z-test:
•A test of any of a number of hypotheses in inferential
statistics that has validity if sample sizes are sufficiently
large and the underlying data are normally distributed.
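A z-score is simple arithmetic on the mean and SD; a sketch using the example scores from the earlier slides:

```python
import statistics

scores = [60, 65, 66, 68, 70, 70, 70, 70, 70, 71,
          72, 75, 80, 81, 82, 83, 85, 86, 88, 90]

mean = statistics.mean(scores)    # 75.1
sd = statistics.pstdev(scores)    # population SD, ≈ 8.35

def z_score(x):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

print(round(z_score(90), 2))  # the top score is about 1.78 SDs above the mean
```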

PART 1: COMMON STATISTICAL TERMS
Hypothesis tests
•A hypothesis test is a method of testing results. Before conducting research, the researcher creates a
hypothesis or a theory for what they believe the results will prove.
•A study then tests that theory.
Kruskal-Wallis one-way analysis of variance:
•A nonparametric inferential statistic used to compare two or more independent groups for statistical
significance of differences.
Mann-Whitney U-test (U):
•A nonparametric inferential statistic used to determine whether two uncorrelated groups differ
significantly.
McNemar’s test:
•A nonparametric method used on nominal data to determine whether the row and column marginal
frequencies are equal. *NPT

PART 1: COMMON STATISTICAL TERMS
Dependent variable
•A dependent variable is a value that depends on another variable to exhibit change.
•When computing in statistical analysis, you can use dependent variables to make conclusions about causes of
events, changes and other translations in statistical research.
Independent variable
•In a statistical experiment, an independent variable is one that you modify, control or manipulate in order to
investigate its effects.
•It's called independent since no other factor in the research affects it.
Multivariate analysis of covariance (MANCOVA):
•An extension of ANOVA that incorporates two or more dependent variables in the same analysis. It is an
extension of MANOVA where artificial dependent variables (DVs) are initially adjusted for differences in one or
more covariates. It computes the multivariate F statistic.
Multivariate analysis of variance (MANOVA):
•It is an ANOVA with several dependent variables.

PART 1: COMMON STATISTICAL TERMS
One-way analysis of variance (ANOVA):
•An extension of the independent group t-test where you have more than two groups. It computes the
difference in means both between and within groups and compares variability between groups and
variables. Its parametric test statistic is the F-test.
Pearson correlation coefficient (r):
•This is a measure of the correlation or linear relationship between two variables x and y, giving a value
between +1 and −1 inclusive.
•It is widely used in the sciences as a measure of the strength of linear dependence between two
variables.
Pooled point estimate:
•An approximation of a point, usually a mean or variance, that combines information from two or more
independent samples believed to have the same characteristics.
•It is used to assess the effects of treatment samples versus comparative samples
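Pearson's r can be computed directly from its covariance-based definition; a minimal sketch with made-up paired data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations; result lies in [-1, +1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: a perfect positive linear relationship gives r = +1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```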

PART 1: COMMON STATISTICAL TERMS
Standard deviation
•The standard deviation is a metric that calculates the square root of a variance. It informs you
how far a single or group result deviates from the average.
Standard error of the mean
•The standard error of the mean estimates how much a sample's mean is likely to deviate from the population mean. You can find the standard error of the mean by dividing the standard deviation by the square root of the sample size.
Range
•The range is the difference between the lowest and highest values in a collection of data.
Quartile and quintile
•Quartile refers to data divided into four equal parts, while quintile refers to data divided into
five equal parts.
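The standard error of the mean follows directly from the definition above (SD divided by √n); a sketch with hypothetical measurements:

```python
import math
import statistics

sample = [12, 15, 11, 14, 13, 16, 12, 15]   # hypothetical measurements

sd = statistics.stdev(sample)               # sample standard deviation
sem = sd / math.sqrt(len(sample))           # standard error of the mean
print(round(sem, 3))  # ≈ 0.627
```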

PART 1: COMMON STATISTICAL TERMS
Pearson correlation coefficient
•Pearson's correlation coefficient is a statistical test that determines the connection between two continuous
variables.
•Because it is based on covariance, it is widely regarded as a sound approach to quantifying the relationship among variables of interest.
Median
•The median refers to the middle point of data.
•Typically, if you have a data set with an odd number of items, the median appears directly in the middle of
the numbers.
•When computing the median of a set of data with an even number of items, you can calculate the simple
mean between the two middle-most values to achieve the median.
Mode
•The mode is the value in a data set that occurs most often. If none of the values repeat, the data set has no mode.

PART 1: COMMON STATISTICAL TERMS
Statistical inference
•Statistical inference occurs when you use sample data to generate an inference or conclusion.
Statistical inference can include regression, confidence intervals or hypothesis tests.
Statistical power
•Statistical power is a metric of a study's probability of discovering statistical relevance in a
sample, provided the effect is present in the entire population. A powerful statistical test likely
rejects the null hypothesis.
Runs test:
•Where measurements are made according to some well-defined ordering, in either time or space.
•A frequent question is whether or not the average value of the measurement differs at different points in the sequence. This nonparametric test provides a means of answering it.

PART 1: COMMON STATISTICAL TERMS
T-score
•A t-score in a t-distribution refers to the number of standard deviations a sample is away
from the average.
Z-score
•A z-score, also known as a standard score, is a measurement of the distance between the
mean and data point of a variable. You can measure it in standard deviation units.
Z-test
•A z-test is a test that determines whether two populations' means are different. To use a z-test, the variances must be known and the sample size must be large
Sign test:
•A test that can be used whenever an experiment is conducted to compare a treatment with a
control on a number of matched pairs, provided the two treatments are assigned to the
members of each pair at random.

PART 1: COMMON STATISTICAL TERMS
Student t-test
•A Student t-test is a hypothesis test for the mean of a small sample from a bell-curve population when the standard deviation is unknown. Variants cover correlated means, correlation, independent proportions and independent means.
T-distribution
•When the population standard deviation is unknown and the data originate from a bell-curve population, the t-distribution describes the standardized deviations of sample means from the population mean.
Standard error of the mean (SEM):
•An estimate of the amount by which an obtained mean may be expected to differ by chance
from the true mean. It is an indication of how well the mean of a sample estimates the mean of a
population
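A one-sample t-score (how far the sample mean lies from a hypothesized population mean, in SEM units) can be sketched as follows; turning the t-score into a p-value requires t-distribution tables or a library such as SciPy:

```python
import math
import statistics

def t_score(sample, mu0):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    n = len(sample)
    s = statistics.stdev(sample)   # sample SD (divisor n - 1)
    return (statistics.mean(sample) - mu0) / (s / math.sqrt(n))

# Hypothetical sample, tested against a hypothesized mean of 5
print(round(t_score([5, 6, 7, 8, 9], 5), 3))  # 2.828
```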

PART 1: COMMON STATISTICAL TERMS
Variance (SD²):
•A measure of the dispersion of a set of data points around their mean value.
•It is the mathematical expectation of the average squared deviations from the mean
Analysis of covariance (ANCOVA):
•A statistical technique for equating groups on one or more variables when testing for statistical significance using the F-test statistic.
•It adjusts scores on a dependent variable for initial differences on other variables, such as pretest performance or IQ. *PT
Analysis of variance (ANOVA):
•A statistical technique for determining the statistical significance of differences among means; it can be used with two or more groups.

PART 1: COMMON STATISTICAL TERMS
Effect size
•Effect size is a statistical term that quantifies the degree of a relationship between
two given variables. For example, we can learn about the effect of therapy on
anxiety patients.
•The effect size aims to determine whether the therapy is highly successful or mildly
successful.
Measures of variability
•Measures of variability, also referred to as measures of dispersion, denote how
scattered or dispersed a database is.
•Four main measures of variability are the interquartile range, range, standard
deviation and variance.
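Effect size for a two-group comparison is often reported as Cohen's d (defined on a later slide): the difference in means divided by a pooled standard deviation. A sketch with hypothetical groups:

```python
import math
import statistics

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

# Hypothetical groups differing by half a pooled SD: a "medium" effect
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```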

PART 1: COMMON STATISTICAL TERMS
Median test
•A median test is a nonparametric test of whether two independent groups have the same median.
•It follows the null hypothesis that the two groups share the same median.
Population
•Population refers to the group you’re studying. This might include a certain
demographic or a sample of the group, which is a subset of the population.
Parameter
•A parameter is a quantitative measurement that you use to measure a population.
•It’s the unknown value of a population on which you conduct research to learn more.

PART 1: COMMON STATISTICAL TERMS
Post hoc test
•Researchers perform a post hoc test only after they’ve discovered a statistically
relevant finding and need to identify where the differences actually originated.
Probability density
•The probability density is a statistical measurement that measures the likely
outcome of a calculation over a given range.
Random variable
•A random variable is a variable in which the value is unknown.
•It can be discrete or continuous with any value given in a range.

PART 1: COMMON STATISTICAL TERMS
Chi-square (χ²):
•A nonparametric test of statistical significance appropriate when the data are in the form of
frequency counts; it compares frequencies actually observed in a study with expected
frequencies to see if they are significantly different.
Coefficient of determination (r²):
•The square of the correlation coefficient (r), it indicates the degree of relationship strength
by potentially explained variance between two variables.
Cohen’s d:
•A standardized way of measuring the effect size or difference by comparing two means by
a simple math formula. It can be used to accompany the reporting of a t-test or ANOVA
result and is often used in meta-analysis.
•The conventional benchmark scores for the magnitude of effect sizes are as follows: small, d
= 0.2; medium, d = 0.5; large, d = 0.8
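For a 2×2 contingency table, the chi-square statistic compares observed with expected frequency counts; a minimal stdlib sketch (SciPy's `scipy.stats.chi2_contingency` performs this and adds the p-value):

```python
def chi_square_2x2(table):
    """Chi-square statistic: sum of (observed - expected)^2 / expected,
    with expected counts computed from the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 table of frequency counts
print(round(chi_square_2x2([[10, 20], [20, 10]]), 3))  # 6.667
```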

PART 1: COMMON STATISTICAL TERMS
Cronbach’s alpha coefficient (α):
•A coefficient of consistency that measures how well a set of variables or items measures a single,
unidimensional, latent construct in a scale or inventory.
•Alpha scores are conventionally interpreted as follows: high, 0.90 and above; medium, 0.70 to 0.89; and low, 0.55 to 0.69
F-test (F):
•A parametric statistical test of the equality of the means of two or more samples. It compares the
means and variances between and within groups over time. It is also called analysis of variance
(ANOVA)
Tukey’s test of significance:
•A single-step multiple comparison procedure and statistical test generally used in conjunction with
an ANOVA to find which means are significantly different from one another.
•Named after John Tukey, it compares all possible pairs of means and is based on a studentized
range distribution q (this distribution is similar to the distribution of t from the t-test).

PART 1: COMMON STATISTICAL TERMS
Fisher’s exact test:
•A nonparametric statistical significance test used in the analysis of contingency tables where
sample sizes are small.
•The test is useful for categorical data that result from classifying objects in two different
ways; it is used to examine the significance of the association (contingency) between two
kinds of classifications
Wald-Wolfowitz test:
•A nonparametric statistical test used to test the hypothesis that a series of numbers is
random. It is also known as the runs test for randomness
Wilcoxon signed-rank test (W+):
•A nonparametric statistical hypothesis test for the case of two related samples or repeated
measurements on a single sample. It can be used as an alternative to the paired Student’s t-
test when the population cannot be assumed to be normally distributed.
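Fisher's exact test for a 2×2 table sums hypergeometric probabilities under fixed margins. A one-sided sketch using only `math.comb` (SciPy's `scipy.stats.fisher_exact` provides one- and two-sided versions):

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    probability of a count of a or larger in the top-left cell, given
    fixed row and column totals (hypergeometric distribution)."""
    n = a + b + c + d
    p = 0.0
    for k in range(a, min(a + b, a + c) + 1):
        p += comb(a + b, k) * comb(c + d, (a + c) - k) / comb(n, a + c)
    return p

# Hypothetical small-sample table with row totals 4, 4 and column totals 4, 4
print(round(fisher_exact_one_sided(3, 1, 1, 3), 4))  # 0.2429
```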

PART 1: COMMON STATISTICAL TERMS
Independent t-test:
•A statistical procedure for comparing measurements of mean scores in two
different groups or samples.
•It is also called the independent samples t-test. *
Kendall’s tau:
•A nonparametric statistic used to measure the degree of correspondence
between two rankings and to assess the significance of the correspondence.
Kolmogorov-Smirnov (K-S) test:
•A nonparametric goodness-of-fit test used to decide if a sample comes from a population with a specific distribution.
•The test is based on the empirical cumulative distribution function (ECDF)

PART 1: APPLICATION OF BIOSTATISTICS
1. Clinical Trials
•One of the most impactful applications of biostatistics is in the design and analysis of clinical
trials.
•Biostatisticians ensure the validity and reliability of trial results, which helps researchers assess
the safety and efficacy of new drugs and treatments.
•Using various statistical methods, we analyze patient data to draw conclusions that will help in
making medical decisions.
2. Epidemiology
•In the field of epidemiology, biostatistics aids in studying the distribution and determinants of
diseases within populations.
•Biostatisticians use different statistical models to analyze patterns, identify risk factors, and
assess the impact of interventions.
•This information is crucial for public health planning and for developing disease prevention
strategies.

PART 1: APPLICATION OF BIOSTATISTICS
3. Genetics and Genomics
•Biostatistics is indispensable in the analysis of genetic and genomic data.
•Researchers use statistical methods to identify genes associated with specific diseases,
understand the heritability of these genes, and figure out complex genetic interactions.
•This application of biostatistics is instrumental in advancing our understanding of the genetic basis of various medical conditions.
4. Public Health Policy
•Biostatistics contributes significantly to the formulation and evaluation of public health
policies.
•By analyzing health data, biostatisticians can assess the effectiveness of interventions,
evaluate health disparities, and guide policymakers in making informed decisions to
improve public health outcomes.

PART 1: APPLICATION OF BIOSTATISTICS
5. Environmental Health
•Biostatistics is applied in environmental health studies to analyze the impact of environmental factors on
human health. Whether it is assessing the effects of air quality on respiratory diseases or studying the
correlation between water contaminants and health outcomes, biostatistics helps decode the complex
relationships in environmental health research.
6. Bioinformatics
•This is the era of big data, biostatistics plays a crucial role in bioinformatics, where vast amounts of
biological data are analyzed to extract meaningful patterns. Biostatisticians develop statistical methods
and algorithms to interpret data from genomics, proteomics, and other ‘omics’ technologies. And the result
is visible in the form of advancements in personalized medicine and drug discovery.
7. Quality Control in Healthcare
•Biostatistics is also employed in quality control processes within healthcare systems. It ensures the accuracy
and reliability of medical tests, monitors healthcare processes, and helps identify areas for improvement.
This application is vital for maintaining high standards of patient care

PART 1: APPLICATION OF BIOSTATISTICS
8. Create population-based interventions
•Researchers can use biometric techniques to assess the impact of a
health programme on the target population.
•With biometric techniques, researchers can use insights from data to:
•Measure the performance of public health interventions
•Boost immunisation rates
•Increase the number of patients attending post-surgery appointments
•Improve training and supervision of health care professionals

PART 1: APPLICATION OF BIOSTATISTICS
9. Create population-based interventions
•Biometrics can also help researchers, health care providers and public health
administrators to create population-based health interventions based on the results
of biostatistical data analysis and interpretation.
These data insights can be used to:
•Identify populations that require interventions to reduce their exposure to specific
health problems
•Identify areas susceptible to high risk of certain diseases
•Identify the factors driving health disparities within a population
•Identify members of a population that require the highest level of health care

PART 1: APPLICATION OF BIOSTATISTICS
10. Control epidemics
•Biostatistical techniques can also help public health officials, health care practitioners and
epidemiologists to control epidemics.
•Researchers not only use statistical analysis to understand how diseases spread, but they can also
use it to determine the mortality rate amongst specific populations.
•It can also help health care professionals determine the most at-risk members of the population and
create a framework for formulating strategies to stop the spread of such diseases.
11. Identify barriers to care
•Researchers and health care professionals can use biostatistical methods to learn about the barriers
preventing people from getting access to quality care.
•Researchers use surveys to identify the factors that limit access to health care. Medical records,
interviews and claims records can show patient perceptions about health care services, providing
insights to make such services more accessible and acceptable to target populations for higher
efficiency.

PART 1: APPLICATION OF BIOSTATISTICS
12. Study demography
•Demography is the statistical study of the human population.
•The field uses statistical techniques to describe births, deaths, income, disease disparity and other
structural changes in human populations.
•Using census data, surveys and statistical models, biostatisticians can analyse the structure, size and
movement of populations, providing insights for government agencies, health care administrators, town
planners and other stakeholders to create and adjust their plans based on the dynamics of the population
13. Derive conclusions about populations from samples
•One major importance of biostatistical methods is that they help researchers derive far-reaching
conclusions about a population from samples.
•Due to several factors, such as finances, size and time constraints, it's not always possible for researchers
to collect data about an entire population when testing assumptions about them.
•Biostatistical methods provide researchers and administrators with the tools they require to select a
sample that's representative of the population, choose the right independent and dependent variables
and derive logical conclusions from the data

PART 1: APPLICATION OF BIOSTATISTICS
14. Check drug efficacy
•In the medical and pharmaceutical fields, biostatistical research is used to check the
efficacy and effectiveness of treatments during clinical trials.
•Researchers can also use it to find possible side effects of drugs.
•These methods are ideal for conducting drug treatment trials and performing other
experiments to understand the impact of different medications and medical devices on
the human body
15. Perform genetics studies
•It's an important discipline in the study of Mendelian genetics.
•Geneticists use it to study the inheritance patterns of genes.
•They also use it to study the genetic structure of a population.
•Researchers also use biometry to map chromosomes and understand the behaviour of
genes in a population.

PART 1: APPLICATION OF BIOSTATISTICS
16. Other applications
•Determining leading causes of death
and burden of disease
•Health status of the population
•Morbidity patterns

PART 1: APPLICATIONS OF BIOSTATISTICS
Predictive modelling
•In public health, predictive modeling is a pivotal aspect of biostatistics.
•This statistical process utilizes existing data to forecast future events, uncovering
patterns and trends.
•It is applied in epidemiology for screening individuals prone to specific diseases.
•For instance, in breast cancer screening, factors like age, race and family history
are analyzed to gauge an individual's risk.
•Predictive modeling plays a crucial role in preventing breast cancer-related deaths
by identifying individuals who may need preventive or treatment measures.
•Beyond cancer and pandemics, this approach extends to various public health
concerns, showcasing its versatility in foreseeing and addressing health challenges.

PART 1: APPLICATIONS OF BIOSTATISTICS
Decision-making
•Healthcare leaders
•Researchers
•Policymakers.
Operational Viability:
•Biostatistics provides the necessary data to assess the operational feasibility of new ideas and initiatives.
•It helps in making informed decisions about acquisitions, tool prototypes, and hiring strategies, setting the parameters for
project scopes and methodologies.
Guarding Against Bias:
•Biostatistical studies undergo rigorous examination to detect and eliminate bias.
•Public health’s commitment to equitability ensures that data collection processes are designed to be fair and objective,
preventing unfair conclusions.
Protecting Data Subjects:
•Biostatistical researchers prioritize the protection of data subjects. Personal information collected for public health
research is anonymized and safeguarded, addressing privacy concerns and mitigating risks associated with unsecured
data.

PART 1: DATA CLASSIFICATION
The main objectives of Classification of Data are
as follows:
•Explain similarities and differences in data
•Simplify and condense the mass of data
•Facilitate comparisons
•Study relationships
•Prepare data for tabular presentation
•Present a mental picture of the data

PART 1: DATA CLASSIFICATION
There are different types of data classification,
depending on the characteristics.
•Structured and Unstructured
•Primary or Secondary
•Qualitative and Quantitative.
•Number of variables: Univariate, Bivariate, Multivariate
Classifying data is an important step to ensure proper
analysis

PART 1: DATA CLASSIFICATION
Univariate data
•This type of data consists of only one variable.
•The analysis of univariate data is thus the simplest form of
analysis since the information deals with only one quantity that
changes.
•It does not deal with causes or relationships; the main
purpose of the analysis is to describe the data and find
patterns that exist within it.
•An example of univariate data is height.

PART 1: DATA CLASSIFICATION
Bivariate data
This type of data involves two different variables.
The analysis of this type of data deals with causes and relationships, and is done to find
out the relationship between the two variables.
An example of bivariate data is temperature and ice cream sales in the summer season.
Bivariate data analysis involves comparisons, relationships, causes and explanations.

PART 1: DATA CLASSIFICATION
Multivariate data
•When the data involves three or more variables, it is categorized under
multivariate.
•For example, suppose an advertiser wants to compare the popularity of four
advertisements on a website; their click rates could be measured for both men
and women, and relationships between variables can then be examined.
•It is similar to bivariate but contains more than one dependent variable
•The ways to perform analysis on this data depends on the goals to be
achieved.
•Some of the techniques are regression analysis, path analysis, factor
analysis and multivariate analysis of variance (MANOVA)

DATA ANALYSIS
Choice of method
•Size
•Complexity
•Number of variables
•Nature of variables
•Study objectives
•Research questions
•Hypothesis

PART 1: DATA ANALYSIS
Descriptive analysis
•Suitable for summarizing and presenting data, using measures such as
the mean and median.
Inferential analysis
•To establish functional relationships between variables, more
advanced analytical techniques, such as correlation and regression,
are used.

PART 1: DATA INTERPRETATION
Data interpretation
•Involves inferring conclusions from the results of
data analysis.
•This exercise allows researchers to categorise,
manipulate and summarise their findings to
answer important questions in public health,
biology and medicine.

PART 1: PRIMARY AND SECONDARY DATA
Definition
•Primary data are the original data derived from research endeavors
and collected through methods such as direct observation, indirect
observation, interviews and questionnaires
•Secondary data are data derived from primary data; sources
include published reports, journal articles and newspapers
•Often, the distinction between primary and secondary data may be
less than clear.
•In conducting research, both types of data are collected and created
•It is essential to have a plan for the management of all types of data
and primary materials

PART 1: SOURCES OF EPIDEMIOLOGICAL DATA
Epidemiologists use primary and secondary data sources to calculate
rates and conduct studies.
•Primary data is the original data collected for a specific purpose by or for an
investigator. For example, an epidemiologist may collect primary data by interviewing
people who became ill after eating at a restaurant in order to identify which specific
foods were consumed.
•Collecting primary data is expensive and time-consuming, and it usually is
undertaken only when secondary data is not available.
•Secondary data is data collected for another purpose by other individuals or
organizations.
•Examples of sources of secondary data that are commonly used in epidemiological
studies include birth and death certificates, population census records, patient medical
records, disease registries, insurance claim forms and billing records, public
health department case reports, and surveys of individuals and households

PART 1: PRIMARY AND SECONDARY DATA
Primary Materials | Primary Data | Secondary Data
Interview schedules | Interview audio recordings; surveys; experiments | Nvivo interview transcripts
Purchased laboratory reagents | Investigational product | Product analyses
Research animals | Tissue samples | Stained slides
Validated questionnaires | Completed paper-and-pencil questionnaires | SPSS data files containing raw data and calculated variable summary scores

PART 1: PRIMARY AND SECONDARY DATA
Basis for comparison | Primary data | Secondary data
Meaning | First-hand data gathered by the researcher himself | Data collected by someone else earlier
Data | Real-time data | Past data
Process | Very involved | Quick and easy
Source | Surveys, observations, experiments, questionnaires, personal interviews, etc. | Government publications, websites, books, journal articles, internal records, etc.
Cost effectiveness | Expensive | Economical
Collection time | Long | Short
Specificity | Always specific to the researcher's needs | May or may not be specific to the researcher's needs
Available in | Crude form | Refined form
Accuracy and reliability | More | Relatively less

Part 1: RAW DATA

PART 1: ELEMENTS, OBSERVATIONS,VARIABLES, DATA
Element
•Entities or units on which data are collected such as person, place, or object
Observation
•Set of measurements or observations related to a particular element
Variable
•Character or attribute of interest on a particular element and which takes on different values
Total number of data values
•The number of elements times the number of variables
Data
•A specific measurement of a variable; it is the value you record in your data sheet.

PART 1: QUANTITATIVE AND QUALITATIVE
VARIABLES
Data is generally divided into two categories:
•Quantitative data represents amounts
•Qualitative or categorical data represents groupings
•A variable that contains quantitative data is
a quantitative variable;
•A variable that contains categorical data is
a categorical variable.

PART 1: QUANTITATIVE AND QUALITATIVE
VARIABLES
Quantitative variables
•The numbers recorded represent real amounts
that can be added, subtracted, divided, etc.
•There are two types of quantitative variables:
discrete and continuous.

DATA TAXONOMY
Structured
•Qualitative: nominal, ordinal
•Quantitative: discrete, continuous
Unstructured
•Text: digital, analogue
•Image: digital, analogue

Indicate whether each of the following variables is discrete or continuous:
•the time it takes for you to get to school
•the number of Canadian couples who were married last year
•the number of goals scored by a women’s hockey team
•the speed of a bicycle
•your age
•the number of subjects your school offered last year
•the length of time of a telephone call
•the annual income of an individual
•the distance between your house and school
•the number of pages in a dictionary

PART 1: QUANTITATIVE VARIABLES
Discrete variables (aka integer variables)
•Represent counts of individual items or values.
•Examples: number of students in a class; number of different tree species in a forest
Continuous variables (aka ratio variables)
•Represent measurements of continuous or non-finite values.
•Examples: distance, volume, age

PART 1: QUALITATIVE VARIABLES
Qualitative variables
•Categorical variables represent groupings of some kind.
•They are sometimes recorded as numbers, but the numbers represent
categories rather than actual amounts of things.
•There are three types of categorical variables:
•Binary, nominal, and ordinal variables.
•Sometimes a variable can work as more than one type
•An ordinal variable can also be used as a quantitative variable if the scale is
numeric and doesn’t need to be kept as discrete integers.
•For example, star ratings on product reviews are ordinal (1 to 5 stars), but
the average star rating is quantitative.
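The star-rating example can be checked directly; the ratings below are invented for illustration:

```python
import statistics

# Individual ratings are ordinal categories (1 to 5 stars)...
ratings = [5, 4, 4, 3, 5, 2, 4]

# ...but their average is a quantitative summary
average = statistics.mean(ratings)
print(average)  # 27/7 ≈ 3.86 stars
```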

PART 1: QUALITATIVE VARIABLES
Binary variables (aka dichotomous variables)
•Represent yes or no outcomes.
•Examples: heads/tails in a coin flip; win/lose in a football game
Nominal variables
•Represent groups with no rank or order between them.
•Examples: species names, colors, brands
Ordinal variables
•Represent groups that are ranked in a specific order.
•Examples: finishing place in a race; rating scale responses in a survey, such as Likert scales

PART 1: INDEPENDENT AND DEPENDENT
VARIABLES
Independent vs dependent variables
•Experiments are usually designed to find out what effect one
variable has on another, for instance the effect of salt addition on
plant growth.
•The independent variable (the one you think might be the cause) is
manipulated and then the dependent variable (the one you think
might be the effect) is measured to find out what this effect might
be.
•There are variables that you hold constant (control variables) in
order to focus on your experimental treatment.

PART 1: INDEPENDENT AND DEPENDENT VARIABLES
Independent vs dependent vs control variables
Independent variables (aka treatment variables)
•Definition: variables you manipulate in order to affect the outcome of an experiment.
•Example: the amount of salt added to each plant’s water.
Dependent variables (aka response variables)
•Definition: variables that represent the outcome of the experiment.
•Example: any measurement of plant health and growth; in this case, plant height and wilting.
Control variables
•Definition: variables that are held constant throughout the experiment.
•Example: the temperature and light in the room the plants are kept in, and the volume of water given to each plant.

OTHER COMMON TYPES OF VARIABLES
Other types
Definition of the independent and dependent variables and determination of whether they are
categorical or quantitative enables choice of the correct statistical test.
Confounding variables
•Definition: a variable that hides the true effect of another variable in an experiment.
This can happen when another variable is closely related to a variable you are interested
in, but you haven’t controlled it in your experiment. Be careful with these, because
confounding variables run a high risk of introducing a variety of research biases to your
work, particularly omitted variable bias.
•Example: pot size and soil type might affect plant survival as much as or more than salt
additions. In an experiment you would control these potential confounders by holding
them constant.
Latent variables
•Definition: a variable that can’t be directly measured, but that you represent via a proxy.
•Example: salt tolerance in plants cannot be measured directly, but can be inferred from
measurements of plant health in our salt-addition experiment.
Composite variables
•Definition: a variable that is made by combining multiple variables in an experiment.
These variables are created when you analyze data, not when you measure it.
•Example: the three plant health variables could be combined into a single plant-health
score to make it easier to present your findings.

PART 1: VARIABLES IN RESEARCH
No | Variable | Type | Measurement scale | Categories
1 | Age (years) | Independent | Interval | -
2 | Weight (kg) | Independent | Interval | -
3 | Serum creatinine (μmol/L) | Independent | Interval | -
4 | Blood cholesterol (mmol/L) | Independent | Interval | -
5 | Serum triglyceride (mmol/L) | Independent | Interval | -
6 | Blood uric acid (μmol/L) | Independent | Interval | -
7 | Fasting blood glucose (mmol/L) | Independent | Interval | -
8 | Systolic blood pressure (mmHg) | Independent | Interval | -
9 | Diastolic blood pressure (mmHg) | Independent | Interval | -
10 | Hemoglobin (g/L) | Independent | Interval | -
11 | Hematocrit | Independent | Interval | -

PART 1: VARIABLES IN RESEARCH
No | Variable | Type | Measurement scale | Categories
12 | BMI | Independent | Interval | -
13 | ≥High school education | Independent | Nominal | Above/Under
14 | Health insurance coverage | Independent | Nominal | Yes/No
15 | Smoking | Independent | Nominal | Yes/No
16 | History of CKD | Independent | Nominal | Yes/No
17 | Family history of diabetes | Background | Nominal | Yes/No
18 | Family history of hypertension | Background | Nominal | Yes/No
19 | Family history of CKD | Background | Nominal | Yes/No
20 | Repeated respiratory tract infection | Background | Nominal | Yes/No
21 | Nephrotoxic medications | Independent | Nominal | Yes/No
22 | Obesity | Independent | Nominal | Yes/No

PART 1: VARIABLES IN RESEARCH
No | Variable | Type | Measurement scale | Categories
23 | Central obesity | Independent | Nominal | Yes/No
24 | Metabolic syndrome | Independent | Nominal | Yes/No
25 | Hypertension | Independent | Nominal | Yes/No
26 | Diabetes | Independent | Nominal | Yes/No
27 | Hyperlipidemia | Independent | Nominal | Yes/No
28 | Hyperuricemia | Independent | Nominal | Yes/No
29 | Cardiovascular disease | Independent | Nominal | Yes/No
30 | eGFR <60 mL/min/1.73 m² | Independent | Nominal | Yes (<60)/No (≥60)
31 | ACR >30 mg/g | Independent | Nominal | Yes (>30)/No (≤30)
32 | Hematuria | Independent | Nominal | Yes/No
33 | CKD status | Dependent | Nominal | Yes/No

PART 1: DISCUSS THE CATEGORIZATION OF THE
FOLLOWING VARIABLES
Number of all hospital discharges
Acute care hospital discharges per 100
Number of acute care hospital discharges
Inpatient surgical procedures per year per 100 000
Total number of inpatient surgical procedures per
year
Average length of hospital stay
Bed occupancy rate (%)
Outpatient contacts per person per year
Autopsy rate (%) for hospital deaths
Inpatient care discharges per 100
Turnover rate
Outpatient/In-patient ratio
Number of surgeries
Number of deliveries
Number of x-rays/scans
Number of lab tests
Number of beds per capita
Number of

PART 1: QUALITATIVE RESEARCH METHODS
Surveys
•Purpose: quickly and/or easily gets lots of information from people in a
non-threatening way
•Advantages: can complete anonymously; inexpensive to administer; easy to compare
and analyze; administer to many people; can get lots of data; many sample
questionnaires already exist
•Challenges: might not get careful feedback; wording can bias client's responses;
impersonal; may need sampling expert; doesn't get full story
Interviews
•Purpose: understand someone's impressions or experiences; learn more about
answers to questionnaires
•Advantages: get full range and depth of information; develops relationship with
client; can be flexible with client
•Challenges: can take time; can be hard to analyze and compare; can be costly;
interviewer can bias client's responses
Observation
•Purpose: gather firsthand information about people, events, or programs
•Advantages: view operations of a program as they are actually occurring; can
adapt to events as they occur
•Challenges: can be difficult to interpret seen behaviors; can be complex to
categorize observations; can influence behaviors of program participants; can be
expensive

PART 1: QUALITATIVE RESEARCH METHODS
Focus Groups
•Purpose: explore a topic in depth through group discussion
•Advantages: quickly and reliably get common impressions; can be an efficient way
to get much range and depth of information in a short time; can convey key
information about programs
•Challenges: can be hard to analyze responses; need a good facilitator for safety
and closure; difficult to schedule 6-8 people together
Case Studies
•Purpose: understand an experience or conduct a comprehensive examination through
cross comparison of cases
•Advantages: depicts client's experience in program input, process and results;
powerful means to portray program to outsiders
•Challenges: usually time consuming to collect, organize and describe; represents
depth of information, rather than breadth

FACTORIALS
These provide an easier way of writing large numbers in compact form.
Just as in mathematics we can use bases to write large numbers:
For instance, 10 in base 2 = 1010
100 in base 2 = 1100100
10! = 10x9x8x7x6x5x4x3x2x1 = 3,628,800

PART 1: FACTORIALS
The Factorial of a whole number 'n' is defined as the product of that number with
every whole number less than or equal to 'n' till 1.
For example, the factorial of 4 is 4 × 3 × 2 × 1, which is equal to 24. It is
represented using the symbol '!'.
5 factorial, that is, 5! can be written as: 5! = 5 × 4 × 3 × 2 × 1 = 120.
The formulas for n factorial are:
n! = n(n-1)(n-2)…(3)(2)(1)
n! = n × (n-1)!

PART 1: FACTORIALS
5!=5x4x3x2x1
6!=6x5x4x3x2x1
0!=1
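The recursive definition n! = n × (n-1)!, with the convention 0! = 1, can be sketched in a few lines of Python; the standard library's `math.factorial` gives the same results:

```python
import math

def factorial(n: int) -> int:
    # n! = n × (n-1)!, with the convention 0! = 1
    return 1 if n == 0 else n * factorial(n - 1)

print(factorial(0))   # 1
print(factorial(5))   # 120
print(factorial(10))  # 3628800

# The standard library provides the same computation
assert factorial(6) == math.factorial(6) == 720
```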

PART 1: FACTORIALS

PERMUTATION AND COMBINATION
Both are about rearranging numbers or objects to change their position.
For instance, a set of numbers like 123 can be rearranged in different ways, e.g.
123, 132, 321, 312, 213, 231 (6 arrangements)
A set like 12 can be rearranged as
12, 21 (2 arrangements)
A set like 1 has only one arrangement:
1

A set {1, 2, 3, 4, 5, 6, 7, 8, 9}
Permutation
Take the arrangement 123, where n=3 and r=3
nPr = 3P3 = n!/(n-r)! = 3!/(3-3)! = 3!/0! = (3x2x1)/1 = 6/1 = 6
Combination
nCr, where n=3 and r=3
nCr = n!/[(n-r)! x r!] = 3!/[(3-3)! x 3!] = (3x2x1)/[(0!) x (3x2x1)] = 6/6 = 1

PREMIER LEAGUE EXAMPLE
The set of 20 clubs which are unique
They play in pairs, meaning 2 at a time
That is r=2
Then n=20
For permutation, the number of unique pairs given that the order matters (i.e. home vs
away): the pairs = nPr = 20P2 = 20!/(20-2)! = 20!/18! = 20x19 = 380

PART 1: USE OF FACTORIAL
Use of Factorial
•One area where factorials are widely used is in permutations &
combinations.
•Permutation is an ordered arrangement of outcomes and it can be
calculated with the formula:
nPr = n! / (n-r)!
•Combination is a grouping of outcomes in which order does not
matter. It can be calculated with the formula:
nCr = n! / [(n-r)! r!]
•In both of these formulas, 'n' is the total number of things available
and 'r' is the number of things that have to be chosen.
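Both formulas can be checked against Python's built-in helpers (`math.perm` and `math.comb`, available from Python 3.8); the figures below follow the slide's Premier League setting of n=20 clubs taken r=2 at a time:

```python
import math

n, r = 20, 2  # 20 clubs, played in pairs

# Permutation: order matters (home vs away fixtures)
npr = math.factorial(n) // math.factorial(n - r)
assert npr == math.perm(n, r) == 380

# Combination: order does not matter (unique pairings)
ncr = math.factorial(n) // (math.factorial(n - r) * math.factorial(r))
assert ncr == math.comb(n, r) == 190
```

Note that every combination of 2 clubs corresponds to 2! = 2 permutations, which is why 380 = 2 × 190.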

PART 1: FACTORIAL
Set {1,2,3} has three elements, meaning n=3
•Permutation, i.e. where the order of the elements of the sub-set matters
and makes a difference: 1,2; 1,3; 2,3; 2,1; 3,1; 3,2; the number of
subsets of twos is equal to 6
•nPr = n!/(n-r)!, where n is the total number of elements in the mother set
or population and r is the number of elements in the subset.
•For example, if we had 20 elements in the mother set and we are
picking two at a time, then by permutation we proceed as follows:
20P2
•Therefore 20P2 = 20!/(20-2)! = 20!/18! = 20x19 = 380

PART 1: FACTORIAL
Set {1,2,3} has three elements, meaning n=3
•Combination, i.e. where the order of the elements of the sub-set does not
matter and makes no difference: 1,2; 1,3; 2,3; the number of subsets
of twos is equal to 3
•nCr = n!/[(n-r)! r!], where n is the total number of elements in the mother
set or population and r is the number of elements in the subset.
•For example, if we had 20 elements in the mother set and we are
picking two at a time, then by combination we proceed as follows:
20C2
•Therefore 20C2 = 20!/[(20-2)! 2!] = 20!/(18! 2!) = 190

PART 1: PERMUTATION-THE ORDER MATTERS
{1, 2, 3} arranged in pairs using permutation, i.e. the order should be respected and
matters:
1,2; 1,3; 2,3; 2,1; 3,1; 3,2 = 6 pairs
{1,2,3,4}
1,2; 1,3; 1,4; 2,3; 2,4; 3,4; 2,1; 3,1; 4,1; 3,2; 4,2; 4,3 = 12 pairs
In the Premier League we have 20 clubs:
nPr = n! / (n-r)!
20P2 = 20! / (20-2)! = 20!/18! = 20x19 = 380

PART 1: COMBINATION –ORDER DOES NOT
MATTER
{1,2,3}
1,2; 2,3; 1,3 = 3 pairs
nCr = n! / [(n-r)! r!]
3C2 = 3! / [(3-2)! 2!]
= 3!/[1! x 2!] = 6/2 = 3
For the Premier League with 20 clubs, if the games were one way only then the number
of games would be = 20C2 = 20!/(18! x 2!) = 190

PART 1: COMBINATIONS
nCr = n! / [(n-r)! r!], where r is the number of elements we pick for arrangement
at a time
30C5 = 30! / [(30-5)! 5!]
= 30! / [25! 5!] = 142,506

PART 1: USE OF FACTORIALS
Example 1:How many 5-digit numbers can be formed using the digits 1, 2, 5, 7, and
8 in each of which no digit is repeated?
Solution:
The given 5 digits (1, 2, 5, 7 and 8) should be arranged among themselves in order
to get all possible 5-digit numbers.
The number of ways for doing this can be done by calculating the 5 factorial.
5! = 5 ×4 ×3 ×2 ×1 = 120
Answer:Therefore, the required number of 5-digit numbers is 120.

PART 1: USE OF FACTORIAL
Example 2: In a group of 10 people, $200, $100, and $50 prizes are to be given. In how many
ways can the prizes be distributed?
Solution:
This is permutation because here the order of distribution of prizes matters. It can be calculated
as 10P3 ways.
10P3 = (10!) / (10-3)! = 10! / 7! = (10 × 9 × 8 × 7!) / 7! = 10 × 9 × 8 = 720 ways.
Example 3: Three $50 prizes are to be distributed to a group of 10 people. In how many ways
can the prizes be distributed?
Solution:
This is a combination because here the order of distribution of prizes does not matter (because all
prizes are of the same worth). It can be calculated using 10C3.
10C3 = (10!) / [3! (10-3)!] = 10! / (3! 7!) = (10 × 9 × 8 × 7!) / [(3 × 2 × 1) 7!] = 120 ways.
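Both worked examples can be verified with the standard library (a small sanity check, not part of the original solutions):

```python
import math

# Example 2: three distinct prizes ($200, $100, $50) among 10 people; order matters
print(math.perm(10, 3))  # 720

# Example 3: three identical $50 prizes among 10 people; order does not matter
print(math.comb(10, 3))  # 120
```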

PERMUTATION AND COMBINATION
Difference between Permutation and Combination
Permutation | Combination
The different ways of arranging a set of objects into a sequential order are termed as permutation. | One of the several ways of choosing items from a large set of objects, without considering an order, is termed as combination.
The order is very relevant. | The order is quite irrelevant.
It denotes the arrangement of objects. | It does not denote the arrangement of objects.
Multiple permutations can be derived from a single combination. | From a single permutation, only a single combination can be derived.
They can simply be defined as ordered elements. | They can simply be defined as unordered sets.

PART 1: UNIVARIATE AND BIVARIATE DATA
Univariate data
•Data on one variable
•Examples include height, skin colour, ethnicity, service coverage
Bivariate data
•Data where two variables are being compared for correlation or
causation
•Correlation: height and body weight; age and body weight
•Causation: such as obesity and heart disease

PART 1: UNIVARIATE AND BIVARIATE DATA
Univariate analysis
•Summary statistics
•Central tendency
•Dispersion
•Frequency distribution
•Bar charts
•Histogram
•Pie chart
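The summary statistics listed above can be sketched with Python's standard library; the height values are invented for illustration:

```python
import statistics
from collections import Counter

heights_cm = [158, 162, 165, 165, 170, 172, 175, 180]  # hypothetical sample

# Central tendency
print(statistics.mean(heights_cm))    # 168.375
print(statistics.median(heights_cm))  # 167.5
print(statistics.mode(heights_cm))    # 165

# Dispersion
print(max(heights_cm) - min(heights_cm))       # range: 22
print(round(statistics.stdev(heights_cm), 2))  # sample standard deviation

# Frequency distribution (the basis for a bar chart or histogram)
print(Counter(heights_cm))
```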

PRACTICE QUESTIONS
1.Explain why a sample statistic (the estimate from the sample) may differ from the
population parameter (the true value) and how you would minimize the difference.
2.A local coffee shop is creating a spreadsheet of their drinks for customers to view
on their website. The spreadsheet includes the calories, sugar content, and
ingredients for each coffee drink. Which of the following would be considered a
variable in this data set?
Answers:
The Calories
The Customers
The Coffee Shop
The Coffee Drink
What are the other variables in the passage?

PRACTICE QUESTIONS
1.A political pollster is conducting a survey about voters' affiliation to a major
political party. He selects a random sample of voters who voted in the last
presidential election, and looks into how party affiliation differs based on age,
race, gender and location. How many variables can you identify in this data set?
Answers:
A.5
B.6
C.4
D.7

PART 1: SCALES OF MEASUREMENT
Rationale
•In order to analyze data, the variables have to be defined and categorized using
different scales of measurements.
•There are four scales of measurement: nominal scale, ordinal scale, interval scale,
and ratio scale.
•The scale of measurement of a variable determines the kind of statistical test to be
used.
•Psychologist Stanley Stevens developed the four common scales of measurement:
nominal, ordinal, interval and ratio.
•1. Nominal scale
•2. Ordinal scale
•3. Interval scale
•4. Ratio scale

PART 1: SCALES OF MEASUREMENT
Properties and scales of measurement
•Each scale of measurement has properties that determine how to properly analyse the data.
•The properties evaluated are identity, magnitude, equal intervals and a minimum value of
zero.
Properties of Measurement
•Identity: Identity refers to each value having a unique meaning.
•Magnitude: Magnitude means that the values have an ordered relationship to one another, so
there is a specific order to the variables.
•Equal intervals: Equal intervals mean that data points along the scale are equal, so the
difference between data points one and two will be the same as the difference between data
points five and six.
•A minimum value of zero: A minimum value of zero means the scale has a true zero point.
Degrees, for example, can fall below zero and still have meaning. But if you weigh nothing,
you don’t exist.

PART 1: STATISTICAL LEVELS OF
MEASUREMENT
Nominal-level Measurement
•There’s no numerical or quantitative value, and
qualities are not ranked.
•Nominal-level measurements are instead simply
labels or categories assigned to other variables.
•It’s easiest to think of nominal-level measurements
as non-numerical facts about a variable.

SCALES OF MEASUREMENT
Nominal scale,
•Also known as categorical variable scale, can be defined as a scale used for
labelling variables into different categories.
•The numbers are used to identify and classify people, objects or events, like
identity number, jersey number of sportspersons, and vehicle registration
number; thus, they have no specific numerical value or meaning.
•In research, the nominal scale is used for analysing categorical variables such
as gender, place of residence, marital status, political party, blood group
and so on.
•The interval between numbers and their order does not matter on the
nominal scale

SCALES OF MEASUREMENT
Nominal scale:
•A nominal scale preserves only the equality property; there is no
‘more or less than’ relation in this measurement.
•Thenominal scale of measurementdefines the identity property
of data.
•This scale has certain characteristics, but doesn’t have any form
of numerical meaning.
•The data can be placed into categories but can’t be multiplied,
divided, added or subtracted from one another.
•It’s also not possible to measure the difference between data
points

SCALES OF MEASUREMENT
Nominal scale:
•The statistical analysis that can be performed on a nominal scale is the
frequency distribution and percentage.
•It can be analyzed graphically using a bar chart or a pie chart. If there are two
categorical variables, quantitative analysis techniques such as joint frequency
distribution and cross-tabulation can be used.
•Mode is the only measure of central tendency which can be used in this scale.
•Since numbers do not have a quantitative value, addition, subtraction,
multiplication, division, and measures of dispersion cannot be applied.
•It is also possible to perform contingency correlation. Hypothesis tests can be
carried out on data collected in the nominal form using the Chi-square test. It can
tell whether there is an association between the variables.
•However, it cannot establish a cause and effect relationship or explain the form
of relationship.
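The Chi-square test mentioned above can be sketched by hand on a hypothetical 2×2 table (the counts are invented; in practice a library routine such as scipy's `chi2_contingency` would also report the p-value):

```python
# Hypothetical 2x2 table: smoking status (rows) vs disease status (columns)
observed = [[30, 70],   # smokers: diseased / not diseased
            [20, 130]]  # non-smokers: diseased / not diseased

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Chi-square statistic: sum of (O - E)^2 / E over all cells,
# where E is the expected count under independence
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # 10.42, well above 3.84 (critical value, 1 df, α = 0.05)
```

A statistic above the critical value suggests an association between the variables, but, as the slide notes, it says nothing about cause and effect.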

PART 1: STATISTICAL LEVELS OF
MEASUREMENT
Ordinal-level Measurement
•Outcomes can be arranged in an order, but all data
values have the same value or weight.
•Although they’re numerical, ordinal-level measurements
can’t be meaningfully subtracted from each other
because only the position of the data point matters.
•Ordinal levels are often incorporated into nonparametric
statistics and compared against the total variable group.

SCALES OF MEASUREMENT
Ordinal scale
•is a ranking scale in which numbers are assigned to variables to represent their
rank or relative position in the data set.
•The variables are arranged in a specific order rather than just naming them.
•So they can be named, grouped, and ranked.
•In research, the ordinal scale is used for ranking students in a class (1,2,3), rating
product satisfaction (very unsatisfied-1, unsatisfied-2, neutral-3, satisfied-4,
very satisfied-5), evaluating the frequency of occurrences (very often-1, often-2,
not often-3, not at all-4), and assessing the degree of agreement (totally agree-1,
agree-2, neutral-3, disagree-4, totally disagree-5).
•In this scale, the attributes are arranged in ascending or descending order. The
numbers indicate rank or the order of quality or quantity.

SCALES OF MEASUREMENT
Ordinal Scale:
•The origin of scale is absent because there is no fixed start or ‘true zero’ in the data.
•Hence, it is impossible to find the magnitude of difference or distance between the variables or their
degree of quality.
•For example, while ranking students in terms of potential for an award, a student labelled ‘1’ is better
than the student labelled ‘2’, ‘2’ is better than ‘3’ and so forth.
•However, this ordinal scaling cannot quantify how much better the first student is than the second
student, nor whether the difference in potential between the first and second students is the same as
the difference between the second and third.
•Similarly, very satisfied will always be better than satisfied and unsatisfied will be better than very
unsatisfied.
•The order of variables is of prime importance, and so is the labelling.
•The ordinal scale is the second level of measurement from a statistical point of view.
•These scales are unique up to a monotone transformation. A monotone transformation T is one that assigns
new values such that if f(X) > f(Y) in the ordinal scale, then T(f(X)) > T(f(Y)) in the newly transformed scale.
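The monotone-transformation idea can be sketched in a few lines of Python. This is an illustrative sketch with invented satisfaction codes; the particular relabelling T(x) = 10x + 7 is an arbitrary example of a strictly increasing function:

```python
# Ordinal satisfaction codes (1 = very unsatisfied ... 5 = very satisfied)
codes = [3, 1, 5, 2, 4]

# Any strictly increasing (monotone) relabelling, e.g. T(x) = 10*x + 7,
# preserves every pairwise ordering: f(X) > f(Y) implies T(f(X)) > T(f(Y)).
transformed = [10 * x + 7 for x in codes]

def ranks(values):
    """Return the rank (1 = smallest) of each value in its original position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# The ranks - the only information an ordinal scale carries - are unchanged
print(ranks(codes))        # [3, 1, 5, 2, 4]
print(ranks(transformed))  # [3, 1, 5, 2, 4] - identical ordering
```

Because only the ordering survives, any two relabellings related by a monotone transformation are statistically interchangeable at the ordinal level.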

SCALES OF MEASUREMENT
Ordinal Scale:
•The ordinal data can be presented using tabular or graphical formats.
•Descriptive statistics such as percentiles, quartiles, the median and the mode
can be determined for ordinal-scale data. Since the interval between
numbers carries no meaning, addition, subtraction, multiplication, division, and
measures of dispersion cannot be applied.
•It is possible to test for order correlation using Spearman's rank
correlation coefficient.
•Non-parametric tests such as the Mann-Whitney U test, Friedman’s ANOVA, and the
Kruskal–Wallis H test can also be used to analyze ordinal-scale data.
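The ordinal analyses above can be sketched with the standard library alone. The satisfaction ratings and the two examiners' rankings below are invented for illustration; the Spearman formula used is the standard tie-free shortcut rho = 1 − 6·Σd² / (n(n² − 1)):

```python
from statistics import median, mode

# Hypothetical satisfaction ratings on a 1-5 ordinal scale
ratings = [5, 3, 4, 3, 2, 3, 4, 5, 3, 1]

print(median(ratings))  # 3.0 - the middle rank
print(mode(ratings))    # 3   - the most frequent category

# Spearman rank correlation (no ties) between two invented examiners'
# rankings of the same five students
rank_a = [1, 2, 3, 4, 5]
rank_b = [2, 1, 4, 3, 5]
n = len(rank_a)
d_sq = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
rho = 1 - 6 * d_sq / (n * (n ** 2 - 1))
print(rho)  # 0.8 - strong agreement in ordering
```

Note that only rank-based quantities appear: no means or standard deviations of the raw codes, consistent with the restrictions listed above.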

SCALES OF MEASUREMENT
Interval Scale
•can be defined as a quantitative scale in which both the order and the exact difference
between categories are known.
•Thus it measures variables that can be labelled, ordered, and have an equal interval.
•However, the point of beginning or zero point on an interval scale is arbitrarily
established and is not a ‘true zero’ or ‘absolute zero’.
•Thus the value of zero does not indicate the complete absence of the characteristic being
measured.
•In Fahrenheit/Celsius temperature scales, 0°F and 0°C do not indicate an absence of
temperature.
•In fact, negative values of temperature do exist.
•Temperature, calendar years, attitudes, opinions and so on fall under the interval scale.
Likert scale, Net Promoter Score (NPS), Bipolar matrix table, Semantic differential scale
are the widely used interval scale examples

PART 1: STATISTICAL LEVELS OF
MEASUREMENT
Interval-level Measurement
•Outcomes can be arranged in order, and differences
between data values now have meaning.
•Two data points are often used to compare the passing
of time or changing conditions within a data set.
•There is often no “starting point” for the range of data
values, and calendar dates or temperatures may not
have a meaningful intrinsic zero value.

SCALES OF MEASUREMENT
Interval Scale:
•The major difference between ordinal and interval scale is the existence of
meaningful and equal intervals between variables.
•For example, 40 degrees is higher than 30 degrees, and the difference between
them is a measurable 10 degrees, as is the difference between 90 and 100
degrees.
•However, while ranking students on an ordinal scale, the difference between first
and second student might be 5 marks, and between second and third student is 8
marks.
•Thus, with an interval scale, it is possible to identify whether a given attribute is
higher or lower than another and the extent to which one is higher or lower than
another.

SCALES OF MEASUREMENT
Interval Scale:
•The interval scale is the third level of measurement. Its arbitrary zero point has implications
for data manipulation and analysis.
•It is possible to add or subtract a constant to all of the interval-scale values without affecting the form of
the scale, but the values cannot be meaningfully multiplied or divided (their ratios carry no information).
•For instance, two persons with scale positions 4 and 5 are as far apart as persons with scale positions 9
and 10, but a person with a score of 10 does not feel twice as strongly as one with a score of 5.
•Similarly, 100°F cannot be described as twice as hot as 50°F, because the corresponding temperatures on
the centigrade scale, 37.78°C and 10°C, are not in the ratio 2:1.
•Unlike the ordinal and nominal scale, arithmetic operations such as addition and subtraction can be
performed on an interval scale.
•Any positive linear transformation of form Y = a + bX will preserve the properties of an interval scale
•The arithmetic mean, median, and mode can be used to calculate the central tendency in this scale.
•The measures of dispersion, such as range and standard deviation, can also be calculated.
•Apart from those techniques, product-moment correlation, t-test, and regression analysis are extensively
used for analyzing interval data.
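The Fahrenheit/Celsius point above can be checked numerically. A minimal sketch using the standard conversion C = (F − 32) × 5/9, showing that equal intervals survive the linear transformation while ratios do not:

```python
def f_to_c(f):
    """Convert Fahrenheit to Celsius - a positive linear transformation
    (Y = a + bX with a = -160/9, b = 5/9) between two interval scales."""
    return (f - 32) * 5 / 9

# Equal intervals stay equal (both are 10 F = 5.56 C):
print(f_to_c(100) - f_to_c(90))  # 5.555...
print(f_to_c(50) - f_to_c(40))   # 5.555...

# Ratios are NOT preserved, so "twice as hot" is meaningless:
print(100 / 50)                   # 2.0  on the Fahrenheit scale
print(f_to_c(100) / f_to_c(50))   # ~3.78 on the Celsius scale
```

The same comparison on a ratio scale (e.g. kelvin) would give the same ratio under any change of units, which is exactly the property interval scales lack.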

SCALES OF MEASUREMENT
Ratio Scale
•Can be defined as a quantitative scale that bears all the characteristics of an interval scale and
a ‘true zero’ or ‘absolute zero’, which implies the complete absence of the attribute being
measured.
•Thus it measures variables that can be labelled and ordered, have equal intervals, and possess the
‘absolute zero’ property.
•Before deciding to use a ratio scale, the researcher must observe whether the variables possess
all these characteristics.
•The variables such as length, age, weight, income, years of schooling, price etc., are examples of
a ratio scale.
•They do not have negative numbers because of the existence of an absolute zero point of origin.
•For instance, a price of zero means the commodity does not have any price (it is free); and there
cannot be any negative price.
•Thus ratio scale has a meaningful zero.
•It allows unit conversions, such as metres to feet or calories to joules.

SCALES OF MEASUREMENT
Ratio Scale:
•The ratio scale is the highest level of measurement. It is unique up to
a proportionality (similarity) transformation of the form Y = bX, with b > 0.
•The ‘absolute zero’ property allows performing a wide range of
descriptive and inferential statistics on ratio scale variables.
•It is possible to compare both differences in values and the relative
magnitude of values.
•For instance, the difference between 15cm and 20cm is the same as
between 30cm and 35cm, and 30 cm is twice as long as 15 cm.
•Arithmetic operations such as addition, subtraction, multiplication, and
division (ratio) can be performed in ratio scale data

SCALES OF MEASUREMENT
Ratio Scale:
•All statistical operations applicable to nominal, ordinal and
interval scale can be performed on ratio scale data as well.
•Besides, measures of central tendency such as geometric
mean and harmonic mean and all measures of dispersion,
including coefficient of variation, can be determined.
•Parametric tests such as independent sample t-test, paired
sample t-test, ANOVA etc., can also be performed.
•The ratio scale provides unique opportunities for statistical
analysis.
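The extra statistics the ratio scale permits can be sketched with Python's statistics module. The body weights (kg) below are invented for illustration:

```python
import statistics as st

# Hypothetical body weights in kg - a ratio-scale variable (true zero)
weights = [50.0, 60.0, 70.0, 80.0, 100.0]

mean = st.mean(weights)    # 72.0
sd = st.stdev(weights)     # sample standard deviation

# Coefficient of variation: meaningful only with a true zero,
# because it divides by the mean
cv = sd / mean
print(round(cv, 3))

# Geometric mean - also requires strictly positive ratio-scale data
print(round(st.geometric_mean(weights), 2))

# Ratios themselves are meaningful: 100 kg is twice as heavy as 50 kg
print(weights[4] / weights[0])  # 2.0
```

None of these quantities (CV, geometric mean, value ratios) would be defensible on an interval scale such as °C, where the zero point is arbitrary.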

SCALES OF MEASUREMENT
Scale      Properties
Nominal    Categories
Ordinal    Categories, Rank
Interval   Categories, Rank, Equal intervals
Ratio      Categories, Rank, Equal intervals, True (absolute) zero

CROSS TABULATION
                    Body weight
            Normal   Overweight   Total
Gender
  Male        10         15         25
  Female      15         10         25
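A cross tabulation like the one above can be built from raw records with a plain Counter. The individual records below are invented to reproduce the cell counts shown in the table:

```python
from collections import Counter

# Hypothetical individual records (gender, body-weight category),
# constructed to match the cell counts in the table above
records = (
    [("Male", "Normal")] * 10 + [("Male", "Overweight")] * 15
    + [("Female", "Normal")] * 15 + [("Female", "Overweight")] * 10
)

table = Counter(records)
print(table[("Male", "Overweight")])  # 15 - one cell of the 2x2 table

# Row total for males (margin of the table)
male_total = sum(v for (g, w), v in table.items() if g == "Male")
print(male_total)  # 25

# Grand total
print(sum(table.values()))  # 50
```

In practice a statistics package (e.g. a pandas crosstab) would produce the full table with margins in one call; the Counter version just makes the counting explicit.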

SOURCES OF DATA
Three main sources for demographic and social statistics
•Censuses
•Surveys
•Administrative records.
A population census
•The total process of collecting, compiling, evaluating, analysing and publishing or otherwise
disseminating demographic, economic and social data pertaining, at a specified time, to all persons
in a country or in a well-delimited part of a country.
•The census collects data from each individual and each set of living quarters for the whole country
or area.
•It allows estimates to be produced for small geographic areas and for population subgroups.
•It also provides the base population figures needed to calculate vital rates from civil registration
data, and it supplies the sampling frame for sample surveys.

SOURCES OF DATA
Population census steps
•Securing the required legislation, political support and funding
•Mapping and listing all households
•Planning and printing questionnaires, instruction manuals and procedures
•Planning for shipping census materials
•Recruiting and training census personnel
•Organizing field operations
•Launching publicity campaigns
•Preparing for data processing
•Planning for tabulation

SOURCES OF DATA
Population census data
•Because of the expense and complexity of the census, only the most basic items
are included on the questionnaire for the whole population.
•Choosing these items requires considering the needs of data users; availability
of the information from other data sources; international comparability;
willingness of the respondents to give information; and available resources to
fund the census.
•Many countries carry out a sample enumeration in conjunction with the census.
•This can be a cost-effective way to collect more detailed information on
additional topics from a sample of the population.
•The sample enumeration uses the infrastructure and facilities that are already in
place for the census.

SOURCES OF DATA
Surveys
•A continuing program of intercensal household surveys is useful for
collecting detailed data on social, economic and housing characteristics
that are not appropriate for collection in a full-scale census.
•Household-based surveys are the most flexible type of data collection.
•They can examine most subjects in detail and provide timely information
about emerging issues.
•They increase the ability and add to the experience of in-house technical
and field staff and maintain resources that have already been
developed, such as maps, sampling frame, field operations, infrastructure
and data-processing capability.

SOURCES OF DATA
Surveys
•The many types of household surveys include multi-
subject surveys, specialized surveys, multi-phase surveys
and panel or longitudinal surveys.
•Each type of survey is appropriate for certain kinds of
data-collection needs.
•Household surveys can be costly to undertake, especially
if a country has no ongoing program

SOURCES OF DATA
Administrative records
•Administrative records are statistics compiled from various administrative
processes.
•They include not only the vital events recorded in a civil registration system but
also education statistics from school records; health statistics from hospital
records; employment statistics; and many others.
•The reliability and usefulness of these statistics depend on the completeness of
coverage and the compatibility of concepts, definitions and classifications with
those used in the census.
•Administrative records are often by-products of administrative processes, but
they can also be valuable complementary sources of data for censuses and
surveys.

SOURCES OF DATA
Administrative records
•Birth certificates
•Death certificates
•Patient medical records
•Disease registries
•Insurance claim forms
•Billing records
•Public health department case reports