INTRODUCTION TO
DATA ANALYSIS
Rosal Jane G. Ruda-Bayor
WHAT IS DATA ANALYSIS?
Introduction to Data Analysis 2
Data Analysis
Probability and
Statistics
Data analysis is the process of inspecting, presenting and
reporting data in a way that is useful to non-technical
people. Data analysis uses probability and statistical
tools to analyze data from a sample.
While data analysts observe trends and patterns in data,
statistics validates those theories using the scientific
process.
Hence, data analysis acts as a translator between numbers
and figures and the people who need to know about them.
INTRODUCTION TO
STATISTICS
STATISTICS DEFINED
Statistics is defined as the “science of collecting, organizing,
summarizing, and analyzing information to draw a conclusion
or answer questions”. It provides a measure of confidence in any
conclusion.
Statistics is also about where
numbers come from and how closely
they reflect reality.
Collection of information
Organization and summary of information
Analysis to draw conclusions
Reports should be based on a measure of
confidence
STATISTICS: DEALING WITH DATA
INFORMATION = DATA
Data – is defined as a “fact or proposition used to draw a conclusion or make a decision”. It can be
numerical or nonnumerical. It describes characteristics of an individual.
Data is POWERFUL
Proper data analysis can be used to
disprove unfounded claims.
Data is multidimensional
A good statistical analysis knows how
to deal with lurking variables.
Data is varied
Statistics helps understand variability
and its sources.
“In mathematics, when a problem is solved
correctly, the results can be reported with 100%
certainty. In statistics, the results do not have
100% certainty.”
Understanding concepts in probability, statistics
and data analysis will give us the ability to
analyze and criticize information.
SAMPLE AND POPULATION
Suppose you want to study the number of hours MSU-IIT students spend on social media
(Facebook, Twitter, TikTok, etc.).
You interviewed 150 students and asked them how much time they spend on social media every
day. The results indicate a mean of 4.25 hours per day with a standard deviation of 2.88 hours.
The population is the complete collection of subjects
or things in which we are interested. It is the entire
group to be studied.
A sample is a subset of the population. The size of the
population is denoted N and the size of the sample is
denoted n, where $n \le N$.
STATISTIC AND PARAMETER
A parameter is a characteristic from a
complete collection of subjects or
things in which we are interested.
Parameters are often unknown and may
need to be estimated from a statistic.
A statistic is a characteristic from a
subset of the population of interest.
Statistics are often used to estimate
parameter values.
Often Greek letters are used to denote
parameters and “decorated” letters with
a “hat” or a “bar” are statistics.
statistics                                    parameters
Statistic = $\hat{\theta}$                    Parameter = $\theta$
sample proportion = $\hat{p}$                 population proportion = $p$
sample mean = $\bar{x}$                       population mean = $\mu$
sample standard deviation = $s$               population standard deviation = $\sigma$
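The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The population here is simulated and purely hypothetical (the seed, the population size of 5000, and the generating distribution are assumptions made for illustration, not data from the survey described earlier).

```python
import random
import statistics

# Hypothetical population: daily social-media hours for N = 5000 students.
# The values are simulated purely for illustration.
rng = random.Random(42)
population = [max(0.0, rng.gauss(4.0, 2.5)) for _ in range(5000)]

# Parameters describe the whole population (usually unknown in practice).
mu = statistics.fmean(population)       # population mean
sigma = statistics.pstdev(population)   # population standard deviation

# Statistics describe a sample of size n <= N and estimate the parameters.
sample = rng.sample(population, 150)
x_bar = statistics.fmean(sample)        # sample mean
s = statistics.stdev(sample)            # sample standard deviation

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"sigma = {sigma:.2f}, s = {s:.2f}")
```

In a real study only the sample values would be available; the population lines are included here only to show what the statistics are trying to estimate.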
STATISTICAL INFERENCE
Statistical inference is the process of using known sampled information to form a conclusion about
unknown population characteristics.
DESCRIPTIVE AND INFERENTIAL STATISTICS
Inferential statistics extends the results of your sample to your population. This generalization
contains uncertainty because a sample cannot tell us everything about a population.
DESCRIPTIVE
• Organizes, summarizes and presents the data in a meaningful manner
• Shown through graphs, charts, tables
• Describes data which is already known
• Tools: measures of central tendency, mean/median/mode
INFERENTIAL
• Compares, tests, and predicts future outcomes or makes estimates
• Utilizes probability scores
• Tries to make a conclusion about the population that is beyond the data
• Tools: hypothesis test, ANOVA, goodness-of-fit test, etc.
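As a rough illustration of moving from description to inference, the sketch below takes the descriptive summary from the social-media example (n = 150, mean 4.25 hours, standard deviation 2.88 hours) and builds an approximate 95% confidence interval for the population mean. The function name is hypothetical, and the normal approximation and the assumption of a simple random sample are choices made for this example, not something the slides specify.

```python
import math
from statistics import NormalDist

# Descriptive statistics reported for the sample in the social-media example.
n = 150        # sample size
x_bar = 4.25   # sample mean (hours per day)
s = 2.88       # sample standard deviation (hours per day)

def mean_confidence_interval(x_bar, s, n, confidence=0.95):
    """Approximate CI for the population mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * s / math.sqrt(n)                   # margin of error
    return x_bar - margin, x_bar + margin

low, high = mean_confidence_interval(x_bar, s, n)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")  # (3.79, 4.71)
```

The interval is the "measure of confidence" attached to the estimate: the sample mean describes the 150 students, while the interval is a statement about all students in the population.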
CONCEPT CHECK
A survey of 100 individuals ages 18-65 showed that 63%
believe they are poor.
Population:
Sample:
Parameter:
Statistic:
PROCESS OF STATISTICS
1. Identify the research objective. A researcher must determine the questions he or she wants answered. The
questions must clearly identify the population that is to be studied.
2. Collect the data needed. Collecting data on the whole population is often impractical and expensive, so a
sample is typically used. However, appropriate data collection techniques must also be followed.
3. Describe the data. Describe the data collected using numerical and visual tools. This gives us an
overview of the data and can help us determine which statistical tools to use
for inference.
4. Perform inference. Apply the appropriate techniques to extend the results obtained from the
sample to the population and report a level of reliability of the results.
QUALITATIVE AND QUANTITATIVE VARIABLES
Variables – characteristics of an individual within the population.
Variable Types
Qualitative
• Variables which are non-measurable
characteristics of an individual.
• Classification based on some
attribute or characteristic.
Example: hair color, address, gender,
rating
Quantitative
• Variables whose values result from
counting or measuring something
• Can be discrete or continuous
Example: weight, amount of rain, height,
temperature
Variables are not constant and vary.
DISCRETE AND CONTINUOUS VARIABLE
Quantitative variables can be further classified into discrete or continuous.
Quantitative Variable Types
Discrete
• Has either a finite number of possible
values or a countable number of
possible values.
• Cannot take on every possible value
between any two values
• Value is determined from counting
Example: number of children in a
family, number of students in a class
Continuous
• Has an infinite number of possible
values that are not countable.
• Can take on every possible value
between any two values
• Value is determined from
measurement
Example: weight, amount of rain, height
LEVELS OF MEASUREMENT
To establish relationships between variables, researchers must observe the variables and record
their observations. This requires that the variables be measured. The process of measuring a
variable requires a set of categories called a scale of measurement and a process that classifies
each individual into one category.
Levels of Measurement
1. A nominal scale is an unordered set of categories identified only by name. Nominal measurements
only permit you to determine whether two individuals are the same or different. (Ex. Eye color,
brand)
2. An ordinal scale is an ordered set of categories. Ordinal measurements tell you the direction of
difference between two individuals. It allows for the values to be arranged or ranked. (Ex. Letter grade)
3. An interval scale is an ordered series of equal-sized categories. Interval measurements identify the
direction and magnitude of a difference. The zero point is located arbitrarily on an interval scale. It
has the properties of the ordinal level of measurement, but the differences between values have meaning.
(Ex. Temperature in °C or °F)
4. A ratio scale is an interval scale where a value of zero indicates none of the variable. Ratio
measurements identify the direction and magnitude of differences and allow ratio comparisons of
measurements. (Ex. Height, weight)
OBSERVATIONAL AND
EXPERIMENTAL STUDIES
EXAMPLE
Cellular Phones and Brain Tumors
❖ In a study by Benson et al. (2013), the researchers followed a
sample of middle-aged women in the United Kingdom for 7 years.
The researchers compared the women who never used a mobile
phone with those who used one and found no significant
difference in the incidence rate of brain tumors between the two
groups.
❖ Researchers from the United States National Toxicology Program
conducted a study to address the concern of brain tumor
incidence due to radio-frequency radiation (RFR). Since it is
unethical to purposely expose humans to a potential carcinogen,
rats were used. 90 rats were randomly assigned to three possible
groups: control group, GSM-modulated RFR, CDMA-modulated
RFR. Although brain tumors were found in Groups 2 and
3, the incidence was not statistically different from that of the control group.
OBSERVATIONAL STUDY VS. EXPERIMENT
Once the research objective is determined, the researcher develops a method for collecting data.
Basis of Collecting Data
An observational study measures the value of the response variable without attempting to influence
the value of either the response or explanatory variables. That is, in an observational study, the
researcher observes the behavior of the individuals without trying to influence the outcome of the study.
If a researcher randomly assigns the individuals in a study to groups, intentionally manipulates the
value of an explanatory variable, controls other explanatory variables at fixed values, and then
records the value of the response variable for each individual, the study is a designed experiment.
EXAMPLE
Do Flu Shots Benefit Seniors?
Researchers wanted to determine the long-term benefits of the
influenza vaccine on seniors aged 65 years and older by looking at
records of over 36,000 seniors for 10 years. The seniors were divided
into two groups. Group 1 were seniors who chose to get a flu
vaccination shot, and group 2 were seniors who chose not to get
a flu vaccination shot. After observing the seniors for 10 years, it was
determined that seniors who get flu shots are 27% less likely to be
hospitalized for pneumonia or influenza and 48% less likely to die
from pneumonia or influenza.
Source: Kristin L. Nichol, MD, MPH, MBA, James D. Nordin, MD, MPH, David B. Nelson, PhD, John P.
Mullooly, PhD, Eelko Hak, PhD. “Effectiveness of Influenza Vaccine in the Community-Dwelling
Elderly,” New England Journal of Medicine 357:1373–1381, 2007.
WHICH IS BETTER? OBSERVATIONAL OR EXPERIMENT
Several factors may have contributed to the results of the response variable.
Confounding in a study occurs when the effects of two or more explanatory variables are not
separated. Therefore, any relation that may exist between an explanatory variable and the response
variable may be due to some other variable or variables not accounted for in the study.
In our example, the lower hospitalization and death rates could be caused by factors
other than the flu shot, such as race, gender, etc.
Confounding is often caused by a lurking variable. A lurking variable is an explanatory variable that
was not considered in a study, but that affects the value of the response variable in the study. In
addition, lurking variables are typically related to explanatory variables considered in the study.
Observational studies do not allow a researcher to claim causation, only association.
WHICH IS BETTER? OBSERVATIONAL OR EXPERIMENT
Designed experiments are used whenever control of certain variables is possible and desirable.
This type of research allows researchers to identify certain cause and effect relationships
among variables in the study.
A confounding variable is an explanatory variable that was considered in a study whose effect cannot
be distinguished from a second explanatory variable in the study.
Reasons why observational studies are conducted over designed experiments:
• Ethics
• Greater timeliness, lower cost and broader range of patients
Main difference between lurking and confounding variable:
• Lurking variables are not considered in the study
• Confounding variables are considered in the study, but their effects cannot be separated from
those of other explanatory variables
TYPES OF OBSERVATIONAL STUDIES
Cross-sectional study – a type of study which collects information about individuals at a specific point
in time or over a very short period of time.
Case-control study – These studies are retrospective, meaning that they require individuals to look
back in time or require the researcher to look at existing records. In case-control studies, individuals
who have a certain characteristic may be matched with those who do not.
Disadvantage: accuracy of information being recalled, truthfulness
Cohort study – A cohort study first identifies a group of individuals to participate in the study (the
cohort). The cohort is then observed over a long period of time. During this period, characteristics
about the individuals are recorded and some individuals will be exposed to certain factors (not
intentionally) and others will not. At the end of the study the value of the response variable is recorded
for the individuals. It is prospective in nature.
Disadvantage: individuals may not continue, expensive
Census – a list of all individuals in a population along with certain characteristics of each individual.
SAMPLING METHODS
SAMPLING
Random sampling is the process of using chance to select individuals from a population to be
included in the sample.
If convenience is used to obtain a sample, the results of a survey are meaningless.
Simple random sampling – Every possible sample of size n has an equally likely chance of
occurring.
Stratified sampling – Obtained by separating the population into nonoverlapping groups called
strata, then obtaining a random sample from each stratum.
Systematic sampling – Obtained by selecting every kth individual from the population. The first
individual selected corresponds to a random number between 1 and k.
Cluster sampling – Obtained by selecting all individuals within a randomly selected collection or
group of individuals.
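Three of these methods can be sketched with Python's standard `random` module. The population below is hypothetical (12 made-up students tagged with a college), and for brevity the same college label is used both as the stratum and as the cluster; in practice strata and clusters are usually defined differently. Function names are illustrative.

```python
import random

rng = random.Random(7)

# Hypothetical population: 12 students, each tagged with a college (illustrative).
population = [(f"student_{i:02d}", college)
              for i, college in enumerate(["Eng", "Sci", "Arts"] * 4, start=1)]

def simple_random_sample(units, n):
    """Every possible sample of size n is equally likely."""
    return rng.sample(units, n)

def stratified_sample(units, n_per_stratum):
    """Draw a simple random sample from each stratum (here, each college)."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[1], []).append(unit)
    return [u for group in strata.values()
            for u in rng.sample(group, n_per_stratum)]

def cluster_sample(units, n_clusters):
    """Randomly pick whole clusters and keep every individual in them."""
    clusters = {}
    for unit in units:
        clusters.setdefault(unit[1], []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for name in chosen for u in clusters[name]]

print(simple_random_sample(population, 5))
print(stratified_sample(population, 2))
print(cluster_sample(population, 1))
```

Note the contrast: stratified sampling takes a few individuals from every group, while cluster sampling takes every individual from a few groups.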
SYSTEMATIC SAMPLING
1. Approximate the population size, N.
2. Determine the sample size desired, n.
3. Compute N/n and round down to the nearest integer. This value is
k.
4. Randomly select a number between 1 and k. Call this number p.
5. The sample will consist of the following individuals:
p, p + k, p + 2k, ..., p + (n − 1)k
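The steps above can be sketched in Python as follows. The function name and the example roster are hypothetical, and the sketch assumes the population is available as an ordered list whose length is N.

```python
import random

def systematic_sample(population, n, rng=None):
    """Select n individuals at positions p, p+k, ..., p+(n-1)k (1-indexed)."""
    rng = rng or random.Random()
    N = len(population)              # step 1: population size
    if not 0 < n <= N:
        raise ValueError("sample size n must satisfy 0 < n <= N")
    k = N // n                       # step 3: N/n rounded down
    p = rng.randint(1, k)            # step 4: random start between 1 and k
    # Step 5: convert the 1-indexed positions to 0-indexed list indices.
    return [population[p - 1 + i * k] for i in range(n)]

# Hypothetical roster of 100 individuals numbered 1 to 100.
roster = list(range(1, 101))
print(systematic_sample(roster, 10, random.Random(0)))
```

With N = 100 and n = 10, k is 10, so the sample is every 10th individual starting from a random position between 1 and 10.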
CLUSTER SAMPLING
Important questions to ask in cluster sampling
1. How do I cluster a population?
2. How many clusters do I sample?
3. How many individuals should be in each cluster?
If clusters are homogeneous, it is better to have more clusters
with fewer individuals in each cluster.
Heterogeneous clusters likely resemble the heterogeneity of the
population, so fewer clusters with more individuals in each cluster are appropriate.
BIAS IN SAMPLING
If the results of the sample are not representative of the population, then the sample is biased.
Sources of Bias in Sampling
1. Sampling Bias – means that the technique used to obtain the individuals in the sample tends to favor one part of the
population over another. This results in undercoverage, which occurs when the proportion of one segment of the
population is lower in a sample than in the population.
2. Nonresponse Bias – exists when individuals selected to be in the sample who do not respond to the survey have
different opinions from those who do. This happens if individuals selected do not respond or cannot be contacted.
Callbacks and rewards can be used to counter nonresponse.
3. Response Bias – exists when the answers on a survey do not reflect the true feelings of the respondent.
a) Interviewer error – trained interviewers can help respondents be truthful
b) Misrepresented answers – some questions may result in misrepresentation (survey of salary, etc.)
c) Wording of questions – questions should be asked in a balanced form and should not be vague
d) Order of questions – prior questions may affect the way respondents answer the questions that follow
e) Type of question – open vs. closed questions
f) Data entry error
DESIGN OF EXPERIMENTS
CHARACTERISTICS OF AN EXPERIMENT
An experiment is a controlled study conducted to determine the effect varying one or more
explanatory variables or factors has on a response variable. Any combination of the values of the factors
is called a treatment.
A factor is a characteristic that differentiates each group or population. A factor can have
two or more levels.
The treatment is a combination of factors and/or levels of factors. Treatment combinations are
applied to the experimental units.
The response is the measured outcome taken from the experimental units.
A control group serves as a baseline treatment against which other treatments can be compared.
Replication – Replication is the repetition of an experiment on more than one individual.
Blinding – Blinding is a technique in which the subject doesn’t know whether he or she is receiving a
treatment or a placebo, to avoid bias.
Double-blinding – neither the researcher nor the subject knows who gets the treatment and who gets the placebo.
PRINCIPLES OF EXPERIMENTAL
DESIGN
Replicate
• Replicate experimental units in each treatment
group to estimate variability.
• More experimental units reduce chance variability.
• Replicate overall experiment to validate results.
Control
• Minimize external sources of variation
among experimental units such that the only
source of variation is the treatment.
• Compare two or more treatments to better
understand an effect.
Randomize
• Use chance to assign experimental units to
treatments.
• Reduces bias due to unknown sources of
variation.
If the experiment concludes that there are differences among treatment groups, the
results may be referred to as statistically significant. Statistical significance means
that the observed effect is so large that it would rarely occur by chance.
EXAMPLE
• A manufacturer of a coating formulation wants to know the effect of using a
coating on the corrosion rate of metal roofing.
Identify the following for the above study:
• Factor and level
• Treatment
• Experimental units
• Response
Factor
• Coating
• Level: Coating, No coating
Treatment
• With Coating
• Without Coating
Experimental Units
• Metal roofing
Response
• Corrosion rate

No Coating     With Coating
Treatment 1    Treatment 2

What possible confounding variable can you think of that may affect the results of
the study?
How about HUMIDITY?
EXAMPLE
• A manufacturer of a coating formulation wants to know the effect of using a coating on the corrosion rate
of metal roofing. In order to account for humidity, the metal roofs were also subjected to atmospheres with
20% humidity and 80% humidity.
Factor 1
• Coating
• Level: Coating, No coating
Factor 2
• Humidity
• Level: 20%, 80%
Experimental Units
• Metal roofing
Response
• Corrosion rate

                No Coating     With Coating
20% Humidity    Treatment 1    Treatment 2
80% Humidity    Treatment 3    Treatment 4
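Because a treatment is a combination of factor levels, the four treatments in the table can be enumerated mechanically. A minimal sketch using `itertools.product` (the variable names are illustrative; the levels and numbering follow the table above):

```python
from itertools import product

# Factors and levels from the coating example.
humidity_levels = ["20% Humidity", "80% Humidity"]
coating_levels = ["No Coating", "With Coating"]

# Each treatment is one combination of factor levels, numbered row by row.
treatments = {
    f"Treatment {i}": combo
    for i, combo in enumerate(product(humidity_levels, coating_levels), start=1)
}

for name, (humidity, coating) in treatments.items():
    print(name, "->", humidity, "+", coating)
```

With two factors at two levels each there are 2 × 2 = 4 treatments; adding a third level of humidity would give 3 × 2 = 6.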
COMPLETELY RANDOMIZED DESIGN
A completely randomized design is one in which each experimental unit is randomly assigned to a
treatment.
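A minimal sketch of such an assignment, assuming eight hypothetical roofing panels and the two coating treatments from the earlier example. Shuffling and then dealing units out in turn is one common way to get equal-sized groups; it is a choice made for this sketch, not a requirement of the design.

```python
import random
from collections import Counter

def completely_randomized_design(units, treatments, rng=None):
    """Randomly assign units to treatments in (nearly) equal-sized groups."""
    rng = rng or random.Random()
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the treatments in round-robin order.
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

# Hypothetical experimental units: 8 metal roofing panels.
panels = [f"panel_{i}" for i in range(1, 9)]
assignment = completely_randomized_design(
    panels, ["No Coating", "With Coating"], random.Random(1))
print(Counter(assignment.values()))
```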
RANDOMIZED BLOCK DESIGN
A completely randomized block design is used when units share an observed characteristic that may
introduce unwanted variation. The homogeneous units are grouped into blocks based on that unavoidable
characteristic. Completely randomized experiments are conducted within the blocks. Example: Testing
different brands
Treatments must be randomly assigned within the block to avoid
confounded variables. If variables are confounded, their treatment effects
cannot be distinguished from each other.
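Randomizing within blocks can be sketched as below. The panels, the supplier labels used as the blocking characteristic, and the function name are all hypothetical, chosen only to continue the coating example.

```python
import random
from collections import defaultdict

def randomized_block_design(units, block_of, treatments, rng=None):
    """Group units into blocks, then randomize treatments within each block."""
    rng = rng or random.Random()
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)   # a separate randomization inside every block
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical units: (panel id, supplier); supplier is the blocking characteristic.
panels = [(i, "supplier_A" if i <= 4 else "supplier_B") for i in range(1, 9)]
assignment = randomized_block_design(
    panels, lambda p: p[1], ["No Coating", "With Coating"], random.Random(3))
for unit, treatment in sorted(assignment.items()):
    print(unit, "->", treatment)
```

Because every treatment appears equally often inside each block, a difference between suppliers cannot be mistaken for a treatment effect.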
MATCHED-PAIR DESIGN
A matched-pairs design is an experimental design in which the experimental units are paired up. The
pairs are selected so that they are related in some way (that is, the same person before and after a
treatment, twins, husband and wife, same geographical location, and so on). There are only two levels of
treatment in a matched-pairs design.
EXAMPLE
An educational psychologist wants to determine whether listening to music has an effect on a student’s
ability to learn. Design an experiment to help the psychologist answer the question.
Approach: Use a matched-pairs design by matching students according to IQ and gender (just in case
gender plays a role in learning with music).
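One possible way to carry out this approach is sketched below. The student records are made up, and the pairing rule (same gender, adjacent IQ scores after sorting) is an assumption chosen for illustration; the slides do not say how closely the IQ scores must match. A student left over in a group of odd size is simply not paired.

```python
import random

def matched_pairs_design(students, rng=None):
    """Pair students of the same gender with adjacent IQ; randomize within pairs."""
    rng = rng or random.Random()
    by_gender = {}
    for student in students:
        by_gender.setdefault(student["gender"], []).append(student)
    assignment = {}
    for group in by_gender.values():
        group.sort(key=lambda st: st["iq"])
        # Adjacent students after sorting have the closest IQ scores.
        for a, b in zip(group[0::2], group[1::2]):
            first, second = rng.sample([a, b], 2)   # coin flip within the pair
            assignment[first["name"]] = "music"
            assignment[second["name"]] = "no music"
    return assignment

# Hypothetical students (names, IQ scores, and genders are made up).
students = [
    {"name": "A", "iq": 101, "gender": "F"}, {"name": "B", "iq": 118, "gender": "F"},
    {"name": "C", "iq": 103, "gender": "F"}, {"name": "D", "iq": 120, "gender": "F"},
    {"name": "E", "iq": 95, "gender": "M"}, {"name": "F", "iq": 97, "gender": "M"},
]
assignment = matched_pairs_design(students, random.Random(5))
print(assignment)
```

Here the pairs are (A, C), (B, D), and (E, F); within each pair one student studies with music and the other without, so the two treatment levels are compared on similar students.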