Advanced statistics

khang7772000 26,790 views 91 slides May 26, 2013


Prof. JOY V. LORIN-PICAR
DAVAO DEL NORTE STATE COLLEGE
NEW VISAYAS, PANABO CITY

TOPIC OUTLINE
PART 1
Role of Statistics in Research
Descriptive Statistics
Hands-On Statistical Software
Sample and Population
Sampling Procedures
Sample Size
Hands-On Statistical Software
Inferential Statistics
Hypothesis Testing
Hands-On Statistical Software

TOPIC OUTLINE
PART 2
Choice of Statistical Tests
Defining Independent and Dependent Variables
Hands-On Statistical Software
Scales of Measurements
How many Samples / Groups are in the Design
PART 3
Parametric Tests
Hands-On Statistical Software
PART 4
Non-Parametric Tests
Hands-On Statistical Software

TOPIC OUTLINE
PART 5
Goodness of Fit
Hands-On Statistical Software
PART 6
Choosing the Correct Statistical Tests
Hands-On Statistical Software
Introduction to Multiple and Non-Linear Regression
Hands-On Statistical Software

Role of Statistics in Research
Statistics is normally used to analyze data.
It helps organize and make sense of large amounts of data.
This is basic to the intelligent reading of research articles.
It has made significant contributions in the social sciences, applied sciences, and even business and economics.
Statistical research makes inferences about population characteristics on the basis of one or more samples that have been studied.

How is Statistics Looked Into?
1. Descriptive – gives us information about, or simply describes, the sample we are studying.
2. Correlational – enables us to relate variables and establish relationships between and among variables, which are useful in making predictions.
3. Inferential – goes beyond the sample and makes inferences about the population.

Descriptive Statistics
N – total population/sample size from any given population
Example
Minutes Spent on the Phone
102 124 108  86 103  82
 71 104 112 118  87  95
103 116  85 122  87 100
105  97 107  67  78 125
109  99 105  99 101  92

Example 2
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Range, Mean, Median and Mode
The terms mean, median, mode, and range describe
properties of statistical distributions. In statistics, a distribution
is the set of all possible values for terms that represent defined
events. The value of a term, when expressed as a variable, is
called a random variable. There are two major types of
statistical distributions. The first type has a discrete random
variable. This means that every term has a precise, isolated
numerical value. An example of a distribution with a discrete
random variable is the set of results for a test taken by a class in
school. The second major type of distribution has a continuous
random variable. In this situation, a term can acquire any
value within an unbroken interval or span. Such a distribution
is called a probability density function. This is the sort of
function that might, for example, be used by a computer in an
attempt to forecast the path of a weather system.

Mean
The most common expression for the mean of a statistical
distribution with a discrete random variable is the
mathematical average of all the terms. To calculate it,
add up the values of all the terms and then divide by the
number of terms. This expression is also called the
arithmetic mean. There are other expressions for the
mean of a finite set of terms but these forms are rarely
used in statistics. The mean of a statistical distribution
with a continuous random variable, also called the
expected value, is obtained by integrating the product of
the variable with its probability as defined by the
distribution. The expected value is denoted by the
lowercase Greek letter mu (µ).

Median
The median of a distribution with a discrete random variable
depends on whether the number of terms in the distribution is
even or odd. If the number of terms is odd, then the median is
the value of the term in the middle. This is the value such that
the number of terms having values greater than or equal to it is
the same as the number of terms having values less than or
equal to it. If the number of terms is even, then the median is
the average of the two terms in the middle, such that the
number of terms having values greater than or equal to it is the
same as the number of terms having values less than or equal to
it. The median of a distribution with a continuous random
variable is the value m such that the probability is at least
1/2 (50%) that a randomly chosen point on the function
will be less than or equal to m, and the probability is at
least 1/2 that a randomly chosen point on the function will
be greater than or equal to m.

Mode
The mode of a distribution with a discrete random
variable is the value of the term that occurs the most
often. It is not uncommon for a distribution with a
discrete random variable to have more than one mode,
especially if there are not many terms. This happens when
two or more terms occur with equal frequency, and more
often than any of the others. A distribution with two
modes is called bimodal. A distribution with three modes
is called trimodal. The mode of a distribution with a
continuous random variable is the maximum value of
the function. As with discrete distributions, there may be
more than one mode.

Range
The range of a distribution with a discrete random variable is the difference between the maximum value and the minimum value. For a distribution with a continuous random variable, the range is the difference between the two extreme points on the distribution curve, where the value of the function falls to zero. For any value outside the range of a distribution, the value of the function is equal to 0.
The range is the least reliable of these measures and is used only when one is in a hurry to get a measure of variability.
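As a sketch, the four measures above can be computed on the phone-minutes data from the first example with Python's standard statistics module:

```python
import statistics

# Minutes-spent-on-the-phone data from the first example (n = 30)
minutes = [102, 124, 108, 86, 103, 82,
           71, 104, 112, 118, 87, 95,
           103, 116, 85, 122, 87, 100,
           105, 97, 107, 67, 78, 125,
           109, 99, 105, 99, 101, 92]

n = len(minutes)                       # N, the sample size
mean = statistics.mean(minutes)        # arithmetic mean
median = statistics.median(minutes)    # average of the two middle terms (n is even)
modes = statistics.multimode(minutes)  # all values tied for the highest frequency
value_range = max(minutes) - min(minutes)

print(n, round(mean, 2), median, sorted(modes), value_range)
# 30 99.63 101.5 [87, 99, 103, 105] 58
```

Note that this dataset is multimodal: four values each occur twice, so multimode returns all of them.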

Variance
The variance measures spread as the average squared deviation from the mean: for a population, σ² = Σ(x − µ)² / N; for a sample, s² = Σ(x − x̄)² / (n − 1).
Standard Deviation
The standard deviation formula is very simple: it
is the square root of the variance. It is the most
commonly used measure of spread.
An important attribute of the standard deviation
as a measure of spread is that if the mean and
standard deviation of a normal distribution
are known, it is possible to compute the
percentile rank associated with any given score.

Standard Deviation
In a normal distribution, about 68% of the
scores are within one standard deviation of the
mean and about 95% of the scores are within
two standard deviations of the mean.
The standard deviation has proven to be an
extremely useful measure of spread in part
because it is mathematically tractable. Many
formulas in inferential statistics use the
standard deviation.
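A minimal sketch of these ideas in Python: the sample variance and standard deviation of the phone-minutes data, plus the percentile rank of a score under an assumed normal distribution, computed from the standard normal CDF:

```python
import math
import statistics

scores = [102, 124, 108, 86, 103, 82, 71, 104, 112, 118, 87, 95,
          103, 116, 85, 122, 87, 100, 105, 97, 107, 67, 78, 125,
          109, 99, 105, 99, 101, 92]

variance = statistics.variance(scores)  # sample variance, divisor n - 1
sd = statistics.stdev(scores)           # standard deviation = sqrt(variance)
mean = statistics.mean(scores)

def percentile_rank(x, mu, sigma):
    """Percentile rank of score x under a normal curve, via the standard normal CDF."""
    z = (x - mu) / sigma
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

# A score at the mean sits at the 50th percentile; a score one standard
# deviation above the mean sits near the 84th (consistent with the 68% rule).
assert round(percentile_rank(mean, mean, sd)) == 50
assert round(percentile_rank(mean + sd, mean, sd)) == 84
```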

Coefficient of Variation
The coefficient of variation expresses the standard deviation as a percentage of the mean, CV = (s / x̄) × 100%, allowing the spread of variables measured on different scales to be compared.

Kurtosis

KURTOSIS - refers to how sharply peaked
a distribution is. A value for kurtosis is included
with the graphical summary:
· Values close to 0 indicate normally peaked
data.
· Negative values indicate a distribution that is
flatter than normal.
· Positive values indicate a distribution with a
sharper than normal peak.

Skewness
SKEWNESS - refers to the asymmetry of a distribution:
· Values close to 0 indicate symmetric data.
· Negative values indicate a distribution with a longer left tail.
· Positive values indicate a distribution with a longer right tail.
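A minimal moment-based sketch of skewness and excess kurtosis, consistent with the sign conventions used on these slides:

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for symmetric data)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """m4 / m2**2 - 3: negative = flatter than normal, positive = sharper peak."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))                   # 0.0 for perfectly symmetric data
print(round(excess_kurtosis(symmetric), 2))  # -1.3: flatter than a normal curve
```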

Samples and Population
Population – as used in research, refers to all the members of a particular group.
It is the group of interest to the researcher.
This is the group to whom the researcher would like to generalize the results of a study.

A target population is the actual population to whom the researcher would like to generalize.
An accessible population is the population to whom the researcher is entitled to generalize.

SAMPLING
This is the process of selecting the individuals who will participate in a research study.
A sample is any part of the population of individuals from whom information is obtained.
A representative sample is a sample that is similar to the population to whom the researcher is entitled to generalize.

PROBABILITY AND NON-PROBABILITY SAMPLING
A sampling procedure that gives every element of the population a (known) nonzero chance of being selected in the sample is called probability sampling. Otherwise, the sampling procedure is called non-probability sampling.
Whenever possible, probability sampling is used, because there is no objective way of assessing the reliability of inferences under non-probability sampling.

METHODS OF PROBABILITY SAMPLING
1. Simple random sampling
2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
5. Two-stage random sampling

Simple Random Sampling
This is a sample selected from
a population in such a manner
that all members of the
population have an equal
chance of being selected

Stratified Random Sampling
Sample selected so that certain
characteristics are represented in
the sample in the same proportion
as they occur in the population

Cluster Random Sample
This is obtained by using
groups as the sampling unit
rather than individuals.

Two-Stage Random Sample
Selects groups randomly and
then chooses individuals
randomly from these groups.
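The probability-sampling methods above can be sketched with Python's random module; the 100-member sampling frame and the male/female strata below are hypothetical:

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible
population = list(range(1, 101))  # hypothetical sampling frame of 100 members

# Simple random sample: every member has an equal chance of selection.
srs = random.sample(population, 10)

# Systematic sample: every 10th member after a random start.
start = random.randrange(10)
systematic = population[start::10]

# Stratified sample: each stratum is represented in the same
# proportion (here 10%) as it occurs in the population.
strata = {"male": list(range(1, 61)), "female": list(range(61, 101))}  # 60/40 split
stratified = [member for group in strata.values()
              for member in random.sample(group, len(group) // 10)]

print(len(srs), len(systematic), len(stratified))  # 10 10 10
```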

Non-Probability Sampling
1. Accidental or convenience sampling
2. Purposive sampling
3. Quota sampling
4. Snowball or referral sampling
5. Systematic sampling

Systematic Sample
This is obtained by selecting
every nth name in a population

Convenience Sampling
Any group of individuals that
is conveniently available to be
studied

Purposive Sampling
Consists of individuals who have special qualifications of some sort or are deemed representative on the basis of prior evidence

Quota Sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means that the samplers can place demands on whom they want to sample (targeting).

Snowball Sampling
Snowball sampling is a technique for developing a research sample in which existing study subjects recruit future subjects from among their acquaintances. Thus the sample group appears to grow like a rolling snowball. As the sample builds up, enough data are gathered to be useful for research. This sampling technique is often used for hidden populations that are difficult for researchers to access; example populations would be drug users or prostitutes. Because sample members are not selected from a sampling frame, snowball samples are subject to numerous biases.

General Classification of Collecting Data
1. Census or complete enumeration – the process of gathering information from every unit in the population.
- It is not always possible to get timely, accurate, and economical data.
- It is costly if the number of units in the population is too large.
2. Survey sampling – the process of obtaining information from the units in the selected sample.
Advantages: reduced cost, greater speed, greater scope, and greater accuracy.

Sample Size
Samples should be as large as a researcher can obtain with a reasonable expenditure of time and energy.
As suggested, the minimum number of subjects is 100 for a descriptive study, 50 for a correlational study, and 30 in each group for experimental and causal-comparative designs.
According to Padua, for p parameters the minimum n can be computed as n ≥ p(p + 3)/2; for example, if p = 4, the minimum n is 14.
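Padua's rule of thumb as stated above is easy to check (the function name is illustrative):

```python
def minimum_sample_size(p):
    """Padua's rule of thumb from the slide: n >= p(p + 3) / 2 for p parameters."""
    return p * (p + 3) // 2

print(minimum_sample_size(4))  # 14, matching the worked example
```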

Inferential Statistics
These are formalized techniques used to make conclusions about populations based on samples taken from those populations.

Hypothesis
Hypothesis is defined as the tentative theory or
supposition provisionally adopted to explain certain facts
and to guide in the investigation of others.
A statistical hypothesis is an assertion or statement that
may or may not be true concerning one or more
population.
Example:
1. A leading drug in the treatment of hypertension has an advertised therapeutic success rate of 83%. A medical researcher believes he has found a new drug for treating hypertensive patients that has a higher therapeutic success rate than the leading drug, with fewer side effects.

The Statistical Hypothesis:
H0: The new drug is no better than the old one (p = 0.83)
H1: The new drug is better than the old one (p > 0.83)
Example 2. A social researcher is conducting a study to determine if the level of women's participation in community extension programs of the barangay can be affected by their educational attainment, occupation, income, civil status, and age.

H0: The level of women's participation in community extension programs is not affected by their educational attainment, occupation, income, civil status, and age.

H1: The level of women's participation in community extension programs is affected by their educational attainment, occupation, income, civil status, and age.
Example 3: A community organizer wants to compare three community organizing strategies applied to cultural minorities in terms of effectiveness.

A. Hypothesis Testing
Steps in Hypothesis Testing
1. Formulate the null hypothesis and the alternative hypothesis
- these are the statistical hypotheses, which are assumptions or guesses about the populations involved. In short, they are statements about the probability distributions of the populations.

Null Hypothesis
This is a hypothesis of “ no effect “.
It is usually formulated for the express
purpose of being rejected, that is, it is the
negation of the point one is trying to
make.
This is the hypothesis that two or more
variables are not related or that two or
more statistics are not significantly
different.

Alternative Hypothesis
This is the operational statement of the researcher's hypothesis.
It is the hypothesis derived from the theory of the investigator, and it generally states a specified relationship between two or more variables, or that two or more statistics significantly differ.

Two Ways of Stating the
Alternative Hypothesis
1. Predictive - specifies the type of relationship
existing between two or more variables (direct or
indirect) or specifies the direction of the difference
between two or more statistics
2. Non- Predictive - does not specify the type of
relationship or the direction of the difference

C. LEVEL OF SIGNIFICANCE (α)
α is the maximum probability with which we would be willing to risk a Type I error (the hypothesis being inappropriately rejected).
The error of rejecting a null hypothesis when it
is actually true. Plainly speaking, it occurs
when we are observing a difference when in
truth there is none, thus indicating a test of
poor specificity. An example of this would be if
a test shows that a woman is pregnant when in
reality she is not.

In other words, the level of significance determines
the risk a researcher would be willing to take in his
test.
The choice of alpha is primarily dependent on the
practical application of the result of the study.

Examples of α
.05 (95% confident of the claim)
.01 (99% confident of the claim)
But take note: α is not always .05 or .01. It can also be computed mathematically from Chebyshev's sample-size formula, in which the variance, the number of samples, and the allowable difference are predetermined.

D. Defining a Region of Rejection
The region of rejection is a region of
the null sampling distribution. It
consists of a set of possible values which
are so extreme that when the null
hypothesis is true the probability is
small (i.e. equal to alpha) that the
sample we observe will yield a value
which is among them.

E. Collect the data and compute the value of the test statistic.
F. State your decision.
G. State your conclusion.
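The steps above can be sketched end-to-end for the drug example (H0: p = 0.83 vs H1: p > 0.83) using a one-sample z test for a proportion; the trial counts, n = 200 with 178 successes, are hypothetical:

```python
import math

p0 = 0.83                # advertised success rate under the null hypothesis
n, successes = 200, 178  # hypothetical trial results for the new drug
alpha = 0.05             # chosen level of significance

# Compute the test statistic for the sample proportion.
p_hat = successes / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# One-tailed p-value from the standard normal CDF
# (the region of rejection is the upper tail, since H1 is p > 0.83).
p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Decision, then conclusion.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(round(z, 2), round(p_value, 3), decision)  # 2.26 0.012 reject H0
```

Because the p-value falls below α, this hypothetical sample would lead us to conclude the new drug's success rate exceeds 83%.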

B. Choose an Appropriate Statistical Test for
testing the Null Hypothesis
The choice of a statistical test for the analysis
of your data requires careful and deliberate
judgment.
PRIMARY CONSIDERATIONS:
The choice of a statistical test is dictated by
the questions for which the research is
designed
The level, the distribution , and dispersion of
data also suggest the type of statistical test to
be used

SECONDARY CONSIDERATIONS
The extent of your knowledge in
statistics
Availability of resources in
connection with the computation
and interpretation of data

Choice of Statistical Tests
This is designed to help you
develop a framework for choosing
the correct statistic to test your
hypothesis.
It begins with a set of questions
you should ask when selecting your
test.
It is followed by demonstrations of
the factors that are important to
consider when choosing your
statistic.

Choice of Statistical Tests
Presented below are four
questions you should ask and
answer when trying to determine
which statistical procedure is most
appropriate to test your
hypothesis.

Choice of Statistical Tests
What are the independent and
dependent variables?
What is the scale of measurement of
the study variables?
How many samples/groups are in
the design?
Have I met the assumptions of the
statistical test selected?

Choice of Statistical Tests
To determine which test should be
used in any given circumstance, we
need to consider the hypothesis that
is being tested, the independent and
dependent variables and their scale of
measurement, the study design, and
the assumptions of the test.

Defining Independent and Dependent
Variables
Before we can begin to choose our
statistical test, we must determine
which is the independent and which is
the dependent variable in our
hypothesis.
Our dependent variable is always the
phenomenon or behavior that we want
to explain or predict.

Defining Independent and Dependent
Variables
The independent variable represents a
predictor or causal variable in the
study.
In any antecedent-consequent
relationship, the antecedent is the
independent variable and the
consequent is the dependent variable.

Defining Independent and Dependent
Variables
With single samples and one dependent
variable, the one-sample Z test, the one-
sample t test, and the chi-square goodness-of-
fit test are the only statistics that can be used.
Students sometimes ask, "but don't you have
population data too, so you have two sets of
data?" Yes and no.
Data have to exist, or else the population parameters could not be defined. But the researcher does not collect these data; they already exist.

Defining Independent and Dependent
Variables
So, if you are collecting data on one sample
and comparing those data to information
that has already been gathered and is
published, then you are conducting a one-
sample test using the one sample/set of
data collected in this study.
For the chi-square goodness-of-fit test, you can also compare the sample against chance probabilities.

Defining Independent and Dependent
Variables
When we have a single sample and
independent and dependent variables
measured on all subjects, we typically are
testing a hypothesis about the association
between two variables. The statistics that we
have learned to test hypotheses about
association include:
chi-square test of independence
Spearman's rs
Pearson's r
bivariate regression and multiple regression

Multiple Sample Tests
Studies that refer to repeated measurements or
pairs of subjects typically collect at least two sets
of scores. Studies that refer to specific subgroups
in the population also collect two or more samples
of data. Once you have determined that the
design uses two or more samples or "groups", then
you must determine how many samples or groups
are in the design. Studies that are limited to two
groups use either the chi-square statistic, Mann-
Whitney U, Wilcoxon test, independent means t
test, or the dependent means t test.

If you have three or more groups in the design, you can use the chi-square statistic, Kruskal-Wallis H Test, Friedman ANOVA for ranks, One-way Between-Groups ANOVA, or Factorial ANOVA, depending on the nature of the relationship between groups. Some of these tests are designed for dependent or correlated samples/groups and some are designed for samples/groups that are completely independent.

Multiple Sample Tests
Dependent Means
Dependent groups refer to some type of
association or link in the research design
between sets of scores. This usually occurs
in one of three conditions -- repeated
measures, linked selection, or matching.
Repeated measures designs collect data on
subjects using the same measure on at least
two occasions. This often occurs before and
after a treatment or when the same research
subjects are exposed to two different
experimental conditions.

Multiple Sample Tests
When subjects are selected into the study because of
natural "links or associations", we want to analyze the
data together. This would occur in studies of parent-
infant interaction, romantic partners, siblings, or best
friends. In a study of parents and their children, a
parent’s data should be associated with his son's, not
some other child's. Subject matching also produces
dependent data. Suppose that an investigator wanted
to control for socioeconomic differences in research
subjects. She might measure socioeconomic status
and then match on that variable. The scores on the
dependent variable would then be treated as a pair in
the statistical test.

All statistical procedures for dependent or
correlated groups treat the data as linked,
therefore it is very important that you
correctly identify dependent groups
designs. The statistics that can be used for
correlated groups are the McNemar Test
(two samples or times of measurement),
Wilcoxon t Test (two samples), Dependent
Means t Test (two samples), Friedman
ANOVA for Ranks (three or more samples),
Simple Repeated Measures ANOVA (three
or more samples).

Independent Means
When there is no subject overlap across groups, we define
the groups as independent. Tests of gender differences are
a good example of independent groups. We cannot be
both male and female at the same time; the groups are
completely independent. If you want to determine
whether samples are independent or not, ask yourself,
"Can a person be in one group at the same time he or she
is in another?" If the answer is no (can't be in a remedial
education program and a regular classroom at the same
time; can't be a freshman in high school and a sophomore
in high school at the same time), then the groups are
independent.

The statistics that can be used for
independent groups include the chi-
square test of independence (two or
more groups), Mann-Whitney U Test
(two groups), Independent Means t
test (two groups), One-Way Between-
Groups ANOVA (three or more
groups), and Factorial ANOVA (two or
more independent variables).

Scales of Measurements
Once we have identified the independent
and dependent variables, our next step in
choosing a statistical test is to identify the
scale of measurement of the variables.
All of the parametric tests that we have
learned to date require an interval or ratio
scale of measurement for the dependent
variable.

Scales of Measurements
If you are working with a dependent
variable that has a nominal or ordinal
scale of measurement, then you must
choose a nonparametric statistic to
test your hypothesis

How many Samples / Groups are in the
Design
Once you have identified the scale of
measurement of the dependent variable,
you want to determine how many samples
or "groups" are in the study design.
Designs for which one-sample tests (e.g., Z test; t test; Pearson and Spearman correlations; chi-square goodness-of-fit) are appropriate collect only one set or "sample" of data.

How many Samples / Groups are in the
Design
There must be at least two sets of
scores or two "samples" for any
statistic that examines differences
between groups (e.g. , t test for
dependent means; t test for
independent means; one-way ANOVA;
Friedman ANOVA; chi-square test of
independence) .

Parametric Tests
Parametric statistics are used when our data are measured on interval or ratio scales of measurement.
They tend to need larger samples.
Data should fit a particular distribution; otherwise, the data are transformed into that particular distribution.
Samples are normally drawn randomly from the population.
They follow the assumption of normality – meaning the data are normally distributed.

Parametric Assumptions
Listed below are the most frequently
encountered assumptions for parametric tests.
Statistical procedures are available for testing
these assumptions.
The Kolmogorov-Smirnov Test is used to
determine how likely it is that a sample came
from a population that is normally distributed.
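A minimal sketch of the Kolmogorov-Smirnov idea: the statistic is the largest gap between the sample's empirical CDF and a normal CDF fitted to the data (critical values for the decision are omitted here):

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # Compare the fitted CDF against the ECDF just before and after each point.
        d = max(d, abs((i + 1) / n - cdf), abs(cdf - i / n))
    return d

# A heavily skewed sample strays further from normality than a symmetric one.
assert ks_statistic([1, 1, 1, 1, 10]) > ks_statistic([1, 2, 3, 4, 5])
```

In practice the statistic is compared against a critical value (or a p-value is computed) to decide whether the normality assumption is tenable.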

Parametric Assumptions
The Levene test is used to test the assumption of
equal variances.
If we violate test assumptions, the statistic chosen
cannot be applied. In this circumstance we have
two options:
We can use a data transformation
We can choose a nonparametric statistic
If data transformations are selected, the
transformation must correct the violated assumption.
If successful, the transformation is applied and the
parametric statistic is used for data analysis.

Types of Parametric Tests
Z test
One-sample t test
t test for dependent means
t test for independent means
One-way ANOVA
Factorial ANOVA
Pearson's r
Bivariate/multiple regression

Non-Parametric Tests
These are inference procedures that are largely distribution-free.
Nonparametric statistics are used when our data are measured on a nominal or ordinal scale of measurement.
All other nonparametric statistics are appropriate when data are measured on an ordinal scale of measurement.
An example is the sign test; such tests are designed to draw inferences about medians.

Types of Non-Parametric Tests
Sign tests
Chi-square statistics and their modifications (e.g., McNemar Test) are used for nominal data.
Wilcoxon Test – alternative to the t test among the parametric tests
Kruskal-Wallis Test – alternative to ANOVA
Friedman Test – alternative to ANOVA

Goodness of Fit Test
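A minimal sketch of the chi-square goodness-of-fit statistic, which compares observed frequencies with those expected under a hypothesized distribution; the die-roll counts here are hypothetical:

```python
observed = [18, 22, 16, 24, 20, 20]  # hypothetical rolls of a die, n = 120
expected = [sum(observed) / len(observed)] * len(observed)  # fair die: 20 per face

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 2.0, well below the df = 5, alpha = .05 critical value of 11.07
```

Since the statistic falls short of the critical value, this hypothetical sample gives no reason to doubt that the die is fair.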

Choosing the Correct Statistical
Tests
Summary
Five issues must be considered when
choosing statistical tests.
Scale of measurement
Number of samples/groups
Nature of the relationship between
groups
Number of variables
Assumptions of statistical tests
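The five issues above can be sketched as a deliberately simplified lookup that mirrors the tests named earlier in this deck; the function name and its arguments are illustrative, and real choices must also verify the tests' assumptions:

```python
def choose_test(scale, groups, dependent=False):
    """Very simplified chooser for difference tests, following the slides."""
    if groups >= 3:
        if scale == "nominal":
            return "chi-square test of independence"
        if scale == "ordinal":
            return "Friedman ANOVA for ranks" if dependent else "Kruskal-Wallis H test"
        return "repeated measures ANOVA" if dependent else "one-way between-groups ANOVA"
    # two groups
    if scale == "nominal":
        return "McNemar test" if dependent else "chi-square test of independence"
    if scale == "ordinal":
        return "Wilcoxon t test" if dependent else "Mann-Whitney U test"
    # interval/ratio data meet the parametric measurement requirement
    return "dependent means t test" if dependent else "independent means t test"

print(choose_test("interval", 2))       # independent means t test
print(choose_test("ordinal", 2, True))  # Wilcoxon t test
print(choose_test("interval", 3))       # one-way between-groups ANOVA
```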

Introduction to Multiple and Non-Linear Regression
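A minimal sketch of bivariate least-squares regression, the building block that multiple and non-linear regression extend; the data points below are hypothetical:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x: b = S_xy / S_xx, a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # these points lie exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # 1.0 2.0
```

Multiple regression generalizes this to several predictors at once, and non-linear regression fits curves rather than straight lines, but the least-squares principle is the same.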

Hands-On Statistical Software

Thank you very much!
Hope you are now
ready to conduct your
study