00 - Lecture - 02_MVA - Major Statistical Techniques.pdf

JayantChandrapal 78 views 52 slides Jun 23, 2024

About This Presentation

An overview of multivariate statistics, useful for understanding its importance in business research.


Slide Content

Dr. J D Chandrapal
MBA (Marketing), PGDHRM, Ph.D, CII (Award) London

Development Officer - LIC of India –Ahmedabad - 9825070933

Major Statistical Techniques in Data Analysis
One Variable: Univariate Analysis
•Summary Statistics: Central Tendency, Measure of Dispersion
•Frequency Distribution: Tables, Graphs, Charts
Two Variables: Bivariate Analysis
•Analyze relationships: Scatter plots, Correlation Coefficient, Regression Analysis
•Measuring Difference: t test, One way ANOVA
Three or More Variables: Multivariate Analysis
•Dependence Technique: Discriminant Analysis, Canonical Correlation, Regression, MANOVA
•Interdependence Tech.: Factor Analysis, Cluster Analysis, Multidimensional Scaling

One Variable: Univariate Analysis

Univariate Data Analysis
•One variable is analyzed at a time.
•It doesn't deal with causes or relationships.
•It describes patterns: summary statistics.
•It displays data: frequency distributions, charts.
•It is the simplest form of analyzing data.

Classification of Univariate Technique
Metric Data (Interval & Ratio)
•One Sample: t Test, z Test
•Two/More Samples, Independent: Two-Group t Test, z Test, One-way ANOVA
•Two/More Samples, Related: Paired t Test
Non-Metric Data (Nominal & Ordinal)
•One Sample: Frequency, Chi-Square, K-S & Binomial
•Two/More Samples, Independent: Chi-Square, Mann-Whitney, Median, K-S, K-W ANOVA
•Two/More Samples, Related: Sign, Wilcoxon, McNemar, Chi-Square

Descriptive Statistics
Descriptive statistics refers to the transformation of raw data into a form that makes them easy to understand and interpret: rearranging, ordering, and manipulating data to provide descriptive information.
•Describing responses or observations is typically the first form of analysis.
•Calculating averages, frequency distributions, and cross tabulations are the most common ways of summarizing data.
•Tabulation refers to the orderly arrangement of data in a table or other summary format.
•The three types of tabulation are simple tabulation, frequency tabulation, and contingency tabulation.

Frequency Distribution
A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within a given interval.
•To describe situations, draw conclusions, or make inferences about events, the researcher must organize data in some meaningful way.
•A frequency distribution is the organization of raw data in table form, using classes and frequencies. It can also be presented in the form of a histogram or a bar chart.
•A frequency distribution provides a visual representation of the distribution of a particular variable. It displays the frequency of various outcomes in a sample.
•The three types of frequency distribution are the categorical frequency distribution, the grouped frequency distribution, and the cumulative frequency distribution.

Example of Categorical Frequency Distribution
Twenty-five army inductees were given a blood test to determine their blood type. Construct a frequency distribution for the data. The data set is
A    B    B    AB   O
O    O    B    AB   B
B    B    O    A    O
A    O    O    O    AB
AB   A    O    B    A
Class   Tally        Frequency   Percent
A       IIIII        5           20
B       IIIII II     7           28
O       IIIII IIII   9           36
AB      IIII         4           16
Total                25          100
Percent = (f / n) * 100, where n = Σf
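
The tabulation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `categorical_frequency` is an assumption made here, not something from the slides.

```python
from collections import Counter

# Blood types of the 25 inductees, in the order listed on the slide.
blood_types = (
    "A B B AB O "
    "O O B AB B "
    "B B O A O "
    "A O O O AB "
    "AB A O B A"
).split()


def categorical_frequency(values):
    """Return {class: (frequency, percent)} with percent = f / n * 100."""
    n = len(values)
    counts = Counter(values)
    return {cls: (f, 100 * f / n) for cls, f in counts.items()}


table = categorical_frequency(blood_types)
for cls in ("A", "B", "O", "AB"):
    f, pct = table[cls]
    print(f"{cls:<4}{f:>3}{pct:>6.0f}%")
```

Counting each class and dividing by n = Σf reproduces the Frequency and Percent columns of the table.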

Example of Grouped Frequency Distribution
The data are the ages of the top 50 wealthiest people in the world. Organize them into a frequency distribution with 8 classes.
49 57 38 73 81 74 59 76 65 69 54 56 69 68 78 65 85 49 69 61 48 81 68 37 43 78 82 43 64 67 52 56 81 77 79 85 40 85 59 80 60 71 57 61 69 61 83 90 87 74
Class limits   Class boundaries   Tally         Frequency
35–41          34.5-41.5          III           3
42–48          41.5-48.5          III           3
49–55          48.5-55.5          IIII          4
56–62          55.5-62.5          IIIII IIIII   10
63–69          62.5-69.5          IIIII IIIII   10
70–76          69.5-76.5          IIIII         5
77–83          76.5-83.5          IIIII IIIII   10
84–90          83.5-90.5          IIIII         5
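
A minimal sketch of how the grouped table could be built programmatically. The helper name `grouped_frequency` and its parameters are assumptions for illustration; the class width of 7 and the lowest limit of 35 are read off the table above.

```python
ages = [
    49, 57, 38, 73, 81, 74, 59, 76, 65, 69, 54, 56, 69, 68, 78, 65, 85,
    49, 69, 61, 48, 81, 68, 37, 43, 78, 82, 43, 64, 67, 52, 56, 81, 77,
    79, 85, 40, 85, 59, 80, 60, 71, 57, 61, 69, 61, 83, 90, 87, 74,
]


def grouped_frequency(values, lowest, width, n_classes):
    """Return rows of (lower limit, upper limit, lower boundary, upper boundary, frequency)."""
    rows = []
    for k in range(n_classes):
        lower = lowest + k * width
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        # Boundaries sit half a unit outside the integer class limits.
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq))
    return rows


table = grouped_frequency(ages, lowest=35, width=7, n_classes=8)
for lower, upper, lb, ub, freq in table:
    print(f"{lower}-{upper}  {lb}-{ub}  {freq}")
```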

Example of Cumulative Frequency Distribution
Cumulative frequencies show how many data values are accumulated up to and including a specific class. The values are found by adding the frequencies of the classes less than or equal to the upper class boundary of a specific class. This gives an ascending cumulative frequency.
Class boundary    Cumulative frequency
Less than 99.5    0
Less than 104.5   2
Less than 109.5   10
Less than 114.5   28
Less than 119.5   41
Less than 124.5   48
Less than 129.5   49
Less than 134.5   50
In this example, 28 of the record high temperatures are less than or equal to 114°F, and 48 are less than or equal to 124°F.
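
The running totals can be sketched with `itertools.accumulate`. The class frequencies below do not appear on the slide; they are derived here as the differences between consecutive rows of the cumulative table, so treat them as an assumption.

```python
from itertools import accumulate

# Upper class boundaries from the cumulative table.
upper_boundaries = [104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]

# Class frequencies implied by the table (differences of consecutive rows);
# derived for this sketch, not stated on the slide.
class_frequencies = [2, 8, 18, 13, 7, 1, 1]

# Each cumulative value adds the current class to all classes before it.
cumulative = list(accumulate(class_frequencies))

for boundary, total in zip(upper_boundaries, cumulative):
    print(f"Less than {boundary}: {total}")
```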

Visual Representation
It is easier for most people to comprehend the meaning of data presented graphically than data presented numerically in tables or frequency distributions. This is especially true if the users have little or no statistical knowledge.
•Statistical graphs can be used to describe the data set or to analyze it.
•They can be used to discuss an issue, reinforce a critical point, or summarize a data set; they can also be used to discover a trend or pattern in a situation over a period of time.
•The three most commonly used graphs in research are
1. The histogram
2. The frequency polygon
3. The cumulative frequency graph, or ogive (pronounced o-jive).

Histogram
The histogram is a graph that displays the data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes.
[Figure: Histogram of the age-group-wise number of wealthiest people. X-axis: age class boundaries (34.5-41.5, 41.5-48.5, 48.5-55.5, 55.5-62.5, 62.5-69.5, 69.5-76.5, 76.5-83.5, 83.5-90.5); y-axis: number of wealthiest people.]

Frequency Polygon
The frequency polygon is a graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.
[Figure: Frequency polygon of the age-group-wise number of wealthiest people. X-axis: class midpoints; y-axis: frequency.]
Class midpoint   38   45   52   59   66   73   80   87
Frequency        3    3    4    10   10   5    10   5
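
The x-coordinates of the polygon are the class midpoints, each being the average of a class's lower and upper boundary. A small sketch, using the boundaries and frequencies from the wealthiest-people example (the variable names are illustrative assumptions):

```python
# Class boundaries and frequencies from the grouped age example.
boundaries = [
    (34.5, 41.5), (41.5, 48.5), (48.5, 55.5), (55.5, 62.5),
    (62.5, 69.5), (69.5, 76.5), (76.5, 83.5), (83.5, 90.5),
]
frequencies = [3, 3, 4, 10, 10, 5, 10, 5]

# Midpoint of a class = (lower boundary + upper boundary) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in boundaries]

# The polygon connects these (midpoint, frequency) points with lines.
points = list(zip(midpoints, frequencies))
print(points)
```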

Ogive: Cumulative Frequency f(c)
The cumulative frequency f(c) is the sum of the frequencies accumulated up to the upper boundary of a class in the distribution. The ogive is a graph that represents f(c) for the classes in a frequency distribution.
[Figure: Ogive of record high temperatures. X-axis: temperature in °F (less than 99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5); y-axis: cumulative frequency.]

Distribution Shapes
A distribution can have many shapes. Several of the most common shapes are
•Bell shaped
•Uniform
•J shaped
•Reverse J shaped
•Right skewed
•Left skewed
•Bimodal
•U shaped

Other Types of Graphs
In addition to the histogram, the frequency polygon, and the ogive, several other types of graphs are often used in statistics. They are the bar graph, Pareto chart, time series graph, and pie graph.
•Bar graph: when the data are qualitative or categorical, bar graphs can be used to represent the data by using vertical or horizontal bars whose heights or lengths represent the frequencies of the data.
•Pareto chart: used to represent a frequency distribution for a categorical variable; the frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.
•Time series graph: represents data that occur over a specific period of time.
•Pie graph: the purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the sizes of the sections. Percentages or proportions can be used. The variable is nominal or categorical.

Bar, Pareto, Time Series, and Pie Graph
[Figure: Four example graphs. Bar graph and pie graph: number of employees in a LIC branch by category (HGA, DO, AO/AAO, Asst). Time series graph: Qtr 1 to Qtr 4, 2016 and 2017. Pareto chart: city-wise temperature in the month of May (Nagpur, Jaipur, Ahmedabad, Banglore, Shrinagar).]

Stem and Leaf Plot
The stem and leaf plot is a method of organizing data and is a combination of sorting and graphing. It has the advantage over a grouped frequency distribution of retaining the actual data while showing them in graphical form.
•A stem and leaf plot is a data plot that uses part of the data value as the stem and part of the data value as the leaf to form groups or classes.
Data: 25 31 20 32 13 14 43 02 57 23 36 32 33 32 44 32 52 44 51 45
Step 1: Arrange the data in order: 02, 13, 14, 20, 23, 25, 31, 32, 32, 32, 32, 33, 36, 43, 44, 44, 45, 51, 52, 57
Step 2: Make the display by using the leading digit as the stem and the trailing digit as the leaf. For example, for the value 32, the leading digit, 3, is the stem and the trailing digit, 2, is the leaf.
Leading digit (stem)   Trailing digit (leaf)
0                      2
1                      3 4
2                      0 3 5
3                      1 2 2 2 2 3 6
4                      3 4 4 5
5                      1 2 7
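
The two steps can be sketched in Python as follows. This assumes two-digit values, so the stem is the tens digit and the leaf is the units digit; the function name `stem_and_leaf` is chosen here for illustration.

```python
from collections import defaultdict

data = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23, 36, 32, 33,
        32, 44, 32, 52, 44, 51, 45]


def stem_and_leaf(values):
    """Group values by leading digit (stem); leaves are the sorted trailing digits."""
    plot = defaultdict(list)
    for v in sorted(values):          # Step 1: arrange the data in order
        stem, leaf = divmod(v, 10)    # Step 2: split into stem and leaf
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```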

Descriptive Statistics Associated with Frequency Distribution
A frequency table is easy to read and provides basic information, but sometimes this information may be too detailed and the researcher must summarize it by the use of descriptive statistics.
•Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of it.
•Descriptive statistics are broken down into measures of central tendency and measures of variability, or spread.
•Inferential statistics are a function of the sample data that assists you in drawing inferences and making predictions regarding a hypothesis about a population parameter.
•Classic inferential statistics include z, t, χ², F-ratio, etc.

Measures of Central Tendency
A measure of central tendency (also referred to as a measure of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution. There are three main measures of central tendency: the mode, the median and the mean. Each of these measures describes a different indication of the typical or central value in the distribution.
Measure    Definition                                         Symbol(s)
Mean       Sum of values, divided by total number of values   μ, x̄
Median     Middle point in data set that has been ordered     MD
Mode       Most frequent data value                           Z
Midrange   Lowest value plus highest value, divided by 2      MR

The Mean: an Example
The data represent the number of miles run during one week for a sample of 20 runners.
Class       Frequency (f)   Midpoint (Xm)   f * Xm
5.5-10.5    1               8               8
10.5-15.5   2               13              26
15.5-20.5   3               18              54
20.5-25.5   5               23              115
25.5-30.5   4               28              112
30.5-35.5   3               33              99
35.5-40.5   2               38              76
Total       n = 20                          Σ f * Xm = 490
Mean = Σ f * Xm / n = 490 / 20 = 24.5 miles
1. The mean varies less than the median or mode when samples are taken from the same population.
2. It is used in computing other statistics, such as the variance.
3. The mean for a data set is unique and not necessarily one of the data values.
4. The mean cannot be computed for a distribution with an open-ended class.
5. It is affected by extremely high or low values, called outliers, and may not be the appropriate average to use in these situations.
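
The grouped-mean calculation in the table can be sketched as below. The function name `grouped_mean` is an illustrative assumption; the class boundaries and frequencies are those of the runners example.

```python
# (lower boundary, upper boundary, frequency) for each class of miles run.
classes = [
    (5.5, 10.5, 1),
    (10.5, 15.5, 2),
    (15.5, 20.5, 3),
    (20.5, 25.5, 5),
    (25.5, 30.5, 4),
    (30.5, 35.5, 3),
    (35.5, 40.5, 2),
]


def grouped_mean(rows):
    """Mean of grouped data: sum of f * Xm divided by n, with Xm the class midpoint."""
    n = sum(f for _, _, f in rows)
    total = sum(f * (lower + upper) / 2 for lower, upper, f in rows)
    return total / n


print(grouped_mean(classes))  # mean miles run per runner
```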

The Median: an Example
Determine the median from the following marks in Marketing Research: 31, 29, 27, 33, 35, 41, 39, 41, 43, 45, 47.
Arrange the data in ascending or descending order: 27, 29, 31, 33, 35, 39, 41, 41, 43, 45, 47.
Position of the median = (n + 1) / 2 = (11 + 1) / 2 = 6th value, so M = 39.
1. The median is used to find the centre or middle value of a data set.
2. The median is used when it is necessary to find out whether the data values fall into the upper half or lower half of the distribution.
3. The median is used for an open-ended distribution.
4. The median is less affected by outliers and skewed data than the mean, and is usually preferred when the distribution is not symmetrical.
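
A minimal sketch of the same procedure. The slide only covers an odd number of values; averaging the two middle values for an even count is the usual convention and is an assumption added here.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Position (n + 1) / 2 in 1-based counting is index n // 2 in 0-based counting.
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


marks = [31, 29, 27, 33, 35, 41, 39, 41, 43, 45, 47]
print(median(marks))
```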

The Mode: an Example
The mode is the number that occurs most often in a set of data.
Data set: 3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
In order, these numbers are: 3, 5, 7, 12, 13, 14, 20, 23, 23, 23, 23, 29, 39, 40, 56
The number 23 appears most often, so in this case the mode is 23.
1. The mode is used when the most typical case is desired. The mode is the easiest average to compute.
2. It can be used when the data are nominal, such as religious preference or gender.
3. The mode is the value that occurs with the greatest frequency in a data set. If one value occurs with the greatest frequency, the data set is said to be unimodal; if two values do, it is said to be bimodal; and if more than two values do, it is said to be multimodal.
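
A short sketch that returns every value tied for the greatest frequency, so unimodal, bimodal, and multimodal data sets are all handled. The function name `modes` is an assumption for illustration.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of values that occur with the greatest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


data = [3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29]
print(modes(data))             # one mode: unimodal
print(modes([1, 1, 2, 2, 3]))  # two modes: bimodal
```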

Measures of Variances
In statistics, to describe the data set accurately, statisticians must know more than the measures of central tendency. A measure of variability is a statistic that indicates the distribution's dispersion.
Example: Set A: 10, 10, 11, 12, 12; Set B: 2, 4, 11, 18, 20. Both have a mean of 11.
[Figure: Dot plots of Set A and Set B.]
1. An average is an attempt to summarize a set of data using just one number. An average taken by itself may not always be very meaningful. We need a statistical cross-reference that measures the spread of the data.
2. The measures of variability can be calculated on interval or ratio data.
3. For the spread or variability of a data set, three measures are commonly used: range, variance, and standard deviation.

Range
The range is the highest value minus the lowest value. The symbol R is used for the range.
R = highest value - lowest value
Example: Set A: 10, 10, 11, 12, 12; Set B: 2, 4, 11, 18, 20
Both sets have a mean of 11, but their ranges are different.
For Set A, Range = 12 - 10 = 2
For Set B, Range = 20 - 2 = 18
The range does not tell how much the other values vary from one another or from the mean.

Measuring Spread: Deriving Formulas
•One way to think about "spread" is to examine how far each data value is from the mean. This difference is called the "deviation" and is represented by (x - x̄).
•When the deviations are added up, they cancel each other out, giving a sum of zero, which is not useful. Therefore, to prevent the deviations from cancelling out, they are squared: (x - x̄)².
•To average the squared deviations, first add them all up, which gives the sum of squares, represented by Σ(x - x̄)², and then divide by "n". However, to get a conservative estimate for a sample, the sum should be divided by "n – 1".
•This is how the spread is measured.
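
The derivation can be checked numerically. The sketch below uses Set B from the range example (2, 4, 11, 18, 20); the function name `spread` and the dictionary keys are assumptions made for illustration.

```python
def spread(values):
    """Walk through the derivation: deviations, their sum, sum of squares, and both averages."""
    n = len(values)
    x_bar = sum(values) / n
    deviations = [x - x_bar for x in values]
    sum_of_squares = sum(d ** 2 for d in deviations)
    return {
        "sum_of_deviations": sum(deviations),             # cancels to zero
        "sum_of_squares": sum_of_squares,                 # Σ(x - x̄)²
        "divide_by_n": sum_of_squares / n,                # averaging over n
        "divide_by_n_minus_1": sum_of_squares / (n - 1),  # sample estimate
    }


set_b = [2, 4, 11, 18, 20]
result = spread(set_b)
print(result)
```

Dividing by n - 1 gives the larger of the two averages, which is the sense in which the slide calls it a conservative estimate for a sample.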

Measuring Spread: Variance (S²)
Variance is a measure of how spread out a data set is. It is calculated as the average squared deviation of each number from the mean of a data set.
Table 1. Auto Sales in $
Sales (x)   (x - x̄)   (x - x̄)²
11.2        -1.4       1.96
11.9        -0.7       0.49
12.0        -0.6       0.36
12.8        0.2        0.04
13.4        0.8        0.64
14.3        1.7        2.89
Total 75.6             6.38
Here x̄ = 75.6 / 6 = 12.6.

Variance (S²)
Variance is a measure of how spread out a data set is. It is calculated as the average squared deviation of each number from the mean of a data set.
A survey was done on sleeping hours of students in America. A sample of 10 students was found to be (in hours) as follows: 7, 6, 8, 4, 2, 7, 6, 7, 6, 5
x̄ = Σx / n = 58 / 10 = 5.8 hours
A table is constructed for the calculation of variance.
x    (x - x̄)            (x - x̄)²
7    7 - 5.8 = 1.2      1.44
6    6 - 5.8 = 0.2      0.04
8    8 - 5.8 = 2.2      4.84
4    4 - 5.8 = -1.8     3.24
2    2 - 5.8 = -3.8     14.44
7    7 - 5.8 = 1.2      1.44
6    6 - 5.8 = 0.2      0.04
7    7 - 5.8 = 1.2      1.44
6    6 - 5.8 = 0.2      0.04
5    5 - 5.8 = -0.8     0.64
Total 58                27.60
S² = Σ(x - x̄)² / (n - 1) = 27.60 / 9 = 3.067 hours²
Standard Deviation (S)
Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value.
1. A low standard deviation means that most of the numbers are very close to the average.
2. A high standard deviation means that the numbers are spread out.
3. Standard deviation can be calculated by taking the square root of the variance, which itself is the average of the squared differences from the mean.
Formula: S = √(Σ(x - x̄)² / (n - 1)), or S = √(S²). For the sleeping-hours sample, S = √3.067 = 1.7513 hours.
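
As a cross-check, the sleeping-hours figures can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) form. This is a sketch for verification, not part of the original slides.

```python
from statistics import mean, stdev, variance

# Sleeping hours of the 10 sampled students.
sleep_hours = [7, 6, 8, 4, 2, 7, 6, 7, 6, 5]

x_bar = mean(sleep_hours)    # Σx / n
s2 = variance(sleep_hours)   # Σ(x - x̄)² / (n - 1)
s = stdev(sleep_hours)       # square root of the sample variance

print(f"mean = {x_bar:.1f}, variance = {s2:.3f}, standard deviation = {s:.4f}")
```

The slide's value of 1.7513 comes from taking the square root of the already rounded variance 3.067; computed without intermediate rounding, the standard deviation is about 1.7512.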

Measures of Position
In addition to measures of central tendency and measures of variation, there are measures of position or location. These measures include standard (z) scores, percentiles, deciles, and quartiles.
•They are used to locate the relative position of a data value in the data set.
•For example, if a value is located at the 80th percentile, it means that 80% of the values fall below it in the distribution and 20% of the values fall above it.
•The median is the value that corresponds to the 50th percentile, since one-half of the values fall below it and one-half of the values fall above it.

Standard (z) Scores
The z-score is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.
•The standard score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.
•Observed values above the mean have positive standard scores, while values below the mean have negative standard scores.
•The z score represents the number of standard deviations that a data value falls above or below the mean.
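
A brief sketch of both uses, assuming the standard definition z = (x - mean) / standard deviation. The two test scores, their means, and their standard deviations are hypothetical numbers invented for this illustration; they do not come from the slides.

```python
from statistics import NormalDist


def z_score(x, mean, std_dev):
    """Signed number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# (b) Compare two scores from different distributions (hypothetical values).
z_test_a = z_score(65, mean=50, std_dev=10)  # test A
z_test_b = z_score(30, mean=25, std_dev=5)   # test B
print(z_test_a, z_test_b)  # the higher z is the better relative position

# (a) Probability of a score at or below z in a standard normal distribution.
p_below = NormalDist().cdf(z_test_a)
print(f"P(Z <= {z_test_a}) = {p_below:.4f}")
```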

Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group. Percentiles divide the data set into 100 equal groups. Percentiles are not the same as percentages.
•If a student gets 72 correct answers out of 100, he/she obtains a percentage score of 72.
•There is no indication of his/her position with respect to the rest of the class. He/she could have scored the highest, the lowest, or somewhere in between.
•On the other hand, if a raw score of 72 corresponds to the 64th percentile, then he/she did better than 64% of the students in the class.

Quartiles
Quartiles divide the distribution into four groups, separated by Q1, Q2, Q3. Note that Q1 is the same as the 25th percentile; Q2 is the same as the 50th percentile, or the median; Q3 corresponds to the 75th percentile.
•Finding data values corresponding to Q1, Q2, and Q3
Step 1: Arrange the data in order from lowest to highest.
Step 2: Find the median of the data values. This is the value for Q2.
Step 3: Find the median of the data values that fall below Q2. This is the value for Q1.
Step 4: Find the median of the data values that fall above Q2. This is the value for Q3.
•In addition to dividing the data set into four groups, quartiles can be used as a rough measurement of variability.
•The interquartile range (IQR) is defined as the difference between Q1 and Q3 and is the range of the middle 50% of the data.
•The interquartile range is used to identify outliers, and it is also used as a measure of variability in exploratory data analysis.
•An outlier is an extremely high or an extremely low data value.
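
Steps 1 to 4 can be sketched as follows, applied to the marks from the median example. One assumption is made here: "values that fall below/above Q2" is read by position, so for an odd number of values the median itself is left out of both halves. Other quartile conventions exist and can give slightly different results.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from Steps 1 to 4."""
    ordered = sorted(values)                  # Step 1
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                    # positions below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)  # Steps 3, 2, 4


marks = [31, 29, 27, 33, 35, 41, 39, 41, 43, 45, 47]
q1, q2, q3 = quartiles(marks)
iqr = q3 - q1  # range of the middle 50% of the data
print(q1, q2, q3, iqr)
```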

Deciles
Deciles divide the distribution into 10 groups. They are denoted by D1, D2, etc.
•Note that D1 corresponds to P10; D2 corresponds to P20; etc.
•Deciles can be found by using the formulas given for percentiles.
•Deciles are denoted by D1, D2, D3, ..., D9, and they correspond to P10, P20, P30, ..., P90.
•Quartiles are denoted by Q1, Q2, Q3, and they correspond to P25, P50, P75.
•The median is the same as P50 or Q2 or D5.
Summary of Position Measures
Measure         Definition                                                                  Symbol
Std (z) score   No. of std. deviations that a data value is above or below the mean         z
Percentile      Position in hundredths that a data value holds in the distribution          Pn
Decile          Position in tenths that a data value holds in the distribution              Dn
Quartile        Position in fourths that a data value holds in the distribution             Qn

Cross Tabulation
Cross tabulation is organizing and analysing data by groups, categories, or classes to facilitate comparisons; it is a joint frequency distribution of observations on two or more sets of variables.
•Upon the construction of the one-way frequency table, the next logical step in preliminary data analysis is to perform cross-tabulation.
•Cross-tabulation is extremely useful when the analyst wishes to study relationships among and between variables.
•The purpose of cross-tabulation is to determine whether certain variables differ when compared among various subgroups of the total sample.
•In fact, cross-tabulation is normally the main form of data analysis in most marketing research projects.
•Two key elements of cross-tabulation are how to develop the cross-tabulation and how to interpret the outcome.
Cross Tabulation
Student   Career Seen as Prestigious   Gender
1         Doctor                       F
2         Scientist                    M
3         Military Officer             M
4         Military Officer             F
5         Doctor                       M
6         Scientist                    F
7         Military Officer             M
8         Athlete                      F
9         Doctor                       F
10        Scientist                    M
11        Doctor                       F
12        Military Officer             M
13        Scientist                    F
14        Lawyer                       F
15        Lawyer                       F
16        Military Officer             M
17        Doctor                       F
18        Scientist                    M
19        Doctor                       F
20        Lawyer                       M
Career             Female   Male   Total
Doctor             5        1      6
Scientist          2        3      5
Military Officer   1        4      5
Lawyer             2        1      3
Athlete            1        0      1
Total              11       9      20
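
The career-by-gender table can be built from the 20 responses by counting each (career, gender) pair. A minimal sketch; the function name `cross_tabulate` is an assumption for illustration.

```python
from collections import Counter

# (career seen as prestigious, gender) for students 1 to 20.
responses = [
    ("Doctor", "F"), ("Scientist", "M"), ("Military Officer", "M"),
    ("Military Officer", "F"), ("Doctor", "M"), ("Scientist", "F"),
    ("Military Officer", "M"), ("Athlete", "F"), ("Doctor", "F"),
    ("Scientist", "M"), ("Doctor", "F"), ("Military Officer", "M"),
    ("Scientist", "F"), ("Lawyer", "F"), ("Lawyer", "F"),
    ("Military Officer", "M"), ("Doctor", "F"), ("Scientist", "M"),
    ("Doctor", "F"), ("Lawyer", "M"),
]


def cross_tabulate(pairs):
    """Joint frequency distribution: {(row category, column category): count}."""
    return Counter(pairs)


table = cross_tabulate(responses)
careers = ["Doctor", "Scientist", "Military Officer", "Lawyer", "Athlete"]
for career in careers:
    female, male = table[(career, "F")], table[(career, "M")]
    print(f"{career:<18}{female:>3}{male:>3}{female + male:>3}")
```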

Two Variables: Bivariate Analysis
Bivariate analysis involves the analysis of two variables, X (the independent/explanatory variable) and Y (the dependent/outcome variable), to determine the relationship between them.

Bivariate Data Analysis
•Bivariate analysis explores how the dependent ("outcome") variable depends on or is explained by the independent ("explanatory") variable, or it explores the association between the two variables without assuming any cause and effect relationship.
•Example: What is the correlation between "Volume of Sales" and "Profit"?
•Bivariate analysis is slightly more analytical than univariate analysis. In a survey of a classroom, suppose we analyse the ratio of students who scored above 85% by gender. In this case, there are two variables: gender = X (IV) and result = Y (DV). A bivariate analysis will measure the correlation between the two variables.
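
For two numerical variables such as sales volume and profit, one common measure is the Pearson correlation coefficient. The slides do not give a formula, so the sketch below assumes the standard Pearson product-moment definition, and the sales and profit figures are hypothetical values invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their individual variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, not from the slides.
sales_volume = [10, 12, 15, 18, 20, 24]
profit = [2.0, 2.6, 3.1, 3.9, 4.2, 5.1]

r = pearson_r(sales_volume, profit)
print(f"r = {r:.3f}")  # close to +1 means a strong positive linear association
```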

Types of Bivariate Correlations
•Numerical and Numerical: both variables have a numerical value.
•Categorical and Categorical: both variables are in static form; such a variable is sometimes called a nominal variable.
•Numerical and Categorical: one variable is numerical, and the other is categorical.
Bivariate analyses could be used to answer the question of whether
•there is an association between income (Numerical) and expenditure (Numerical),
•there is an association between income (Numerical) and quality of life (Categorical),
•there is an association between social status (Categorical) and quality of life (Categorical).

Types of Bivariate Tests
•Analyze relationships: Chi-square, Scatter plots, Correlation Coefficient, Regression Analysis
•Measuring Difference: t test, One way ANOVA

Chi-Square Test
•The chi-square test is one of the most useful non-parametric (distribution-free) tools, and a statistic for testing hypotheses at the nominal level.
•Chi-square is a statistical test used to examine the differences between categorical variables across two or more independent groups from a random sample, in order to judge the goodness of fit between expected and observed results in the same population.
•The null hypothesis of the chi-square test is that no relationship exists between the categorical variables in the population; they are independent.
•Pearson's χ² test is the most commonly used test:
χ² = Σ (Oi - Ei)² / Ei, summed over i = 1, ..., n, where Oi is an observed frequency and Ei is the corresponding expected frequency.
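
A sketch of the Pearson χ² statistic, applied to the career-by-gender counts from the cross-tabulation example. The expected frequency of each cell is taken as (row total * column total) / grand total, which is the usual choice under the independence null hypothesis but is not spelled out on the slide. With counts this small the χ² approximation is unreliable, so the numbers only illustrate the arithmetic.

```python
# Observed (female, male) counts per career from the cross-tabulation example.
observed = {
    "Doctor": (5, 1),
    "Scientist": (2, 3),
    "Military Officer": (1, 4),
    "Lawyer": (2, 1),
    "Athlete": (1, 0),
}


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a rows-by-columns table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, o in enumerate(row):
            e = row_totals[r] * col_totals[c] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = chi_square_independence(observed)
print(f"chi-square = {stat:.4f}, df = {df}")
```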



Scatter Plot
Scatter plots are graphs that present the relationship between two variables in a data set. A scatter plot represents data points on a two-dimensional plane or on a Cartesian system. The IV or attribute is plotted on the X-axis, while the DV is plotted on the Y-axis.
•Form: is the association linear or nonlinear?
•Strength: is the association strong, moderately strong, or weak?
•Direction: is the association positive or negative?
•Outliers: are there data points unusually far away from the general pattern?
A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter diagram graphs numerical data pairs, with one variable on each axis, to show their relationship.

Scatter Plot: Example
[Figure: Three example scatter plots showing positive correlation, no correlation, and negative correlation.]

Correlation

Overview of Statistical Techniques
•Univariate Techniques: appropriate when there is a single measurement of each of the 'n' sample objects, or when there are several measurements of each of the 'n' observations but each variable is analyzed in isolation.
•Multivariate Techniques: a collection of procedures for analyzing the association between two or more sets of measurements that have been made on each object in one or more samples of objects. They are classified as dependence or interdependence techniques.

Multivariate Data Analysis
•Multivariate analysis is the analysis of three or more variables. There are many ways to perform multivariate analysis depending on your goals.
•More than two variables are analyzed together for any possible association or interactions. Example: What is the correlation between "Sales Volume", "Expenditure on promotion" and "Profit"?
•MVA is a more complex form of statistical analysis, as it requires understanding the relationship of each variable with each of the others.
•Commonly used multivariate analysis techniques include
Factor Analysis
Cluster Analysis
Variance Analysis
Discriminant Analysis
Multidimensional Scaling
Principal Component Analysis
Multiple Regression Analysis
Canonical Correlation Analysis
Conjoint Analysis
Structural Equation Modelling

Classification of Multivariate Technique
Dependence Technique
•One Dependent Variable: Cross Tabulation (more than two variables), Chi-Square, K-S & Binomial, ANOVA & ANCOVA, Multiple Regression, Two-Group Discriminant Analysis, Logit Analysis, Conjoint Analysis
•Two/More Dependent Variables: MANOVA & MANCOVA, Canonical Correlation, Structural Equation Modelling & Path Analysis
Interdependence Technique
•Variable Interdependence: Factor Analysis
•Inter-Object Similarity: Cluster Analysis, Multidimensional Scaling

Hypothesis Testing
Researchers are interested in answering many types of questions. These types of questions can be addressed through statistical hypothesis testing, which is a decision-making process for evaluating claims about a population.
•In hypothesis testing, the researcher must
1. Define the population under study,
2. State the particular hypotheses that will be investigated,
3. Give the significance level,
4. Select a sample from the population,
5. Collect the data,
6. Perform the calculations required for the statistical test, and
7. Reach a conclusion.
•Hypotheses concerning parameters such as means and proportions can be investigated.
•There are two specific statistical tests used for hypotheses concerning means: the z test and the t test.
•The hypothesis-testing procedure is used with the z test and the t test. In addition, a hypothesis-testing procedure using the chi-square distribution can test a single variance or standard deviation.

Types of Hypothesis Testing
Methods of hypothesis testing:
1. Traditional method
2. P-value method
3. Confidence interval method

Statistical Hypothesis
A statistical hypothesis is a conjecture about a population parameter. This conjecture may or may not be true.
•There are two types of statistical hypotheses for each situation: the null hypothesis and the alternative hypothesis.
•The null hypothesis, symbolized by H0, is a statistical hypothesis that states that there is no difference between a parameter and a specific value, or that there is no difference between two parameters.
•The alternative hypothesis, symbolized by H1, is a statistical hypothesis that states the existence of a difference between a parameter and a specific value, or states that there is a difference between two parameters.
Two-tailed test:     H0: µ = k    H1: µ ≠ k
Right-tailed test:   H0: µ = k    H1: µ > k
Left-tailed test:    H0: µ = k    H1: µ < k

State H0 and H1 for each conjecture.
1. A researcher thinks that if expectant mothers use vitamin pills, the birth weight of the babies will increase. The average birth weight of the population is 8.6 pounds.
Right-tailed test: H0: µ = 8.6, H1: µ > 8.6
2. An engineer hypothesizes that the mean number of defects can be decreased in a manufacturing process of compact disks by using robots instead of humans for certain tasks. The mean number of defective disks per 1000 is 18.
Left-tailed test: H0: µ = 18, H1: µ < 18
3. A psychologist feels that playing soft music during a test will change the results of the test. The psychologist is not sure whether the grades will be higher or lower. In the past, the mean of the scores was 73.
Two-tailed test: H0: µ = 73, H1: µ ≠ 73
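
To show how such hypotheses could be tested with the P-value method, here is a sketch of a one-sample z test for conjecture 1. The slides give no sample data, so the sample mean, population standard deviation, sample size, and significance level below are all hypothetical, and the function name `z_test` is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, tail):
    """One-sample z test. tail is 'right', 'left', or 'two'. Returns (z, p_value)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf(z)
    if tail == "right":
        p_value = 1 - cdf
    elif tail == "left":
        p_value = cdf
    else:
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value


# Conjecture 1 (right-tailed): H0: mu = 8.6, H1: mu > 8.6.
# Hypothetical sample: mean 8.9 pounds, sigma 1.2 pounds, n = 36 births.
z, p = z_test(sample_mean=8.9, mu0=8.6, sigma=1.2, n=36, tail="right")
alpha = 0.05  # hypothetical significance level
print(f"z = {z:.2f}, p-value = {p:.4f}, reject H0: {p < alpha}")
```

With these made-up numbers the p-value is larger than 0.05, so H0 would not be rejected; different sample data could lead to the opposite conclusion.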