Research needs good understanding of data analysis
Vikash Raj Satyal
([email protected])
Summarize your Data:
What to look for in the dataset?
If our study has a large data set, we (researchers) are interested to know:
•What the central value is,
•What the spread from the center is,
•What the shape and size of the data distribution are
Major economic dataset
Questions
•What is per capita GDP?
•Whose per capita GDP is this?
•Did you earn $1191 in this FY 142920 (Rs. 126,018)? (Rs 11,910 monthly)
•Nepali people have a per capita GDP about 55 times lower than that of the USA, and 165 times lower than that of the people of Monaco
Nepali Database
Research Paradigm
1. Set up the research hypothesis / refine it (literature review)
2. Develop instruments
3. Survey (collect data)
4. Statistical analysis
5. There is not enough evidence to support the research (alternative) hypothesis (H_A) = failure of the research hypothesis
5a. Report writing
6. Research hypothesis accepted: H_A is true
7. Report writing
Why do Dolpa and Mugu also have the highest annual growth rates?
Why do Achham and Palpa have some of the lowest growth rates?
[Map labels: Mugu, Dolpa]
•What is the general IQ of US university students?
•In the US the mean IQ for persons completing no more than a…..
•Bachelor’s degree 113 (80th centile)
•Master’s degree 117 (87th centile)
•PhD, LLD, MD 124 (95th centile)
Central Tendency in large sample data
In any large data set, the data tend to cluster around a center, so researchers focus on finding that central value.
Depending on the shape of the data distribution, the center is calculated differently, using different statistical formulas.
When….
Statistical way of measuring
the center of a data set
•Mean(AM, GM, HM, Weighted mean)
•Median
•Mode
•Partition values
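The measures listed above can be tried out with Python's standard `statistics` module. This is only a minimal sketch: the values and weights are hypothetical and chosen for illustration, and the quartiles use the module's default convention (other tools may use a slightly different one).

```python
import statistics

# Hypothetical sample values and weights, chosen only for illustration.
values = [2, 4, 4, 5, 7, 8, 12]
weights = [1, 1, 2, 1, 1, 1, 1]  # one weight per value

# The four kinds of mean named on the slide.
arithmetic_mean = statistics.mean(values)
geometric_mean = statistics.geometric_mean(values)
harmonic_mean = statistics.harmonic_mean(values)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Median: the middle value of the sorted data.
median = statistics.median(values)

# Mode: the most frequently occurring value.
mode = statistics.mode(values)

# Partition values: the quartiles Q1, Q2, Q3 split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4)

print(arithmetic_mean, geometric_mean, harmonic_mean, weighted_mean)
print(median, mode, (q1, q2, q3))
```

For positive data the three means are always ordered AM ≥ GM ≥ HM, and the second quartile Q2 coincides with the median.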
Use the median, not the mean, for:
(i) Open-end classes.
(ii) Data tables with unequal class intervals.
(iii) Data with several extreme values (outliers).
(iv) Ordinal qualitative data (in frequencies).
(v) Data that strongly lack normality.
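The outlier case can be illustrated with a small sketch. The income figures below are hypothetical; the point is only that a single extreme value moves the mean a long way while the median barely changes.

```python
import statistics

# Hypothetical incomes (arbitrary units), invented for illustration.
incomes = [18, 20, 21, 22, 24]
with_outlier = incomes + [500]  # add one extreme value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

Here the mean jumps from 21 to more than 100, while the median only moves from 21 to 21.5.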
The mode is the most frequently occurring value
•Less used
•Popular in business and industry
•The only way to locate the central value when data are nominal
(How many of type A were sold? What is the most preferred flavor of ice cream?)
Mode & symmetry
Which Average is better?
The AM is best for interval data; however, it should not be used:
•For highly skewed data
•In open-end classes
•When there are very large or very small items (outliers)
•For averaging ratios and rates of change
The median is the best average for:
•Open-end classes
•Skewed data or data with outliers
•Ordinal qualitative data, e.g. less honest, honest, very honest
The mode is used for qualitative nominal data and is frequently used in business and industry
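Why the AM is a poor choice for rates of change can be seen in a short sketch. The growth factors below are hypothetical; the geometric mean is used here as the usual alternative for averaging such ratios.

```python
import math
import statistics

# Hypothetical yearly growth factors: +10%, +20%, -10%.
growth_factors = [1.10, 1.20, 0.90]

# Actual cumulative growth over the three years.
total_growth = math.prod(growth_factors)

am = statistics.mean(growth_factors)
gm = statistics.geometric_mean(growth_factors)

# Compounding the geometric mean reproduces the true total; the AM overstates it.
print("total growth:      ", total_growth)
print("AM compounded 3x:  ", am ** 3)
print("GM compounded 3x:  ", gm ** 3)
```

Applying the geometric mean for three periods gives back the real cumulative growth, whereas the arithmetic mean gives a figure that is too high.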
Do the shape and size of the data matter?
•Skewness is the elongation of the left or right tail
•Skewness describes a dataset’s symmetry, or lack of symmetry.
•A perfectly symmetrical data set has a skewness of 0.
Skewness
•Negative (left) skewness indicates a longer left tail (extreme small values)
•Positive (right) skewness indicates a longer right tail (extreme large values)
•Kurtosis measures extreme values in either tail.
•The normal curve has no excess kurtosis (its kurtosis equals 3)
•Kurtosis is measured by comparison with the normal curve
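A minimal sketch of how skewness and kurtosis can be computed from the central moments is given below. It uses the population-moment definitions, with kurtosis reported as excess over the normal curve's value of 3; the data sets and the function name `moment_shape` are hypothetical. Spreadsheet functions such as Excel's SKEW and KURT apply a sample-size correction, so their output may differ slightly from these values.

```python
import statistics


def moment_shape(data):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal curve
    return skewness, excess_kurtosis


# Hypothetical data sets, invented for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # long right tail
left_skewed = [-x for x in right_skewed]    # mirror image: long left tail

skew_sym, kurt_sym = moment_shape(symmetric)
skew_right, kurt_right = moment_shape(right_skewed)
skew_left, kurt_left = moment_shape(left_skewed)

print(skew_sym, skew_right, skew_left)
```

The symmetric set gives a skewness of 0, the right-tailed set a positive value, and its mirror image the same value with a negative sign.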
Calculation in EXCEL:
Statistics for lexp (life expectancy) using NHDR2014.xlsx
Calculation in EXCEL:
Statistics for price using auto.xlsx
Use the data nhdr2014 to calculate the following:
1. Average life expectancy (‘life’)
2. Average GDP per capita (‘income’)
3. Average life expectancy (‘life’) of the 3 ecologies (e.g., average life(mountain) = ….)
4. Calculate Q1, Q2, Q3 of ‘income’
5. Using the 3 quartiles of ‘income’, we can divide the data into 4 equal parts.
Make a new variable, call it ‘groups’, with 4 value labels according to the
criteria below:
‘poor’ if below Q1
‘below average’ if between Q1 and Q2
‘above average’ if between Q2 and Q3
‘rich’ if above Q3
6. Find the average of ‘life’ & ‘hdi’ for each of the 4 groups of this newly created variable
7. How many ‘districts’ fall in each of these ‘groups’? And which districts have the
highest & lowest ‘life’ values in each of these 4 ‘groups’?
8. Save this data for your future use
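The quartile-grouping steps (4 to 7) can be sketched in Python as follows. This is not the Excel workflow itself: the rows are invented stand-ins for nhdr2014, the district names D1 to D8 are hypothetical, and the "inclusive" quartile method is an assumption (other quartile conventions give slightly different cut points).

```python
import statistics

# Hypothetical rows standing in for nhdr2014; all names and numbers are invented.
rows = [
    {"district": "D1", "income": 500, "life": 61.0},
    {"district": "D2", "income": 700, "life": 63.5},
    {"district": "D3", "income": 900, "life": 65.0},
    {"district": "D4", "income": 1100, "life": 66.5},
    {"district": "D5", "income": 1300, "life": 68.0},
    {"district": "D6", "income": 1500, "life": 69.0},
    {"district": "D7", "income": 1700, "life": 70.5},
    {"district": "D8", "income": 1900, "life": 72.0},
]

# Step 4: the three quartiles of 'income'.
q1, q2, q3 = statistics.quantiles(
    [r["income"] for r in rows], n=4, method="inclusive"
)


def income_group(income):
    """Step 5: label an income by the quartile band it falls in."""
    if income < q1:
        return "poor"
    if income < q2:
        return "below average"
    if income < q3:
        return "above average"
    return "rich"


# Steps 6 and 7: per-group count, mean 'life', and extreme districts.
# Assumes every group has at least one member, which holds for this toy data.
summary = {}
for label in ("poor", "below average", "above average", "rich"):
    members = [r for r in rows if income_group(r["income"]) == label]
    summary[label] = {
        "districts": len(members),
        "mean_life": statistics.fmean([m["life"] for m in members]),
        "lowest_life": min(members, key=lambda m: m["life"])["district"],
        "highest_life": max(members, key=lambda m: m["life"])["district"],
    }

for label, stats in summary.items():
    print(label, stats)
```

The same pattern extends to ‘hdi’ by adding that field to each row and averaging it alongside ‘life’.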
Dispersion in data is meaningful
Central value alone can disguise the picture
Variability is the beauty of wild nature
•Geographical variation generates variety in species of flora and fauna
•Ethnography: cultural diversity
•Epidemiology studies variation in disease
How to measure data dispersion?
Range
Standard Deviation
Quartile Deviation
Coefficient of variation
1. Range
Range = Largest value − Smallest value
•A high range in temperature contributes to desertification
•Range of mobile sets
•Range of social disparity
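As a small illustration, the range can be computed alongside the quartile deviation from the list of dispersion measures. The temperatures are hypothetical, and the quartile deviation is taken here in its usual sense of half the distance between Q3 and Q1, a definition the slides do not spell out.

```python
import statistics

# Hypothetical temperature readings, invented for illustration.
temps = [12, 15, 18, 21, 24, 30, 41]

# Range: largest value minus smallest value.
data_range = max(temps) - min(temps)

# Quartile deviation (semi-interquartile range): (Q3 - Q1) / 2.
q1, _, q3 = statistics.quantiles(temps, n=4)
quartile_deviation = (q3 - q1) / 2

print(data_range, quartile_deviation)
```

The range depends only on the two extreme readings, while the quartile deviation ignores them and describes the spread of the middle half of the data.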
3. Variance & Standard Deviation
•Most popular measure of variation
•It uses all observations
•Std (standard deviation) is the square root of the variance
•Std = $\sqrt{\text{Variance}}$
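Sample VS population VARIANCE
For a population (individual data):
$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N} = \dfrac{\sum X^2}{N} - \left(\dfrac{\sum X}{N}\right)^2$
For a population (grouped data, with $N = \sum f$):
$\sigma^2 = \dfrac{\sum f (X - \mu)^2}{N} = \dfrac{\sum f X^2}{N} - \left(\dfrac{\sum f X}{N}\right)^2$
For a sample:
$S^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$
Also,
$S^2 = \dfrac{n}{n-1}\,\sigma^2 = \dfrac{n}{n-1}\left[\dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2\right]$
where $\sigma^2$ here denotes the variance of the sample values computed with divisor $n$.
When n → ∞, sample mean → population mean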
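The two divisors can be compared directly with a short sketch. The data are hypothetical; `statistics.pvariance` divides by n and `statistics.variance` divides by n − 1, so the relation between the two variances stated above can be checked numerically.

```python
import statistics

# Hypothetical observations, invented for illustration.
data = [4, 8, 6, 5, 7]
n = len(data)

pop_var = statistics.pvariance(data)    # divisor n
sample_var = statistics.variance(data)  # divisor n - 1

# Shortcut form: (sum of squares)/n minus (mean) squared.
shortcut = sum(x * x for x in data) / n - (sum(data) / n) ** 2

# Sample variance recovered from the divisor-n variance.
rescaled = n / (n - 1) * pop_var

print(pop_var, sample_var, shortcut, rescaled)
```

With these five values the divisor-n variance is 2 and the sample variance is 2.5, and the factor n/(n − 1) shrinks toward 1 as n grows.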
Example: Variance and std of the life of electric bulbs (in hours)

Length of life   No. of bulbs
500–700          5
700–900          11
900–1100         26
1100–1300        10
1300–1500        8

Length of life   No. of bulbs (f)   Mid-value (X)   fX      fX²
500–700          5                  600             3000    1800000
700–900          11                 800             8800    7040000
900–1100         26                 1000            26000   26000000
1100–1300        10                 1200            12000   14400000
1300–1500        8                  1400            11200   15680000
SUM              60                                 61000   64920000

Mean = 1016.67
Variance = 48388.89
Std = 219.9747
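The grouped-data calculation can be reproduced with a few lines of Python. This sketch uses the class mid-values and frequencies from the bulb example and the population (divisor N) shortcut formula; the variable names are only illustrative.

```python
import math

# Class mid-values (X) and frequencies (f) from the bulb-life example.
midpoints = [600, 800, 1000, 1200, 1400]
freqs = [5, 11, 26, 10, 8]

n = sum(freqs)                                               # N = sum of f
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))        # sum of fX
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))   # sum of fX^2

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2  # population variance, shortcut form
std = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std, 4))
```

The column totals and the three results agree with the figures in the example.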
4. Coefficient of Variation(C.V.)
The coefficient of variation is a relative measure based on the
standard deviation. It is defined as the ratio of the standard
deviation to the mean, expressed in percent.
$\text{C.V.} = \dfrac{\sigma}{\mu} \times 100\%$
It is used to compare the compactness of two or more data sets
A smaller C.V. indicates more consistent, less variable data
C.V. is unit-less, so data in the same or different units can be compared
by it, e.g. weights in kg and in pounds
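The unit-less property can be checked with a small sketch. The weights are hypothetical, the helper name `coefficient_of_variation` is only illustrative, and the population standard deviation is used as in the definition.

```python
import statistics


def coefficient_of_variation(data):
    """C.V. in percent: population standard deviation divided by the mean."""
    return statistics.pstdev(data) / statistics.fmean(data) * 100


# Hypothetical body weights in kilograms, invented for illustration.
weights_kg = [50.0, 60.0, 70.0, 80.0]

# The same weights converted to pounds.
KG_TO_LB = 2.20462
weights_lb = [w * KG_TO_LB for w in weights_kg]

cv_kg = coefficient_of_variation(weights_kg)
cv_lb = coefficient_of_variation(weights_lb)

print(round(cv_kg, 2), round(cv_lb, 2))
```

Changing the unit rescales the mean and the standard deviation by the same factor, so their ratio, and hence the C.V., stays the same.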
Which type of electric bulb has better consistency in life span?

Length of life   No. of bulbs (alpha, fa)   No. of bulbs (beta, fb)
500–700          5                          4
700–900          11                         30
900–1100         26                         12
1100–1300        10                         8
1300–1500        8                          6

Length of life   fa   fb   Mid-value (X)   Xfa     Xfb     X²fa       X²fb
500–700          5    4    600             3000    2400    1800000    1440000
700–900          11   30   800             8800    24000   7040000    19200000
900–1100         26   12   1000            26000   12000   26000000   12000000
1100–1300        10   8    1200            12000   9600    14400000   11520000
1300–1500        8    6    1400            11200   8400    15680000   11760000
SUM              60   60                   61000   56400   64920000   55920000

mean(a) = 1016.7   mean(b) = 940.0
std(a) = 220.0     std(b) = 220.0
CV(a) = 21.64%     CV(b) = 23.4%
Type alpha has the smaller C.V., so its life span is more consistent.
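The comparison of the two bulb types can be reproduced in Python. This sketch takes the mid-values and frequencies from the table and uses the population (divisor N) variance; the function name `grouped_cv` is only illustrative.

```python
import math

# Class mid-values and frequencies for the two bulb types, from the table.
midpoints = [600, 800, 1000, 1200, 1400]
freq_alpha = [5, 11, 26, 10, 8]
freq_beta = [4, 30, 12, 8, 6]


def grouped_cv(midpoints, freqs):
    """Return (mean, std, C.V. in percent) for grouped data, divisor N."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    variance = sum(f * x * x for f, x in zip(freqs, midpoints)) / n - mean ** 2
    std = math.sqrt(variance)
    return mean, std, std / mean * 100


mean_a, std_a, cv_a = grouped_cv(midpoints, freq_alpha)
mean_b, std_b, cv_b = grouped_cv(midpoints, freq_beta)

print(round(mean_a, 1), round(std_a, 1), round(cv_a, 2))
print(round(mean_b, 1), round(std_b, 1), round(cv_b, 2))
```

The two types have almost the same standard deviation, so the difference in C.V. comes from their different means.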
Hans Rosling
(27 July 1948 –7 February 2017)
One of the most admired TED speakers
Swedish epidemiologist with a strong talent for data exploration
Gapminder foundation
2014: second visit to Nepal, from UNESCO
How not to be ignorant /The Joy of Statistics
(first 5 minutes of the 1-hour video)
http://www.gapminder.org/videos/the-joy-of-stats/