Session 1 - Getting started with the R statistics package
Data analysis with R statistical software
Dr. Rob Thomas
[email protected]
Session 1 - Welcome to the course
Getting started with R
Save the Excel file called "Heights.xls" as a .csv file:
Excel > File > Save As > Save as type > choose "CSV (Comma delimited)"
Save this file as "Heights.csv"
Click "OK" and "Yes" when prompted
Open the R software on your computer
Ask R to read the Heights.csv file:
dframe1 <- read.csv(file.choose())   # navigate to your .csv file
names(dframe1)     # names of the variables
summary(dframe1)   # numerical summary
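A quick way to check that the file loaded correctly (a minimal sketch; it assumes Heights.csv contains columns such as Height, Age and Sex, as used in the slides below):
head(dframe1)   # first six rows of the data frame
str(dframe1)    # structure: variable names, types and a preview of the values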
Plotting graphs - two ways of specifying variables
Histograms:
with(dframe1, hist(Height))
hist(dframe1$Height)
Scatterplots:
with(dframe1, plot(Age, Height))
plot(dframe1$Height ~ dframe1$Age)
abline(lm(dframe1$Height ~ dframe1$Age))   # add the fitted regression line
?plot   # help file for the "plot" function
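For a more polished figure, axis labels and a title can be added; a sketch using the same Age and Height variables:
plot(Height ~ Age, data = dframe1,
     xlab = "Age", ylab = "Height (cm)", main = "Height against Age")
abline(lm(Height ~ Age, data = dframe1))   # overlay the fitted regression line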
Boxplots - with or without a "notch":
boxplot(dframe1$Height)
boxplot(dframe1$Height ~ dframe1$Sex)
boxplot(dframe1$Height ~ dframe1$Sex, notch = TRUE)
[Figure: notched boxplots of Height (cm), roughly 130-190 cm, for Female and Male]
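The same boxplot can be written in the formula-plus-data style, which avoids repeating dframe1$ (a sketch, matching the figure above):
boxplot(Height ~ Sex, data = dframe1, notch = TRUE,
        ylab = "Height (cm)")   # notched boxplots of height by sex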
Descriptive statistics
(i) Measures of location (averages)
• Arithmetic mean = sum / n = (1/n) × Σ xi
mean(dframe1$Height, na.rm = TRUE)
• Median = the middle value of a ranked dataset
median(dframe1$Height, na.rm = TRUE)
• Sample size (N)
length(dframe1$Height) - sum(is.na(dframe1$Height))
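An equivalent one-line way to count the non-missing values (a sketch, not from the original slide):
N <- sum(!is.na(dframe1$Height))   # each TRUE counts as 1, so this is the number of non-NA values
N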
(ii) Measures of variability
Sum of squares = Σ (observation - mean)²
…but the more data you have, the bigger your measure of spread
Variance = sum of squares / degrees of freedom = sum of squares / (n - 1)
var(dframe1$Height, na.rm = TRUE)
…but the units are not the same as the original measurements
Standard deviation (SD) = square root of the variance
…units are the same as the original measurements
sqrt(var(dframe1$Height, na.rm = TRUE))
or
sd(dframe1$Height, na.rm = TRUE)
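To see how these formulas fit together, here is a sketch that computes the sum of squares by hand and checks it against the built-in var() and sd():
x <- na.omit(dframe1$Height)      # heights with missing values removed
ss <- sum((x - mean(x))^2)        # sum of squares
ss / (length(x) - 1)              # variance: should match var(x)
sqrt(ss / (length(x) - 1))        # standard deviation: should match sd(x)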
Standard error (SE) = standard deviation / √n
A measure of how good our estimate of the mean is.
Surprisingly, R doesn't have a ready-made function for
calculating SE, so we calculate it ourselves:
SD <- sd(dframe1$Height, na.rm = TRUE)
N <- length(dframe1$Height) - sum(is.na(dframe1$Height))
SE <- SD / sqrt(N)
SE
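The same steps can be wrapped into a small reusable function (a sketch; the name se is our own choice, not a built-in):
se <- function(x) {
  x <- x[!is.na(x)]           # drop missing values
  sd(x) / sqrt(length(x))     # standard deviation divided by the square root of N
}
se(dframe1$Height)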
The standard error is a useful descriptive statistic in itself,
but can also be used to calculate confidence intervals
Confidence Intervals (CI)
We are 95% confident that the true population mean lies within
the 95% CI limits:
95% CI limits = mean ± (1.96 × SE)
99% CI limits = mean ± (2.58 × SE)
99.9% CI limits = mean ± (3.29 × SE)
lowerlimit <- mean(dframe1$Height, na.rm = TRUE) - 1.96 * SE
upperlimit <- mean(dframe1$Height, na.rm = TRUE) + 1.96 * SE
lowerlimit
upperlimit
interval.width <- upperlimit - lowerlimit
interval.width
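The three multipliers above (1.96, 2.58, 3.29) come from the normal distribution, so a general-purpose version could look them up with qnorm() for any confidence level. A sketch (ci.limits is a hypothetical name, not a built-in):
ci.limits <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  se.x <- sd(x) / sqrt(length(x))
  z <- qnorm((1 + level) / 2)    # 1.96 for 95%, 2.58 for 99%, 3.29 for 99.9%
  c(lower = mean(x) - z * se.x, upper = mean(x) + z * se.x)
}
ci.limits(dframe1$Height)                 # 95% CI
ci.limits(dframe1$Height, level = 0.99)   # 99% CI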
Statistical hypothesis testing
All stats tests ask the same question:
"Is the observed pattern real, or simply due to chance?"
Test statistic = variance explained by the model / variance not explained by the model
P-value = our estimate of the probability of Ho being true
The aim of hypothesis testing is to distinguish between:
• Patterns caused by random variation in a sample (Ho)
• Real biological patterns - differences or associations (H1)
Evaluating results
• Biological effect size
• Statistical significance & statistical power
• Rejecting H1 does not mean that Ho must be true!
[Figure: two overlapping normal distributions, x-axis in standard deviations]
t = difference between the 2 means / random variation within each group
t = (mean1 - mean2) / estimate of SE
2-sample t-test
t.test(dframe1$x, dframe1$y)
Assumptions of t-tests:
1. Normal distributions
2. Equal variances
How does a t-test work?
• If t = 0, Ho is likely to be true
• If t is very large, Ho is unlikely to be true
• Compare the observed value of t with the t-distribution for the
relevant degrees of freedom to obtain the probability of Ho being true
Pooled SE if variances are equal
Separate SE if variances are unequal
var.test(dframe1$x, dframe1$y)   # test whether the variances are equal
t.test(dframe1$x, dframe1$y)
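With the Heights data, the check-then-test sequence might look like this (a sketch; it assumes the Sex column from the boxplot slide, and uses t.test's var.equal argument to choose between pooled and separate SEs):
var.test(Height ~ Sex, data = dframe1)                    # F-test: are the variances equal?
t.test(Height ~ Sex, data = dframe1, var.equal = TRUE)    # pooled SE (classic 2-sample t-test)
t.test(Height ~ Sex, data = dframe1, var.equal = FALSE)   # separate SEs (Welch test, the default)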
Exercise: The Dr Ian Vaughan Memorial Spreadsheet
The master at work: Ian collecting data for his next t-test
Excel file in: Session 1 folder
Left-hand side:
Checking for normal distributions
Checking for homogeneity of variances
Right-hand side:
t-test for equal variances
t-test for unequal variances
Reporting a t-test
There are 5 things that you must ALWAYS state when reporting a
statistical test:
1. Name of the test e.g. 2-sample t-test
   1-tailed or 2-tailed test? 2-tailed test
2. Value of the test statistic t = 5.164
3. Sample sizes n = 119, 186
   or degrees of freedom d.f. = 303
4. Statistical significance P < 0.0001
   i.e. a significant difference between male and female heights
5. Effect size and direction (means ± confidence intervals)
   Males = 168.2 cm (166.5-169.8), Females = 159.8 cm (158.1-161.3)
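All of these numbers can be read off the object returned by t.test(); a sketch, again assuming the Height and Sex columns:
tt <- t.test(Height ~ Sex, data = dframe1)
tt$statistic   # value of the test statistic, t
tt$parameter   # degrees of freedom
tt$p.value     # statistical significance
tt$estimate    # the two group means (effect size and direction)
tt$conf.int    # confidence interval for the difference between the means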
Things to consider when evaluating the test of a hypothesis
1. Effect size
Is the size of the effect important?
2. Statistical significance
P = probability that Ho is true
Accept or reject Ho
P = 0.05 is the conventional cut-off between significant
and non-significant effects, but this is arbitrary!
e.g. P = 0.0499 is marginally significant
but P = 0.0501 is marginally non-significant
…even though these 2 results are nearly identical
3. Statistical power
[Figure: two scatterplots - coursework marks (%) against lectures attended, and love of statistics (index) against no. of stats workshops attended]
Question: Is this just random variation? (Ho)
Or are these statistically significant patterns? (HA)
Association = a relationship or correlation
i.e. What is the probability of finding these patterns if there is
no real relationship between the 2 variables?
Tests for associations between 2 continuous variables
How is correlation calculated?
Pearson correlation assumes:
1. Linear relationship & 2. Normal distribution
Covariance = Σ (observed x - mean of x) × (observed y - mean of y) / (n - 1)
Standardise the covariance by dividing by the standard deviations:
Pearson correlation r = covariance of x & y / (SDx × SDy)
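These two formulas translate directly into R; a sketch that assumes Age and Height as the pair of continuous variables:
ok <- complete.cases(dframe1$Age, dframe1$Height)   # rows with neither value missing
x <- dframe1$Age[ok]
y <- dframe1$Height[ok]
covxy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # covariance
r <- covxy / (sd(x) * sd(y))                                    # Pearson correlation
r
cor(x, y)   # built-in check: should match r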
[Table: example data for 5 students - love of statistics (index, 0-30) and workshops attended]
plot(x, y)
Assumption 1: Linear relationship? (r = correlation coefficient)
• Positive relationship: r = +1
• Negative relationship: r = -1
• No correlation: r = 0
• Weak negative relationship: r = -0.4
• Non-linear relationship? Try transforming one or both variables
to make the relationship linear
Assumption 2: Normal distributions?
If the data are normally distributed, or can be transformed
(squashed) to be normally distributed, use a Pearson's correlation
Otherwise, use a Spearman's rank correlation
…or a Kendall's tau correlation with small sample sizes (n < 7)
and/or lots of tied ranks
How to test for correlations
To do a Pearson correlation:
cor.test(dframe1$x, dframe1$y)
To do a Spearman rank or Kendall's tau correlation:
cor.test(dframe1$x, dframe1$y, method = "spearman")
cor.test(dframe1$x, dframe1$y, method = "kendall")
Reporting correlations
There are 6 things that you must ALWAYS state when reporting a
statistical test:
1. Name of the test e.g. Pearson's correlation
2. 1-tailed or 2-tailed test? 2-tailed test
3. Value of the test statistic r = 0.789
4. Sample size n = 18
   or degrees of freedom d.f. = 16
5. Statistical significance P < 0.0001
   i.e. a highly significant positive correlation
6. Effect size (note that the correlation coefficient r is also a
measure of the effect size / strength of the relationship)
The meaning of r²
r² = the proportion of the variation in one variable that
is "explained" by variation in the other.
e.g. Correlation between love of statistics and attendance:
r = 0.789
r² = 0.623
…so 0.623 (i.e. 62.3%) of the variation in love of
statistics is "explained by" variation in attendance
N.B. r² is only meaningful for Pearson's correlation r,
not for Spearman's rank or Kendall's tau correlations
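In R this is a one-line follow-up to a correlation (a sketch assuming Age and Height again):
r <- cor(dframe1$Age, dframe1$Height, use = "complete.obs")   # Pearson by default
r^2   # proportion of variation "explained"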
Finding a significant correlation does not prove cause and effect
e.g. correlation between CO2 and crime
But CO2 does not cause crime (or vice-versa)
Both are positively correlated with time
The "3rd variable problem"
[Diagram: CO2 and crime both correlated with a 3rd variable]
e.g. correlation between cannabis use & psychological problems
Does cannabis use cause psychological problems?
Or do psychological problems cause cannabis use?
Or is there a 3rd variable (e.g. income?)
that happens to be correlated with both?
Correlations can highlight areas for experimental research