Myths in science & statistics s ha h.ppt

amitbajhaiya 13 views 50 slides Sep 06, 2024


Myths and Statistical Principles in
DNA Microarray Research
Richard Simon, D.Sc.
Chief, Biometric Research Branch
Head, Molecular Statistics & Bioinformatics
National Cancer Institute

•All cells of a multi-cellular organism contain
essentially the same DNA
•Cells differ in function based on the spectra of
which genes are expressed and the level of
expression
•Proteins do the work of cells and gene expression
determines the intra-cellular concentration of
proteins
•mRNA is an intermediate product of gene
expression; a gene is transcribed into an mRNA
molecule, which is then translated into a protein
molecule

Types of DNA Microarrays
•mRNA transcript quantification
•Genomic DNA sequence determination
–SNP identification
–Genotyping
•Detecting gene deletions or gene
duplications

Types of Microarrays
•DNA microarrays
•Tissue microarrays
•Protein microarrays

[Affymetrix] Hybridization Array

Biology in Transition
•Biotechnology
–Restriction enzymes
–Ligases
–Polymerases
–PCR
•Instruments, Tools, Reagents and Information
Resources of Major Impact
–DNA sequencing
–Functional whole genomic assays

How to Deal With the Plethora of
Data
•Development of software tools
•Training of biologists to use tools
•Collaboration with mathematical &
computational scientists
•Training of mathematical & computational
scientists

Bioinformatics
•An ambiguous term that helps further confuse
people who are sometimes already confused
•Refers to a range of activities all of which involve
multi-disciplinary collaboration among biological,
mathematical, computational scientists and
software engineers
•Organizations searching for structures that will
support quality inter-disciplinary research in
bioinformatics

Organizing for Bioinformatics
•Collaborative, not service oriented
•Enable extensive interaction and education
•Enable scientists to be stimulated by
important problems and to accomplish
organizational and personal goals in solving
them

Molecular Statistics &
Bioinformatics Section
•Utilize mathematical and computational
sciences in conjunction with data from
genomics & high-throughput technologies to
elucidate the biological basis of cancer
–translating this to effective means of eradicating
cancer
•Train statisticians, mathematicians, physical
and biological scientists in cancer
computational biology

Microarray Research
•Collaborative data analysis
•Methodology development
•Software development

Microarray Myths
•That the greatest challenge is managing the mass of
micro-array data
•That pattern-recognition or data mining are the most
appropriate paradigm for the analysis of micro-array
data
•That pre-packaged analysis tools are a substitute for
collaboration with statistical scientists in complex
problems
•That statistical collaboration can be a service function
•That statisticians can be effective collaborators
without substantial knowledge of biology and
microarray technology

Applications of DNA
Microarrays to Cancer Research
•Identify genes and pathways involved in
oncogenesis
–Transgenic mouse models
–Profiling pre-cancerous lesions
•Identifying molecular targets for
–therapeutics
–early detection

Applications of DNA
Microarrays to Cancer Research
•Diagnostic classification
–For identifying disease subsets with distinctive
pathogenesis
–For selecting therapy
•Large cell lymphoma
•Stage I breast cancer

DNA Microarray Analytics
•Design issues
–Arrays
–Specimens
•Labeling
•Replication
•Image analysis
–Pixels to feature
•Feature analysis
–Background
adjustment
–Normalization
–Features to genes
–Normalization
•Analysis of biological
objectives

Method of Analysis Should Be
Tailored to Objectives
•Class discovery
–Identifying expression profiles characteristic of
non-predefined subsets of tumors
•Class/phenotype prediction
–Identifying expression profiles that distinguish
predefined subsets of tumors

Components of Class Prediction
•Establish that expression “profiles” differ to a
statistically significant degree and that differences
observed are not due to examination of thousands
of genes
•Identify genes that account for the differences
between classes
•Develop multi-gene classifier to predict the class
for a new sample and estimate the mis-
classification rates

Do Expression Profiles Differ for
Two Defined Classes of Arrays?
•Not a clustering problem
–Global similarity measures generally used for
clustering arrays may not distinguish classes
•Supervised vs unsupervised methods
•Requires multiple biological samples from
each class

Do Expression Profiles Differ for
Two Defined Classes of Arrays?
•Global test
–Number of genes significantly differentially expressed
among classes at specified nominal significance level
–Cross-validated mis-classification rate
•Multiple comparison adjustment for finding
differentially expressed genes
–Experiment-wise error
–Univariate screening with p<0.001 threshold
–False discovery rate
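
As a rough illustration of the multiple-comparison options listed above, the sketch below computes the expected number of false positives for univariate screening at a fixed threshold and applies the Benjamini-Hochberg step-up procedure. The slide does not say which false discovery rate method is meant; Benjamini-Hochberg is assumed here as one common choice, and the function names and example p-values are hypothetical.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of null genes passing a per-gene threshold alpha."""
    return n_genes * alpha


def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected at FDR level q (assumed BH step-up rule)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up bound.
        if p_values[i] <= q * rank / n:
            cutoff = rank
    return sorted(order[:cutoff])


# Illustrative p-values only, not data from the presentation.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(expected_false_positives(6000, 0.001))
print(benjamini_hochberg(example_p, q=0.05))
```

With 6000 genes and a p<0.001 threshold, about 6 genes are expected to pass by chance alone when no gene is truly differentially expressed.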

[Figure: the full data set, a matrix of specimens by log-expression ratios,
shown once whole and once divided into a training set and a test set]
Non-cross-validated Prediction
1. Prediction rule is built using full data set.
2. Rule is applied to each specimen for class prediction.
Cross-validated Prediction (Leave-one-out method)
1. Full data set is divided into training and test sets (test set contains 1 specimen).
2. Prediction rule is built using the training set.
3. Rule is applied to the specimen in the test set for class prediction.
4. Process is repeated until each specimen has appeared once in the test set.

Prediction on Simulated Null Data
Generation of Gene Expression Profiles
• 14 specimens (P_i is the expression profile for specimen i)
• Log-ratio measurements on 6000 genes
• P_i ~ MVN(0, I_6000)
• Can we distinguish between the first 7 specimens (Class 1) and the last 7
(Class 2)?
Prediction Method
• Compound covariate prediction (discussed later)
• Compound covariate built from the log-ratios of the 10 most differentially
expressed genes.
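
The simulation described above can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: it assumes a pooled-variance two-sample t-statistic, codes the two classes as 0 and 1, and repeats gene selection inside every leave-one-out fold. All function names and the random seed are hypothetical.

```python
import numpy as np


def t_statistics(x, labels):
    """Pooled-variance two-sample t-statistic for each gene (column)."""
    a, b = x[labels == 0], x[labels == 1]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))


def fit_ccp(x, labels, n_top=10):
    """Build a compound covariate from the n_top most differentially expressed genes."""
    t = t_statistics(x, labels)
    genes = np.argsort(-np.abs(t))[:n_top]
    weights = t[genes]
    scores = x[:, genes] @ weights
    # Classification threshold: midpoint of the two class means of the covariate.
    threshold = (scores[labels == 0].mean() + scores[labels == 1].mean()) / 2
    return genes, weights, threshold


def predict_ccp(model, x):
    genes, weights, threshold = model
    # t is mean(class 0) minus mean(class 1), so class 0 has the larger scores.
    return np.where(x[:, genes] @ weights > threshold, 0, 1)


def resubstitution_errors(x, labels):
    """Non-cross-validated: build the rule on all specimens, then re-apply it."""
    return int((predict_ccp(fit_ccp(x, labels), x) != labels).sum())


def loocv_errors(x, labels):
    """Leave-one-out: gene selection and rule building use the training set only."""
    errors = 0
    for i in range(len(labels)):
        train = np.arange(len(labels)) != i
        model = fit_ccp(x[train], labels[train])
        errors += int(predict_ccp(model, x[i:i + 1])[0] != labels[i])
    return errors


rng = np.random.default_rng(0)           # arbitrary seed
labels = np.array([0] * 7 + [1] * 7)     # first 7 vs last 7 specimens
x = rng.standard_normal((14, 6000))      # null data: no real class difference
resub = resubstitution_errors(x, labels)
cv = loocv_errors(x, labels)
print("non-cross-validated errors:", resub)
print("cross-validated errors:", cv)
```

Because the 10 genes are picked using the class labels, re-applying the rule to the same specimens tends to look nearly perfect even on pure noise, while the leave-one-out count stays close to chance.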

Percentage of simulated data sets with m or fewer misclassifications

 m    Non-cross-validated    Cross-validated
      class prediction       class prediction
 0         99.85                  0.60
 1        100.00                  2.70
 2        100.00                  6.20
 3        100.00                 11.20
 4        100.00                 16.90
 5        100.00                 24.25
 6        100.00                 34.00
 7        100.00                 42.55
 8        100.00                 53.85
 9        100.00                 63.60
10        100.00                 74.55
11        100.00                 83.50
12        100.00                 91.15
13        100.00                 96.85
14        100.00                100.00

Exact Permutation Test
Premise: Under the null hypothesis of no systematic difference in
expression profiles between the two classes, it can be assumed that
assignment of class labels to expression profiles is purely coincidental.
Performing the test
1. Consider every possible permutation of the class labels among the
gene expression profiles.
2. Determine the proportion of the permutations that result in a
misclassification error rate less than or equal to the observed error
rate.
3. This proportion is the achieved significance level in a test of the
null hypothesis.

Monte Carlo Permutation Test
Examining all permutations is computationally burdensome.
Instead, a Monte Carlo method is used:
• n_perm permutations of the labels are randomly generated.
• The proportion of these permutations that have m or fewer
misclassifications is an estimate of the achieved significance
level in a test of the null hypothesis.
• n_perm is chosen such that the variability in the estimate is less
than an acceptable level.
• If the true proportion of permutations with m ≤ 2 is 0.05,
n_perm = 2000 ensures the coefficient of variation of the
estimate of the achieved significance level is less than 0.1.
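
A minimal sketch of the Monte Carlo permutation test follows, under stated assumptions. The error-counting function is passed in as a parameter. The one-gene nearest-mean rule used here is only a toy stand-in for a real cross-validated classifier, and all names and values are hypothetical. The coefficient of variation uses the standard binomial formula sqrt((1 - p) / (n_perm * p)).

```python
import math
import random


def estimate_cv(p, n_perm):
    """Coefficient of variation of a proportion estimated from n_perm permutations."""
    return math.sqrt((1 - p) / (n_perm * p))


def loo_midpoint_errors(values, labels):
    """Toy leave-one-out rule: assign each value to the class with the nearer training mean."""
    errors = 0
    for i, (v, y) in enumerate(zip(values, labels)):
        means = {}
        for c in (0, 1):
            rest = [u for j, (u, z) in enumerate(zip(values, labels))
                    if j != i and z == c]
            means[c] = sum(rest) / len(rest)
        predicted = min(means, key=lambda c: abs(v - means[c]))
        errors += int(predicted != y)
    return errors


def monte_carlo_p_value(error_count, profiles, labels, n_perm=2000, seed=0):
    """Proportion of random label permutations with no more errors than observed."""
    rng = random.Random(seed)
    observed = error_count(profiles, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if error_count(profiles, shuffled) <= observed:
            hits += 1
    return hits / n_perm


# Illustrative one-gene data with two well separated classes (not from the slides).
values = [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16]
labels = [0] * 7 + [1] * 7
print(monte_carlo_p_value(loo_midpoint_errors, values, labels))
print(estimate_cv(0.05, 2000))  # about 0.0975, below the 0.1 target
```

The key point is that the whole cross-validated procedure is re-run for every permuted labeling, so the reference distribution reflects the same gene selection and rule building as the observed result.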

Gene-Expression Profiles in
Hereditary Breast Cancer
• Breast tumors studied:
7 BRCA1+ tumors
8 BRCA2+ tumors
7 sporadic tumors
• Log-ratio measurements of
3226 genes for each tumor
after initial data filtering
cDNA Microarrays
Parallel Gene Expression Analysis
RESEARCH QUESTION
Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from
BRCA2– cancers based solely on their gene expression profiles?

The Compound Covariate Predictor (CCP)
•We consider only genes that are differentially expressed between
the two groups (using a two-sample t-test with small α).
•The CCP
–Motivated by J. Tukey, Controlled Clinical Trials, 1993
–Simple approach that may serve better than complex multivariate
analysis
–A compound covariate is built from the basic covariates (log-ratios):
CCP_i = \sum_j t_j x_{ij}
where t_j is the two-sample t-statistic for gene j,
x_{ij} is the log-ratio measure of sample i for gene j,
and the sum is over all differentially expressed genes.
•Threshold of classification: midpoint of the CCP means for the two
classes.

Classification of hereditary breast cancers with the compound covariate predictor

Class labels          Number of          m = number of             Proportion of random
                      differentially     misclassifications        permutations with m or
                      expressed genes                              fewer misclassifications
BRCA1+ vs. BRCA1−     9                  1 (0 BRCA1+, 1 BRCA1−)    0.004
BRCA2+ vs. BRCA2−     11                 4 (3 BRCA2+, 1 BRCA2−)    0.043

Accuracy of class prediction as selection stringency increases

          BRCA1 Classification    BRCA2 Classification
α         ngene      m            ngene      m
10^-2     182        3            212        4
10^-3     53         2            49         3
10^-4     9          1            11         4

Advantages of Compound
Covariate Classifier
•Good feature selection
•Does not over-fit data
–Incorporates influence of multiple predictive
variables without attempting to select the best
small subset of variables
–Does not attempt to model the multivariate
interactions among the predictors and outcome

Extensions
•Adjustment for covariates
•Paired samples
•Survival data
•Other classification methods
•More than 2 classes

[Figure: hierarchical clustering dendrogram of the arrays; vertical axis: 1 - centered correlation]
4922x58 stage-4 tumor arrays and normals

[Figure: hierarchical clustering dendrogram of the tumor arrays; vertical axis: 1 - centered correlation]
Genes are significantly associated with survival

Survival Analysis by Clustering Patients With Genes Significantly Associated with Survival
[Figure: survival curves for two groups; vertical axis: survival (0.0 to 1.0); horizontal axis ticks from 0 to 1500]
Genes are chosen at .001 significance level
good prognosis, N=34
poor prognosis, N=24

Cross-validation Survival Analysis by Clustering Patients with Genes Significantly Associated with Survival
[Figure: survival curves for two groups; vertical axis: survival (0.0 to 1.0); horizontal axis ticks from 0 to 1500]
Withheld observation is classified by the K nearest neighbor rule (K=5)
Genes are chosen at .001 significance level
good prognosis, N=32
poor prognosis, N=26

Class Discovery
•For determining whether a set of tumors is
homogeneous with regard to expression
profile

Class Discovery Methods
•Cluster analysis
•Multi-dimensional Scaling

Melanoma Gene Expression Data
[Figure: hierarchical clustering dendrogram of the melanoma samples; vertical axis: 1 - correlation]
19 tumor cluster of interest
Q: Can gene expression profiles of melanoma be used to distinguish
sub-classes of disease? (M. Bittner et al., Nature Genetics Aug 2000)

Validation of Clusters
•Clustering algorithms find clusters, even
when they are spurious
•Clusters found may change with re-assaying
tumors or selection of new tumors

Clustering Arrays
•Cluster significance
•Cluster reproducibility

Cluster reproducibility
•Add perturbation noise to original data
•Re-cluster perturbed data to assess stability of
original clusters
•D: Proportion of pairs of samples in a specified
cluster of the original data that are in separate
clusters after perturbation
•R: Average number of specimens lost or gained in
a specified cluster || CP(C) - CP(C) ||
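
The D measure above can be sketched as a small function, assuming the original and perturbed clusterings are each given as a list of cluster labels indexed by sample. The function name and the example labelings are hypothetical. The perturbation and re-clustering steps are left to whatever clustering method is in use.

```python
from itertools import combinations


def pair_discrepancy(original, perturbed, cluster):
    """Proportion of sample pairs from one original cluster that are split after perturbation."""
    members = [i for i, c in enumerate(original) if c == cluster]
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # a single-member cluster has no pairs to split
    split = sum(perturbed[i] != perturbed[j] for i, j in pairs)
    return split / len(pairs)


# Illustrative labelings only: sample 2 leaves its original cluster.
original = [0, 0, 0, 1, 1]
perturbed = [5, 5, 7, 7, 7]
print(pair_discrepancy(original, perturbed, cluster=0))
print(pair_discrepancy(original, perturbed, cluster=1))
```

In practice D would be averaged over many perturbed copies of the data, so that a value near 0 indicates a cluster that survives measurement noise.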

Melanoma Data:
mn-error Method - Individual Clusters

k    Cluster Membership    m       n
7    5-24                  0.00    0.00
8    5                     0.00    5.13
8    6-24                  0.00    0.27

[Figure: hierarchical clustering dendrogram of the melanoma samples; vertical axis ticks from 0.2 to 0.8]

Test of Cluster Significance
•Multivariate Gaussian null hypothesis
•Project to subspace determined by first three principal
components
•Compute EDF of nearest neighbor Euclidean distances
between samples
•Compare the NN EDF observed to that expected under
the null distribution using a squared difference
discrepancy metric
•Estimate null distribution by sampling from 3D Gaussian
distribution with mean and covariance matrix
corresponding to observed data
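
The steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the BRB implementation. The principal components are taken from an SVD of the column-centered data. The expected nearest-neighbor EDF is estimated as the average over simulated null data sets. The significance level is the proportion of null data sets whose discrepancy is at least as large as the observed one. Function names, the grid size, and the number of null samples are hypothetical.

```python
import numpy as np


def nn_distances(points):
    """Euclidean distance from each point to its nearest neighbor."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1)


def edf(sample, grid):
    """Empirical distribution function of sample evaluated on grid."""
    return (sample[:, None] <= grid[None, :]).mean(axis=0)


def cluster_significance(x, n_null=200, n_components=3, seed=0):
    rng = np.random.default_rng(seed)
    centered = x - x.mean(axis=0)
    # Project onto the first principal components (rows of vt).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    mean, cov = scores.mean(axis=0), np.cov(scores, rowvar=False)
    observed_nn = nn_distances(scores)
    # Null data: Gaussian with the observed mean and covariance.
    null_nn = np.array([
        nn_distances(rng.multivariate_normal(mean, cov, size=len(x)))
        for _ in range(n_null)
    ])
    grid = np.linspace(0.0, max(null_nn.max(), observed_nn.max()), 100)
    null_edfs = np.array([edf(d, grid) for d in null_nn])
    expected = null_edfs.mean(axis=0)
    # Squared-difference discrepancy between each EDF and the expected null EDF.
    observed_stat = ((edf(observed_nn, grid) - expected) ** 2).sum()
    null_stats = ((null_edfs - expected) ** 2).sum(axis=1)
    return float((null_stats >= observed_stat).mean())


# Illustrative data: two tight, well separated groups of 15 samples each.
demo_rng = np.random.default_rng(1)
demo = demo_rng.normal(0.0, 0.02, size=(30, 5))
demo[15:, 0] += 10.0
print(cluster_significance(demo))
```

A small value suggests that samples sit closer to their nearest neighbors than a single Gaussian cloud would predict, which is evidence of real clustering.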

BRB ArrayTools:
An integrated package for the
analysis of DNA microarray data
http://linus.nci.nih.gov/BRB-ArrayTools.html

BRB ArrayTools
Design Objectives
•Easy user interface
–Excel front-end
•Ease of data loading
–integrated
•Drill-down linkage to
genomic databases
•Educating biologists in
microarray data
analysis
•Powerful analytic &
visualization tools
•Easily extensible
–R backend
•Portable
–Non-proprietary
•Ease of development
–R back-end

Collaborators
•Molecular Statistics & Bioinformatics
–Kevin Dobbin
–Lisa McShane
–Amy Peng
–Michael Radmacher
–Joanna Shih
–George Wright
–Yingdong Zhao