Data Mining
●How can one find all the members of a human gene family?
●For a given protein, how can one determine whether it
contains any functional domains of interest?
●How does one find a gene of interest and determine that
gene's structure, and how does one easily examine other genes
in that same region?
WHAT KIND OF INFORMATION ARE YOU MINING?
DATA MINING
•uses informatics and statistics
•helps extract information from a
huge amount of data
•is now accessible to everyone
Data
•Publicly-available from Lambert Lab at
http://lambertlab.uams.edu/publicdata.htm
•105 samples run on Affymetrix HuGeneFL arrays
•74 Myeloma samples
•31 Normal samples
Three main data browsers
I.University of California, Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/)
II.National Center for Biotechnology Information's
Map Viewer (http://www.ncbi.nlm.nih.gov/)
III.European Molecular Biology Laboratory -
European Bioinformatics Institute: Ensembl
(http://www.ensembl.org)
3 levels in data mining
I.Single-query analysis (-> genome browser)
II.Selection of a set of genes that meet a criterion (->
"sister programs")
III.More in-depth analysis (-> R/Bioconductor,
biomaRt, ...)
The genome browsers: UCSC & Ensembl
I.UCSC (University of California, Santa Cruz)
•Gene Sorter
•Table Browser
II.Ensembl
•BioMart
UCSC
Gene Sorter
•Explore gene families and the relationships among genes
•Select genes based on several characteristics
UCSC
Table Browser
•Query data using the database structure
Ensembl
BioMart
•Database reorganised for easier data mining
How toolboxes work
Common Approaches
•Comparing two measurements at a time
•Person 1, gene G: 1000
•Person 2, gene G: 3200
•Greater than 3-fold change: flag this gene
•Comparing one measurement with a population of
measurements: is it unlikely that the new
measurement was drawn from the same distribution?
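Both comparisons above can be sketched in a few lines of Python. This is only an illustration: the function names, the reference population values, and the use of a z-score for the second comparison are assumptions made here, not something the slides prescribe. A z-score is also only meaningful when the reference population is roughly normally distributed.

```python
from statistics import mean, stdev

def fold_change(a, b):
    """Ratio of the larger measurement to the smaller one (always >= 1)."""
    if a <= 0 or b <= 0:
        raise ValueError("expression values must be positive")
    return max(a, b) / min(a, b)

def flag_by_fold_change(a, b, threshold=3.0):
    """Flag a gene when two measurements differ by more than `threshold`-fold."""
    return fold_change(a, b) > threshold

def z_score(new_value, population):
    """How many sample standard deviations `new_value` lies from the population mean."""
    return (new_value - mean(population)) / stdev(population)

# The example from the list above: gene G in two people.
person_1, person_2 = 1000, 3200
print(fold_change(person_1, person_2))          # 3.2
print(flag_by_fold_change(person_1, person_2))  # True

# Hypothetical reference population for gene G (made-up values).
reference = [900, 1000, 1100, 950, 1050]
z = z_score(person_2, reference)
print(abs(z) > 2)  # a large |z| suggests the value is unlikely under this population
```

The fold-change rule needs no population at all, which is why it is easy to apply but also easy to fool with noisy low-intensity measurements; the population comparison takes the observed spread into account.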
Approaches (Continued)
•Clustering or Unsupervised Data Mining
•Hierarchical Clustering, Self-Organizing (Kohonen) Maps
(SOMs), K-Means Clustering
•Cluster patients with similar expression patterns
•Cluster genes with similar patterns across patients or
samples (genes that go up or down together)
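As a rough illustration of clustering patients by expression pattern, here is a minimal K-means sketch in plain Python. The expression values are made up, and the deterministic "farthest-first" choice of starting centroids is a simplification chosen here so the example is reproducible; real tools usually use random or k-means++ starts and several restarts.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(profiles, k):
    """Pick k starting centroids: the first profile, then repeatedly the
    profile farthest from all centroids chosen so far."""
    centroids = [profiles[0]]
    while len(centroids) < k:
        centroids.append(
            max(profiles, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def k_means(profiles, k, iterations=50):
    """Return (cluster label per profile, final centroids)."""
    centroids = farthest_first(profiles, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # converged: no profile changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical data: 6 patients (rows) x 3 genes (columns).
patients = [
    [1.0, 1.2, 0.9], [1.1, 1.0, 1.0], [0.9, 1.1, 1.1],
    [5.0, 5.2, 4.9], [5.1, 4.8, 5.0], [4.9, 5.1, 5.2],
]
labels, centers = k_means(patients, k=2)
print(labels)
```

To cluster genes instead of patients, the same function can be applied to the transposed matrix, for example `k_means(list(zip(*patients)), k=2)`.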
Approaches (Continued)
•Classification or Supervised Data Mining.
•Use our knowledge of class values… myeloma vs. normal,
positive response vs. no response to treatment, etc., to gain
added insight.
•Find genes that are best predictors of class.
•Can provide useful tests, e.g. for choosing treatment.
•If predictor is comprehensible, may provide novel insight,
e.g., point to a new therapeutic target.
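One simple way to look for genes that predict class is to score each gene by how well it separates the two classes and then rank the genes. The sketch below uses a signal-to-noise score, (mean of class A - mean of class B) / (sd of class A + sd of class B); this choice of score, the gene names, and the expression values are assumptions for illustration, and other scores such as a t-statistic would serve the same purpose.

```python
from statistics import mean, stdev

def signal_to_noise(class_a, class_b):
    """Difference of class means divided by the sum of class standard deviations.
    Raises ZeroDivisionError if both classes have zero spread."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))

def rank_genes(expression, labels, positive):
    """Return (gene, score) pairs sorted by |score|, best separators first."""
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, lab in zip(values, labels) if lab == positive]
        neg = [v for v, lab in zip(values, labels) if lab != positive]
        scores[gene] = signal_to_noise(pos, neg)
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical data: three myeloma and three normal samples (made-up values).
sample_labels = ["myeloma"] * 3 + ["normal"] * 3
expression = {
    "gene_A": [3200, 3000, 3400, 1000, 1100, 900],  # clearly higher in myeloma
    "gene_B": [500, 520, 480, 510, 490, 505],       # about the same in both
    "gene_C": [200, 900, 400, 800, 300, 600],       # noisy, no clear pattern
}
ranked = rank_genes(expression, sample_labels, positive="myeloma")
for gene, score in ranked:
    print(gene, round(score, 2))
```

A positive score means the gene is higher in the positive class, a negative score means it is lower; ranking by absolute value keeps both kinds of predictor.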
Approaches (Continued)
•Classification or Supervised Learning.
•UC Santa Cruz: Furey et al. 2001 (support vector
machines).
•MIT Whitehead: Golub et al. 1999, Slonim et al. 2000
(voting).
•SNPs and Proteomics are coming.
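The voting approach mentioned above can be illustrated with a simplified weighted-voting sketch: each gene votes for one class, with a weight that reflects how well the gene separates the classes in the training data. This is a sketch under stated assumptions, not the published implementation: gene selection and any prediction-strength threshold are omitted, and all gene names and values are made up.

```python
from statistics import mean, stdev

def train_voters(expression, labels, class_a, class_b):
    """For each gene, compute a (weight, boundary) pair from training data.
    weight   = signal-to-noise score, positive if the gene is higher in class_a
    boundary = midpoint between the two class means"""
    voters = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        boundary = (mean(a) + mean(b)) / 2
        voters[gene] = (weight, boundary)
    return voters

def predict(voters, sample, class_a, class_b):
    """Sum the weighted votes; return (winning class, prediction strength in [0, 1])."""
    votes_a = votes_b = 0.0
    for gene, (weight, boundary) in voters.items():
        vote = weight * (sample[gene] - boundary)
        if vote > 0:
            votes_a += vote   # positive vote supports class_a
        else:
            votes_b -= vote   # negative vote supports class_b
    total = votes_a + votes_b
    winner = class_a if votes_a >= votes_b else class_b
    strength = abs(votes_a - votes_b) / total if total else 0.0
    return winner, strength

# Hypothetical training data: three myeloma and three normal samples.
train_labels = ["myeloma"] * 3 + ["normal"] * 3
train_expression = {
    "gene_A": [3200, 3000, 3400, 1000, 1100, 900],  # higher in myeloma
    "gene_D": [300, 350, 250, 900, 1000, 800],      # lower in myeloma
}
voters = train_voters(train_expression, train_labels, "myeloma", "normal")

print(predict(voters, {"gene_A": 3100, "gene_D": 300}, "myeloma", "normal"))
print(predict(voters, {"gene_A": 950, "gene_D": 950}, "myeloma", "normal"))
```

A prediction strength near 1 means the genes agree; a value near 0 means the votes are split and the call should be treated with caution.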
Outline
•Data and Task
•Supervised Learning Approaches and Results
•Tree Models and Boosting
•Support Vector Machines
•Voting
•Bayesian Networks
•Conclusions