Data Mining: Principle and Application in Biology

ShibsekharRoy1 8 views 13 slides Jun 29, 2024


Data Mining

● How can one find all the members of a human gene family?
● For a given protein, how can one determine whether it
contains any functional domains of interest?
● How does one find a gene of interest and determine that
gene's structure, and how does one easily examine other genes
in that same region?
WHAT KIND OF INFORMATION YOU ARE MINING

uses informatics and statistics
helps extract information from a
huge amount of data
now accessible to everyone
DATA MINING

Data
•Publicly-available from Lambert Lab at
http://lambertlab.uams.edu/publicdata.htm
•105 samples run on Affymetrix HuGeneFL arrays
•74 Myeloma samples
•31 Normal samples

Three main data browsers
I. University of California, Santa Cruz (UCSC) Genome Browser
(http://genome.ucsc.edu/)
II. National Center for Biotechnology Information's
Map Viewer (http://www.ncbi.nlm.nih.gov/)
III. European Molecular Biology Laboratory -
European Bioinformatics Institute: Ensembl
(http://www.ensembl.org)

I. single-query analysis (-> genome browser)
II. selection of a set of genes that meet a criterion (->
"Sister programs")
III. more in-depth analysis (-> R/Bioconductor,
biomaRt, ...)
3 levels in data mining

The genome browsers: UCSC & Ensembl
I. UCSC (University of California, Santa Cruz)
● Gene Sorter
● Table Browser
II. Ensembl
● BioMart

UCSC Gene Sorter
Explore gene families and the relationships among genes
Select genes based on several characteristics
UCSC Table Browser
Query data using the database structure
Ensembl BioMart
Database reorganised for easier data mining
How toolboxes work

Common Approaches
•Comparing two measurements at a time
•Person 1, gene G: 1000
•Person 2, gene G: 3200
•Greater than 3-fold change: flag this gene
•Comparing one measurement with a population of
measurements… is it unlikely that the new
measurement was drawn from the same distribution?
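
The two comparisons above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 3-fold default, the normal approximation for the p-value, and the small reference population are hypothetical and do not come from the slides; only the values 1000 and 3200 for gene G do.

```python
import math
import statistics

def fold_change(a, b):
    """Fold change between two positive expression values (always >= 1)."""
    if a <= 0 or b <= 0:
        raise ValueError("expression values must be positive")
    return max(a, b) / min(a, b)

def flag_by_fold_change(a, b, threshold=3.0):
    """True if the two measurements differ by more than `threshold`-fold."""
    return fold_change(a, b) > threshold

def z_score(new_value, population):
    """Distance of a new measurement from the population mean, in sample SDs."""
    mean = statistics.mean(population)
    sd = statistics.stdev(population)  # assumes at least two distinct values
    return (new_value - mean) / sd

def two_sided_p_value(z):
    """P(|Z| >= |z|) under a standard normal (an assumed approximation)."""
    return math.erfc(abs(z) / math.sqrt(2))

# Person 1 vs. person 2, gene G (values from the slide)
flagged = flag_by_fold_change(1000, 3200)

# Hypothetical reference population for gene G
reference = [900, 1000, 1100, 950, 1050]
z = z_score(3200, reference)
p = two_sided_p_value(z)
```

With so few reference values, a t-based test would usually be a better fit than the normal approximation; the sketch only shows the shape of the calculation.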

Approaches (Continued)
•Clustering or Unsupervised Data Mining
•Hierarchical Clustering, Self-Organizing (Kohonen) Maps
(SOMs), K-Means Clustering
•Cluster patients with similar expression patterns
•Cluster genes with similar patterns across patients or
samples (genes that go up or down together)
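
As a rough illustration of clustering patients by expression pattern, here is a minimal K-means sketch in plain Python. The patient names, expression values, and the naive "first k points" initialisation are assumptions made for the example; real analyses typically use an established library and a better initialisation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iter=100):
    """Very small K-means: returns (cluster index per point, centroids)."""
    # Assumption: use the first k points as starting centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical patients, each with expression values for three genes
profiles = {
    "P1": [1.0, 1.2, 0.9],
    "P2": [5.0, 5.1, 4.8],
    "P3": [1.1, 0.9, 1.0],
    "P4": [4.9, 5.2, 5.0],
}
labels, centers = k_means(list(profiles.values()), k=2)
clusters = dict(zip(profiles, labels))
```

To cluster genes instead of patients, the same function can be applied to the transposed matrix, so that each point is one gene's values across all samples.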

Approaches (Continued)
•Classification or Supervised Data Mining.
•Use our knowledge of class values… myeloma vs. normal,
positive response vs. no response to treatment, etc., to gain
added insight.
•Find genes that are best predictors of class.
•Can provide useful tests, e.g. for choosing treatment.
•If predictor is comprehensible, may provide novel insight,
e.g., point to a new therapeutic target.
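
One simple way to look for genes that are good predictors of class is to score each gene by how well it separates the two classes. The sketch below uses a signal-to-noise score (difference of class means divided by the sum of class standard deviations). The choice of score, the gene names, and the expression values are assumptions for illustration, not results from the myeloma data set.

```python
import statistics

def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b); assumes the SDs are not both zero."""
    return (statistics.mean(class_a) - statistics.mean(class_b)) / (
        statistics.stdev(class_a) + statistics.stdev(class_b)
    )

def rank_genes(expression, labels, positive):
    """Rank genes by absolute signal-to-noise score, best separator first.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of class labels, in the same sample order
    positive:   label treated as class A; all other samples form class B
    """
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, lab in zip(values, labels) if lab == positive]
        neg = [v for v, lab in zip(values, labels) if lab != positive]
        scores[gene] = signal_to_noise(pos, neg)
    ranking = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranking, scores

# Hypothetical toy data: three myeloma and three normal samples
labels = ["myeloma"] * 3 + ["normal"] * 3
expression = {
    "geneA": [10, 11, 9, 2, 3, 1],  # high in myeloma
    "geneB": [5, 6, 4, 5, 4, 6],    # no class difference
    "geneC": [1, 2, 3, 8, 9, 7],    # high in normal
}
ranking, scores = rank_genes(expression, labels, positive="myeloma")
```

A positive score means the gene is higher in the positive class, a negative score means it is higher in the other class, and a score near zero means the gene carries little class information.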

Approaches (Continued)
•Classification or Supervised Learning.
•UC Santa Cruz: Furey et al. 2000 (support vector
machines).
•MIT Whitehead: Golub et al. 1999, Slonim et al. 2000
(voting).
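
The voting idea can be sketched as follows: each gene casts a weighted vote for one class, depending on which side of the class midpoint the new sample falls. This is a simplified sketch in the spirit of weighted voting, not the published implementation; the function names, gene names, and values are hypothetical, and no gene-selection step is included.

```python
import statistics

def train_voting(expression, labels, class_a, class_b):
    """Per gene, compute a weight and a decision boundary from training data."""
    model = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        mu_a, mu_b = statistics.mean(a), statistics.mean(b)
        # Weight: signal-to-noise score (assumes the SDs are not both zero).
        weight = (mu_a - mu_b) / (statistics.stdev(a) + statistics.stdev(b))
        boundary = (mu_a + mu_b) / 2  # midpoint between the class means
        model[gene] = (weight, boundary)
    return model

def predict(model, sample, class_a, class_b):
    """Sum the per-gene votes; return (winning class, prediction strength)."""
    votes_a = votes_b = 0.0
    for gene, (weight, boundary) in model.items():
        vote = weight * (sample[gene] - boundary)
        if vote > 0:
            votes_a += vote   # positive vote supports class_a
        else:
            votes_b -= vote   # negative vote supports class_b
    total = votes_a + votes_b
    winner = class_a if votes_a >= votes_b else class_b
    strength = abs(votes_a - votes_b) / total if total else 0.0
    return winner, strength

# Hypothetical training data: three myeloma and three normal samples
labels = ["myeloma"] * 3 + ["normal"] * 3
expression = {
    "geneA": [10, 11, 9, 2, 3, 1],
    "geneC": [1, 2, 3, 8, 9, 7],
}
model = train_voting(expression, labels, "myeloma", "normal")
call_1 = predict(model, {"geneA": 9, "geneC": 2}, "myeloma", "normal")
call_2 = predict(model, {"geneA": 2, "geneC": 8}, "myeloma", "normal")
```

The prediction strength ranges from 0 (votes evenly split) to 1 (all genes agree), so a threshold on it can be used to withhold a call for uncertain samples.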
•SNPs and Proteomics are coming.

Outline
•Data and Task
•Supervised Learning Approaches and Results
•Tree Models and Boosting
•Support Vector Machines
•Voting
•Bayesian Networks
•Conclusions