phd ppt2 sample reference download1.pptx

ArumugamP26 37 views 56 slides Apr 27, 2024

About This Presentation

ref dpoc2


Slide Content

Hierarchical information representation and efficient classification of gene expression microarray data
PhD candidate: Mattia Bosio
Advisors: Philippe Salembier, Albert Oliveras Vergés
27/06/2014 Mattia Bosio PhD thesis defense

Thesis objective
Develop algorithms for microarray classification, targeting:
Predictive performance
Results stability
Biological interpretability

Roadmap
1- Microarrays
2- Challenges & Opportunities
3- Contributions
4- How did we get there?
5- Conclusions

1- MICROARRAYS

A platform to measure gene expression
Gives a picture of the whole cellular state
Thousands of parallel measurements
Measures how much each gene is being used
Can be used to discriminate between populations

Microarrays: what do they measure?

Microarrays: what do they look like?
45,000 'genes', 72 samples

2- CHALLENGES & OPPORTUNITIES

Challenges
Lack of structure
Noise
Sample size vs. dimensions: 45,000 'genes', 72 samples

Opportunities
Established tool for research, but no optimal classification algorithm yet
Machine learning has already been used, with good results that can be improved
Signal processing has dealt with similar problems

3- CONTRIBUTIONS

Two-step classification framework
Diagram: Genes -> Feature set enhancement (Metagenes) -> Feature selection -> Classifier -> Class estimations, with train data and validation data as inputs
Contributions mapped onto the framework:
1. Metagenes
2. IFFS
3. Ensemble
4. Knowledge integration
5. Multiclass algorithm

Contributions
Metagenes are helpful for classification
Tailored IFFS algorithm improves the state of the art
Ensemble learning proof of concept led to interesting results
Knowledge integration framework improves interpretability and robustness
OAA+PAA is a valid multiclass algorithm

4- HOW DID WE GET THERE?

Index for this section

4.1 Feature set enhancement
A structure is inferred from the data and new metagenes are created.

Feature set enhancement
Addresses noise and lack of structure
A binary tree is inferred
Each node is a new feature
New features are called metagenes
Metagenes reduce noise by clustering similar genes

Feature set enhancement: the iterative process of metagene generation
Iterative process based on Treelets [1]
At each iteration, the two most similar features are replaced by a metagene
Two key elements: similarity metric and metagene generation algorithm
Loop: find the most similar feature pair in the feature set -> build a metagene from the two features -> remove the two features from the set and add the metagene
[1] A. B. Lee, B. Nadler, L. Wasserman, Treelets - an adaptive multi-scale basis for sparse unordered data, Annals of Applied Statistics 2 (2) (2008) 435-471.

4.2 Feature selection: IFFS
How to select the right features to discriminate between classes with an iterative wrapper algorithm

IFFS: find the few best features to classify
"Improved Sequential Floating Forward Selection (IFFS)" [2]:
Sequential, deterministic wrapper algorithm
Flexible method: at each iteration, decide whether to add, delete or substitute a feature
Alternatives are compared by a J(·) score
[2] S. Nakariyakul, D. Casasent, An improvement on floating search algorithms for feature subset selection, Pattern Recognition.

IFFS: find the few best features to classify
Deterministic sequential wrapper algorithm
All decisions are determined by a J(·) score
Usually J(·) is an error rate estimate
Ties are frequent because samples are scarce
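To make the add / delete / substitute decisions concrete, here is a simplified floating-search sketch. It is not the exact procedure of [2] nor the tailored version from the thesis: the acceptance rule (a step is kept only if it strictly improves the best score seen so far for that subset size) and the function name `floating_search_sketch` are assumptions for illustration. `J` is any callable that scores a feature subset, higher being better.

```python
def floating_search_sketch(features, J, max_size):
    """Simplified floating forward selection with add, delete and substitute.

    features: list of candidate feature identifiers.
    J: callable taking a list of features, returning a score (higher = better).
    max_size: largest subset size to explore (must be >= 1).
    """
    selected = []
    best = {}  # subset size -> (score, subset)

    def record(subset):
        # Keep a subset only if it strictly beats the best of the same size.
        score, k = J(subset), len(subset)
        if k not in best or score > best[k][0]:
            best[k] = (score, list(subset))
            return True
        return False

    while len(selected) < max_size:
        pool = [f for f in features if f not in selected]
        if not pool:
            break
        # Add: include the feature that maximizes J.
        selected = selected + [max(pool, key=lambda f: J(selected + [f]))]
        record(selected)
        improved = True
        while improved and len(selected) > 2:
            improved = False
            # Delete: try removing the least useful feature.
            f_del = max(selected, key=lambda f: J([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_del]
            if record(reduced):
                selected, improved = reduced, True
                continue
            # Substitute: try swapping one selected feature for an unused one.
            for f_out in selected:
                rest = [g for g in selected if g != f_out]
                unused = [f for f in features if f not in selected]
                if not unused:
                    break
                swapped = rest + [max(unused, key=lambda f: J(rest + [f]))]
                if record(swapped):
                    selected, improved = swapped, True
                    break
    # Return the best-scoring subset, preferring smaller subsets on ties.
    k_best = max(best, key=lambda k: (best[k][0], -k))
    return best[k_best][1]
```

Because every accepted delete or substitute step must strictly improve a recorded score, the loops terminate; with an error-rate-based J, ties would leave many such steps undecided.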

J(·) score tailored for microarrays
The J(·) score depends on 2 parameters: error rate and reliability
A reliability measure is used to break ties in J(·)
Three rules define the score by combining error rate and reliability:
Lexicographic sorting
Exponential penalization
Linear combination
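As an illustration of the first rule, lexicographic sorting can be expressed as a tuple comparison: candidates are ordered by error rate first, and reliability only decides between candidates with equal error rates. The sketch below assumes that a lower error rate and a higher reliability are better; the function names and the candidate tuples are hypothetical. The exponential and linear rules are not sketched because the slide does not give their formulas.

```python
def lexicographic_key(error_rate, reliability):
    """Sort key: lower error rate wins; reliability breaks ties (higher wins)."""
    return (-error_rate, reliability)

def best_candidate(candidates):
    """candidates: list of (name, error_rate, reliability) tuples."""
    return max(candidates, key=lambda c: lexicographic_key(c[1], c[2]))

# Two subsets tie on error rate; the more reliable one is preferred.
candidates = [("subset_a", 0.10, 0.3), ("subset_b", 0.10, 0.7), ("subset_c", 0.20, 0.9)]
winner = best_candidate(candidates)
```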

IFFS: experimental setup
Datasets from MAQC study phase II [4]
7 datasets with hundreds of samples
30,000+ models evaluated
Independent validation sets available
Common evaluation procedure
[4] L. Shi, et al., The microarray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models, Nature Biotechnology 28 (2010) 827-38.

IFFS: experimental setup
Diagram: Genes -> Feature set enhancement (Metagenes) -> IFFS -> Classifier -> Class estimations, with train data and validation data as inputs

IFFS: experiment objectives
Evaluate whether metagenes are useful
Benchmark against the state of the art
Comparison following the MAQC standard: Matthews Correlation Coefficient
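For reference, the Matthews Correlation Coefficient for a binary problem can be computed from the confusion matrix counts as below. The function name `mcc` is illustrative, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated on the slide.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    0 corresponds to chance-level prediction.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

Unlike plain accuracy, the MCC stays informative when the two classes have very different sizes.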

Results: metagenes are useful
Introducing metagenes gives better results

The proposed framework improves state-of-the-art results

Observations
The proposed framework works with both its key elements
Metagenes are useful (contrib #1)
IFFS adapted to microarrays improves the state of the art (contrib #2)

4.3 Feature selection: Ensemble
How to select the right features to discriminate between classes with a novel ensemble learning algorithm

Ensemble learning - voting scheme
An ensemble combines experts with a voting scheme
One expert for each available feature (1 expert = 1 feature)
Expert = trained classifier output on the analyzed data
Feature selection becomes an expert subset selection problem

Accuracy In Diversity [7]: the original algorithm
Starts with p experts, one for each feature
Sequentially removes the expert with the worst error rate on a subset S
In [6], a simpler version is defined: the Kun algorithm
[6] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[7] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "A new ensemble diversity measure applied to thinning ensembles," in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, T. Windeatt and F. Roli, Eds., vol. 2709. Springer, 2003, pp. 306-316.

Accuracy In Diversity: the original algorithm
PCDM(d) = % of experts correctly classifying sample i
S: set formed of samples with [condition missing in source]
The expert with the worst error rate on S is excluded
Figure labels: EXPERTS, SAMPLES, PCDM (90%, 50%, 80%, 100%, 100%), VOTE, AID, Kun
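One thinning step of this kind can be sketched as follows. The condition that defines S did not survive in the transcript, so the bounds `low` and `high` below are hypothetical placeholders, as are the function name and the fallback used when S is empty. The sketch only shows the mechanics: compute PCDM per sample, restrict to S, and drop the expert with the worst error rate on S.

```python
import numpy as np

def thin_one_expert(correct, low, high):
    """Pick the expert to exclude in one thinning iteration.

    correct: array of shape (n_experts, n_samples), True where an expert
             classifies a sample correctly.
    low, high: hypothetical PCDM bounds selecting the subset S
               (the actual bounds are not given in the slides).
    Returns (index of the expert to exclude, PCDM per sample).
    """
    correct = np.asarray(correct, dtype=bool)
    pcdm = correct.mean(axis=0)            # fraction of experts correct per sample
    in_S = (pcdm > low) & (pcdm < high)    # samples the ensemble is unsure about
    if not in_S.any():
        in_S = np.ones_like(in_S)          # assumed fallback: use every sample
    error_on_S = 1.0 - correct[:, in_S].mean(axis=1)
    return int(np.argmax(error_on_S)), pcdm
```

Repeating this step until the desired number of experts remains gives a sequential thinning loop of the kind the slide describes.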

Adaptations to microarrays
Nonexpert: exclude experts unable to find 2 classes in the training set
Metagenes: included as experts
Tie-break rule: the expert higher in the tree is excluded

Ensemble: experiment objectives
Comparison between the AID and Kun ensemble algorithms
Benchmark against the state of the art
Comparison following the MAQC standard: Matthews Correlation Coefficient

Ensemble algorithms improve the state of the art
Both algorithms improve the state of the art
The simpler Kun algorithm is the best option

Observations
Ensemble learning feature selection led to encouraging results
The proposed ensemble learning improves the state of the art (contrib #3)
Tailoring the algorithm to the data benefits the results

4.4 Knowledge integration
Introducing prior biological knowledge to improve the metagene generation phase. The aim is to obtain more robust performance and more biologically interpretable gene selections.

Integration of external biological data when producing metagenes
Diagram: Genes -> Feature set enhancement (New metagenes, informed by biological knowledge such as MSigDb) -> Feature selection -> Classifier -> Class estimations, with train data and validation data as inputs

Objectives of this section
Define measures to quantify biological similarity
Develop ways to integrate both sources of information: numerical correlation and biological similarity
Benchmarking: predictive power | results stability | biological interpretability

Distances and merging algorithms
4 similarity metrics studied: Goodall | Smirnov | NoisyOR | Anderberg
2 criteria to merge numerical and biological information: average | pdf equalization

Experimental setup
7 MAQC datasets
50-run Monte Carlo experiments
Novel scoring system integrating numerical results and biological analysis tools

Comparative scoring system
Predictive performance: computed from MCC values and ranked in decreasing order [symbol missing in source], best first
Biological analysis: 4 parallel analysis tools (GSEA | Biograph | Genie | Enrichr) give 4 parallel rankings, which are averaged
Final score = rank average
The best algorithm has the smallest final score
Example values on the slide: predictive rank 1; biological ranks 1, 3, 6, 2 (average 3); final score 2
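The scoring scheme can be sketched as below, assuming the final score is the plain average of the predictive rank and the averaged biological rank. That equal weighting is an inference from the numbers on the slide (1; 1, 3, 6, 2; 3; 2), not an explicit statement. The function names are illustrative, and ties in MCC are broken by input order here.

```python
import numpy as np

def rank_descending(values):
    """Rank 1 goes to the largest value; ties are broken by input order."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def final_scores(mcc_values, bio_ranks):
    """Combine predictive and biological rankings (smaller = better).

    mcc_values: one MCC-based score per algorithm.
    bio_ranks: array of shape (n_tools, n_algorithms), one ranking per
               biological analysis tool.
    """
    predictive_rank = rank_descending(mcc_values)
    bio_rank = np.mean(np.asarray(bio_ranks, dtype=float), axis=0)
    return (predictive_rank + bio_rank) / 2.0
```

The algorithm with the smallest entry in the returned array is the preferred one.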

Predictive power scoring & ranking shows G_pdf as the best solution
Figure: biological analysis, predictive rank and final score for the pdf_equalization and average merging criteria

Compared with the state of the art, G_pdf is confirmed as the best alternative
Figure labels: Final Score, IFFS

Observations about knowledge integration
Improved results in terms of stability and interpretability
Goodall similarity with the pdf-equalization scheme is the best way to integrate prior databases
G_pdf performance is also confirmed against state-of-the-art alternatives (contrib #4)

4.5 Multiclass classification
Study of a novel algorithm for multiclass classification applying coding theory on multiple binary classifiers

Multiclass approach combining multiple binary classifiers
Common methods like One Against All (OAA) or One Against One (OAO) can be improved
Information coding -> good results [119]
Propose a novel approach with ECOC ideas
[119] E. Tapia, L. Ornella, P. Bulacio, and L. Angelone. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics 2011.

Our proposal: OAA+PAA
Choice to combine several experts:
OAA = one classifier per class
PAA = one classifier separating each class pair
Expert = bit in a codeword
Class estimation by distance to reference words
Figure: code matrix for N = 4 classes and M binary classifiers h1, h2, ..., hM
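A sketch of the codeword idea follows, under one assumed reading of PAA (Pair Against All): each pair classifier outputs 1 for the two classes in its pair and 0 for all others. Under that reading, N = 4 classes give M = 4 + 6 = 10 classifiers and 16 ones in the code matrix, which matches the sixteen 1s in the slide residue. Hamming distance is assumed as the decoding distance, and the function names are illustrative.

```python
from itertools import combinations
import numpy as np

def oaa_paa_codebook(n_classes):
    """Reference codewords: one row per class, one column per binary classifier.

    Columns 0..N-1 are the OAA classifiers (class k against all others).
    The remaining columns are the PAA classifiers, one per class pair,
    assumed to output 1 for both classes of the pair.
    """
    pairs = list(combinations(range(n_classes), 2))
    code = np.zeros((n_classes, n_classes + len(pairs)), dtype=int)
    for c in range(n_classes):
        code[c, c] = 1
    for j, (a, b) in enumerate(pairs):
        code[[a, b], n_classes + j] = 1
    return code

def decode(bits, code):
    """Return the class whose reference word is closest in Hamming distance."""
    distances = np.sum(code != np.asarray(bits), axis=1)
    return int(np.argmin(distances))
```

With this codebook any two reference words for N = 4 differ in 6 bits, so up to two wrong binary classifiers can be tolerated without changing the decoded class.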

Experiments on 7 public datasets
Binary classifiers trained with Treelet + IFFS
Compared with OAA, OAO and state-of-the-art alternatives [119]
50-run Monte Carlo of 4:1 cross-validation

Average accuracy
OAA+PAA is better than OAA, OAO and state-of-the-art alternatives
Figure: accuracy

Observations about OAA+PAA
It consistently outperforms the OAA and OAO algorithms
It obtains better accuracy than the state-of-the-art alternatives from [119]
OAA+PAA is a valid multiclass algorithm (contrib #5)

5- CONCLUSIONS

Two-step approach is the main contribution
Feature set enhancement: addresses lack of structure and noise
Feature selection & classification: choose the best variables among the thousands available, with new algorithms

Validated contributions
Metagenes are helpful for classification
Tailored IFFS algorithm improves the state of the art
Ensemble learning algorithm led to interesting results
Knowledge integration framework improves interpretability and robustness
OAA+PAA is a valid multiclass algorithm

Publications
Bosio M, Bellot P, Salembier P, Oliveras A. "Gene Expression Data Classification Combining Hierarchical Representation and Efficient Feature Selection". Journal of Biological Systems. 2012;20:349-375.
Bosio M, Bellot P, Salembier P, Oliveras A. "Feature set enhancement via hierarchical clustering for microarray classification". IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS 2011; 2011. pp. 226-229.
Bosio M, Bellot P, Salembier P, Oliveras A. "Microarray classification with hierarchical data representation and novel feature selection criteria". In: IEEE 12th International Conference on BioInformatics and BioEngineering. Larnaca, Cyprus; 2012.
Bosio M, Bellot P, Salembier P, Oliveras A. "Multiclass cancer microarray classification algorithm with Pair-Against-All redundancy". In: The 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS'12). Washington, DC, USA; 2012.
Bosio M, Salembier P, Bellot P, Oliveras A. "Hierarchical clustering combining numerical and biological similarities for gene expression data classification". 35th Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13). Osaka, Japan; 07/2013.
Bosio M, Salembier P, Oliveras A, Bellot P. "Ensemble feature selection and hierarchical data representation for microarray classification". In: 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE). Chania, Crete; 2013.
Topic labels on the slide: IFFS, KUN, BIOINFO, MCLASS, METAGENES

Future research directions
Study a better use of the tree structure
Integrate more information sources
Deepen knowledge of ensemble learning
Study applicability to Next Generation Sequencing analysis or other 'omics' platforms