PRIDE-ProteomeXchange

JuanAntonioVizcaino 601 views 70 slides Dec 17, 2015

About This Presentation

Lecture of the PRIDE resources and ProteomeXchange for the Wellcome Trust Proteomics Bioinformatics Course 2015


Slide Content

PRIDE resources and ProteomeXchange. Dr. Juan Antonio Vizcaíno, PRIDE Group Coordinator, Proteomics Services Team, EMBL-EBI, Hinxton, Cambridge, UK

Data resources at EMBL-EBI (figure).
Categories shown: Genes, genomes & variation; Gene, protein & metabolite expression; Protein sequences, families & motifs; Molecular structures; Chemical biology; Reactions, interactions & pathways; Systems; Literature & ontologies.
Resources shown: RNA Central, ArrayExpress, Expression Atlas, MetaboLights, PRIDE, InterPro, Pfam, UniProt, ChEMBL, ChEBI, Protein Data Bank in Europe, Electron Microscopy Data Bank, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, IntAct, Reactome, BioModels, Enzyme Portal, BioSamples, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, Europe PubMed Central, Gene Ontology, Experimental Factor Ontology.

Overview: PRIDE Archive (in the context of ProteomeXchange and the PSI standards). How to submit data to PRIDE: PRIDE tools. How to access data in PRIDE Archive. PRIDE Cluster and PRIDE Proteomes.

ProteomeXchange Consortium. Goal: development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories. Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego). Common identifier space (PXD identifiers). Two supported data workflows: MS/MS and SRM. Main objective: make life easier for researchers. http://www.proteomexchange.org

PRIDE (PRoteomics IDEntifications) database. PRIDE stores mass spectrometry (MS)-based proteomics data: peptide and protein expression data (identification and quantification); post-translational modifications; mass spectra (raw data and peak lists); technical and biological metadata; any other related information. Full support for tandem MS approaches. http://www.ebi.ac.uk/pride/archive Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2013

PRIDE Mission. To archive all types of proteomics mass spectrometry data for the purpose of supporting reproducible research, allowing the application of quality control metrics and enabling the reuse of these data by other researchers. To integrate MS-based data in a protein-centric manner to provide information on protein variants, modifications, and expression. To provide mass spectrometry based expression data to the Expression Atlas.

Data content in PRIDE Archive. Submission-driven resource. PRIDE is split into datasets (groups of assays). An assay represents one MS run (in most cases). No data reprocessing at present: PRIDE aims to represent the author's view on the data. Supported formats: PRIDE XML and mzIdentML. Raw data are now also stored.

What is a proteomics publication in 2015? Proteomics studies generate potentially large amounts of data and results. Ideally, a proteomics publication needs to: summarize the results of the study; provide supporting information for the reliability of any results reported. Information in a publication: manuscript; supplementary material; associated data submitted to a public repository.

Journal Submission Recommendations. Journal guidelines recommend submission to proteomics repositories: Proteomics, Nature Biotechnology, Nature Methods, Molecular and Cellular Proteomics. Funding agencies are enforcing public deposition of data to maximize the value of the funds provided.

PRIDE: Source of MS proteomics data. PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the Expression Atlas. http://www.ebi.ac.uk/pride

ProteomeXchange data workflow (figure; Vizcaíno et al., Nat Biotechnol, 2014). Elements shown: researcher's results, raw data* and metadata/manuscript; receiving repositories PRIDE (MS/MS data), MassIVE (MS/MS data) and PASSEL (SRM data); ProteomeCentral (metadata); reprocessed results; journals, UniProt/neXtProt, PeptideAtlas, GPMDB and other DBs.

Overview: PRIDE Archive (in the context of ProteomeXchange and the PSI standards). How to submit data to PRIDE: PRIDE tools. How to access data in PRIDE Archive. A sneak peek at other PRIDE resources.

Complete vs Partial submissions: processed results. For complete submissions, it is possible to connect the spectra with the processed identification results, and they can be visualized.

Complete vs Partial submissions: experimental metadata. General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.

How to perform a complete PX submission to PRIDE: (1) Decide between a complete and a partial submission. (2) File conversion/export to PRIDE XML or mzIdentML. (3) File check before submission (PRIDE Inspector). (4) Experimental annotation and actual file submission (PX submission tool). (5) Post-submission steps.

PX Data workflow for MS/MS data.
Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
Result files. Complete submissions: result files can be converted to PRIDE XML or the mzIdentML data standard. Partial submissions: for workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
Metadata: sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
Other files, optional (the list can be extended): QUANT: quantification related results; PEAK: peak list files; GEL: gel images; FASTA; SP_LIBRARY; OTHER: any other file type.
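As a rough illustration of the file categories listed above, the sketch below assigns a category label to a file name based on its extension. The extension table is an assumption made for this example, and the labels RAW and RESULT are hypothetical names for the raw data and result files; only QUANT, PEAK, GEL, FASTA, SP_LIBRARY and OTHER are named on the slide. It is not the logic of the PX submission tool.

```python
from pathlib import Path

# Hypothetical extension-to-category table (an assumption for illustration).
# "RAW" and "RESULT" are assumed labels for raw data and result files.
EXTENSION_TO_CATEGORY = {
    ".raw": "RAW",       # binary mass spectrometer output
    ".mzid": "RESULT",   # mzIdentML result file
    ".mzml": "PEAK",     # peak list spectra in a standardized format
    ".mzxml": "PEAK",
    ".mgf": "PEAK",
    ".fasta": "FASTA",
}


def categorize(filename: str) -> str:
    """Return an assumed category label for a file, defaulting to OTHER."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_CATEGORY.get(suffix, "OTHER")


if __name__ == "__main__":
    for name in ["run01.RAW", "run01.mzid", "run01.mgf", "gel_scan.png"]:
        print(name, "->", categorize(name))
```

Anything not in the table falls back to OTHER, which mirrors the slide's catch-all category.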

PRIDE Components: Submission Process PRIDE Converter 2 PRIDE Inspector PX Submission Tool mzIdentML PRIDE XML 1

'RESULT' file generation. Before: only file conversion to PRIDE XML. Original data files (search output files and spectra files) -> file conversion with PRIDE Converter or other tools, e.g. hEIDI -> final 'RESULT' file (PRIDE XML). Barsnes et al., Nat Biotechnol, 2009; Cote et al., MCP, 2012

'RESULT' file generation. Now: native file export to mzIdentML. Tools (Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others) -> native file export -> final 'RESULT' file (mzIdentML), together with the spectra files (mzML, mzXML, mzData, mgf, pkl, ms2, dta, apl).

Complete submissions: Search Engine Results + MS files. Search engines and tools with mzIdentML export: Mascot; MSGF+; MyriMatch and related tools from D. Tabb's lab; OpenMS; PEAKS; PeptideShaker; ProCon (ProteomeDiscoverer, Sequest); Scaffold; TPP via the idConvert tool (ProteoWizard); ProteinPilot (from version 5.0); X!Tandem native conversion (Beta, PILEDRIVER); Crux; others: library for X!Tandem conversion, lab internal pipelines, … An increasing number of tools support export to mzIdentML 1.1. Referenced spectral files need to be submitted as well (all open formats are supported). Updated list: http://www.psidev.info/tools-implementing-mzIdentML#

PRIDE Components: Submission Process PRIDE Converter 2 PRIDE Inspector PX Submission Tool mzIdentML PRIDE XML 2

PRIDE Inspector Toolsuite. Wang et al., Nat. Biotechnology, 2012; Perez-Riverol et al., MCP, 2016, in press. PRIDE Inspector 2 supports: PRIDE XML; mzIdentML + all types of spectra files; mzML; mzTab (identification and quantification). https://github.com/PRIDE-Toolsuite/

PRIDE Inspector 2: new visualisation functionality for protein groups. https://github.com/PRIDE-Toolsuite/

PRIDE Components: Submission Process PRIDE Converter 2 PRIDE Inspector PX Submission Tool mzIdentML PRIDE XML 3

PX submission tool. Captures the mappings between the different types of files. Makes the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP). http://www.proteomexchange.org/submission Command line alternative: using the Aspera file transfer protocol.

PX submission tool: screenshots

Fast file transfer with Aspera. Aspera is the default file transfer protocol to PRIDE (PX Submission tool and command line). Up to 50X faster than FTP. File transfer speed should not be a problem!

Manuscript published detailing the process: Ternent et al., Proteomics, 2014. http://www.proteomexchange.org/submission Example dataset: PXD000764. Title: "Discovery of new CSF biomarkers for meningitis in children". 12 runs: 4 controls and 8 infected samples. Identification and quantification data.

PRIDE Archive: Number of submitted datasets in 2015

ProteomeXchange: 2,774 datasets up until 1st September 2015.
Type: 1681 PRIDE partial; 813 PRIDE complete; 173 MassIVE; 84 PeptideAtlas/PASSEL complete; 23 reprocessed.
Publicly accessible: 1372 datasets, 49% of all (90% PRIDE, 6% PASSEL, 4% MassIVE).
Data volume: total ~150 TB; number of all files ~400,000; PXD000320-324 ~4 TB; PXD002319-26 ~2.4 TB; PXD001471 ~1.6 TB.
Datasets/year: 2012: 102; 2013: 527; 2014: 963; 2015: 1182.
Top species studied by at least 20 datasets: 1080 Homo sapiens; 335 Mus musculus; 110 Saccharomyces cerevisiae; 98 Arabidopsis thaliana; 75 Rattus norvegicus; 58 Escherichia coli; 29 Bos taurus; 23 Glycine max; 20 Caenorhabditis elegans; 20 Oryza sativa. ~500 species in total.
Origin: 714 USA; 313 Germany; 252 United Kingdom; 163 China; 146 France; 121 Netherlands; 108 Switzerland; 103 Canada; 81 Denmark; 73 Spain; 68 Japan; 67 Australia; 63 Sweden; 57 Belgium; 43 Austria; 39 India; 34 Taiwan; 33 Norway; 26 Italy; 24 Ireland; 24 Finland; 21 Republic of Korea; 20 Brazil; 20 Russia; 18 Israel; 18 Singapore; …
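The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the numbers quoted on the slide; the variable names are made up for this example.

```python
# Dataset counts by submission type, as quoted on the slide.
by_type = {
    "PRIDE partial": 1681,
    "PRIDE complete": 813,
    "MassIVE": 173,
    "PeptideAtlas/PASSEL complete": 84,
    "Reprocessed": 23,
}

# Dataset counts per year, as quoted on the slide.
per_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1182}

total_by_type = sum(by_type.values())
total_by_year = sum(per_year.values())

# Share of publicly accessible datasets (1372 on the slide).
public_datasets = 1372
public_share = public_datasets / total_by_type

if __name__ == "__main__":
    print("total by type:", total_by_type)
    print("total by year:", total_by_year)
    print(f"public share: {public_share:.1%}")
```

Both breakdowns add up to the 2,774 datasets stated in the slide title, and 1372 of 2774 rounds to the quoted 49%.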

Public data release: when does it happen? When the authors tell us to do it (the authors can also do it by themselves). When we find out that a dataset has been published: we look for PXD identifiers in PubMed abstracts. If your PXD identifier is not in the abstract, the paper may have been published while the data are still private. Let us know! There is a new web form on the PRIDE website to facilitate the process.
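Looking for PXD identifiers in abstract text can be sketched with a regular expression. This is a minimal illustration, not the PRIDE team's actual pipeline; it assumes identifiers have the form "PXD" followed by six digits, as in the examples quoted in this talk (e.g. PXD000764).

```python
import re

# Assumed identifier shape: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text: str) -> list[str]:
    """Return the distinct PXD identifiers found in text, in order of appearance."""
    found = []
    for match in PXD_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


if __name__ == "__main__":
    abstract = (
        "The data have been deposited to the ProteomeXchange Consortium "
        "with the dataset identifier PXD000764."
    )
    print(find_pxd_identifiers(abstract))
```

An abstract that does not mention the identifier yields an empty list, which is exactly the case where the slide asks authors to get in touch.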

Partial submissions can be used to store other data types. Everything can be stored, not only MS/MS data: a very flexible mechanism to be able to capture all types of datasets. Examples: top-down proteomics datasets; mass spectrometry imaging datasets; data independent acquisition techniques, e.g. SWATH-MS datasets. PRIDE does not store SRM data (it goes to PASSEL).

A public repository for mass spectrometry imaging data (Römpp et al., 2015). Figure: panel C, from the original publication [13]; panel D, reconstructed from ProteomeXchange data. Workflow shown: Thermo RAW data / UDP (Mirion Software, JLU) -> convert to imzML -> upload to PRIDE (EBI, Cambridge, UK) -> download from PRIDE -> display in MSiReader. Vendor-independent data format. Freely available software (open source). 'Open data': free to reuse. Anybody can do this! No file size limit!

Overview: PRIDE Archive (in the context of ProteomeXchange and the PSI standards). How to submit data to PRIDE: PRIDE tools. How to access data in PRIDE Archive. PRIDE Cluster and PRIDE Proteomes.

Data access to PRIDE Archive. Look for particular datasets of interest. For data reuse: which particular proteins and peptides (including PTMs) have been detected. Data reinterpretation or re-analysis. Validation of the experimental results reported. Specific use cases for proteomics: spectral libraries, fragmentation models, SRM transitions, …

ProteomeCentral: Portal for all PX datasets. http://proteomecentral.proteomexchange.org/cgi/GetDataset
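A link to a single dataset page can be put together from the portal address above and a PXD identifier. The sketch below assumes the portal accepts the identifier through a query parameter named ID; that parameter name is an assumption and is not stated on the slide. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Portal address as given on the slide.
PROTEOMECENTRAL_GETDATASET = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(pxd_identifier: str) -> str:
    """Build a dataset URL, assuming an 'ID' query parameter (not confirmed by the slide)."""
    return f"{PROTEOMECENTRAL_GETDATASET}?{urlencode({'ID': pxd_identifier})}"


if __name__ == "__main__":
    print(dataset_url("PXD000764"))
```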

RSS feed for public datasets http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml

Ways to access data in PRIDE Archive: PRIDE web interface; file repository; REST web service; PRIDE Inspector tool.

PRIDE Archive web interface

PRIDE Archive web interface (2)

CompOmics Open Source Analysis Pipeline. SearchGUI: https://github.com/compomics/searchgui (Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L: Proteomics 2011;11(5):996-9). PeptideShaker: https://github.com/compomics/peptide-shaker (Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H: Nature Biotechnology 2015;33(1):22-24).

Reshake PRIDE data! Find the desired PRIDE project … inspect the project details … and start re-analyzing the data!

Overview: PRIDE Archive (in the context of ProteomeXchange and the PSI standards). How to submit data to PRIDE: PRIDE tools. How to access data in PRIDE Archive. PRIDE Cluster and PRIDE Proteomes.

PRIDE resources

PRIDE resources (figure). Elements shown: PRIDE Archive (original submissions, reprocessed datasets); basic QC checks for PSMs; aggregation; PRIDE Cluster; PRIDE Proteomes; link to the original evidence for original results.

Sneak peek. Provide an aggregated and QC-filtered peptide-centric and protein-centric view on PRIDE Archive data. http://www.ebi.ac.uk/pride/cluster/ http://wwwdev.ebi.ac.uk/pride/proteomes/

PRIDE Cluster - Concept. Use spectral clustering to reliably group spectra coming from the same peptide. Infer reliable identifications by comparing submitted identifications of spectra within a cluster. Increases quality through data increase (taking advantage of the wealth of data in PRIDE). Inherently adapts to new (labelling) techniques. Griss et al., Nat Methods, 2013

PRIDE Cluster - Concept (Griss et al., Nat Methods, 2013). Figure: originally submitted identified spectra (labelled NMMAACDPR or PPECPDFDPPR) are clustered and a consensus spectrum is built. Threshold: at least 10 spectra in a cluster and ratio >70%.
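The threshold on this slide can be written down as a small check. This is a sketch under one stated assumption: "ratio" is read here as the fraction of spectra in a cluster that carry the most common submitted peptide identification. The function name and signature are made up for this example and are not the PRIDE Cluster implementation.

```python
from collections import Counter
from typing import Optional


def reliable_identification(
    peptides: list[str], min_spectra: int = 10, min_ratio: float = 0.70
) -> Optional[str]:
    """Return the dominant peptide of a cluster if it passes the threshold, else None.

    `peptides` holds one submitted peptide identification per spectrum in the cluster.
    Assumption: ratio = (spectra with the most common identification) / (all spectra).
    """
    if len(peptides) < min_spectra:
        return None
    sequence, count = Counter(peptides).most_common(1)[0]
    ratio = count / len(peptides)
    return sequence if ratio > min_ratio else None


if __name__ == "__main__":
    cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
    print(reliable_identification(cluster))
```

With 8 of 10 spectra agreeing, the cluster passes; a cluster with fewer than 10 spectra, or with exactly 70% agreement, does not.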

PRIDE Cluster home page. http://www.ebi.ac.uk/pride/cluster/#/

PRIDE Cluster: result of searches. A couple of examples … http://www.ebi.ac.uk/pride/cluster/#/

Examples: one perfect cluster. 880 PSMs give the same peptide ID; 4 species; 28 datasets; same instruments.

Examples: one perfect cluster (2)

Examples: one perfect cluster (3) What does that peptide sequence correspond to?

Examples: very good cluster

Examples: very good cluster (2)

PRIDE Cluster – Spectral libraries http://www.ebi.ac.uk/pride/cluster/#/libraries

PRIDE Proteomes: reusing PRIDE Cluster data. Condensed and cross-dataset view of PRIDE Archive for identification data. Data filtering of PSMs is performed at the level of the submitted data. PSMs are grouped as peptide sequences. The peptide sequences are remapped to a recent version of UniProtKB (at present UniProtKB "complete proteome"). Linked to the original supporting evidence. "PRIDE Cluster" is used as extra evidence for the PSMs. http://wwwdev.ebi.ac.uk/pride/proteomes/
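The step "PSMs are grouped as peptide sequences", while keeping a link to the original supporting evidence, can be pictured as follows. The record layout (keys "sequence" and "assay") and the assay labels are hypothetical, chosen only for this illustration; they do not reflect the PRIDE Proteomes data model.

```python
from collections import defaultdict


def group_psms_by_peptide(psms: list[dict]) -> dict[str, list[str]]:
    """Group PSM records by peptide sequence, keeping the assay each PSM came from.

    Each record is a dict with hypothetical keys "sequence" and "assay".
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for psm in psms:
        grouped[psm["sequence"]].append(psm["assay"])
    return dict(grouped)


if __name__ == "__main__":
    psms = [
        {"sequence": "NMMAACDPR", "assay": "assay-1"},
        {"sequence": "PPECPDFDPPR", "assay": "assay-1"},
        {"sequence": "NMMAACDPR", "assay": "assay-2"},
    ]
    print(group_psms_by_peptide(psms))
```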

PRIDE Proteomes: using it to give reliability to IDs. Link to the PRIDE Cluster web. http://wwwdev.ebi.ac.uk/pride/proteomes/

Conclusions. Main characteristics of PRIDE Archive and ProteomeXchange. PX/PRIDE submission workflow for MS/MS data: PRIDE Inspector and the PX submission tool. PRIDE/ProteomeXchange has become the de facto standard for data submission and data availability in proteomics. PRIDE Proteomes and PRIDE Cluster: new resources.

Do you want to know a bit more…? http://www.slideshare.net/JuanAntonioVizcaino

Acknowledgements: the PRIDE Team and collaborators. People: Attila Csordas, Tobias Ternent, Noemi del Toro, Johannes Griss, Yasset Perez-Riverol, Henning Hermjakob. All past team members, especially Rui Wang, Florian Reisinger and Jose A. Dianes. All ProteomeXchange partners, especially Eric Deutsch and Nuno Bandeira.

Questions?