1 The University of Sydney, Camperdown, Sydney, NSW, 2050, Australia
2 Intersect, Level 13/50 Carrington St, Sydney, NSW, 2000, Australia
3 Queensland Cyber Infrastructure Foundation Ltd, Axon Building 47, University of Queensland, St Lucia, Brisbane, QLD, 4072, Australia
4 Children’s Medical Research Institute, Westmead, NSW, 2145, Australia
Background
The ACRF International Centre for the Proteome of Cancer (ProCan) at Children’s Medical Research Institute (CMRI) is an “industrial scale” program specialising in small-sample proteomics analysis from human cancer tissue.
ProCan seeks to generate both a wide and deep analytics pipeline and requires an enabling data framework. The framework must accommodate initial analysis and proteomic profiling of a large number of tumor samples, along with the clinical and demographic information, subsequent multi-omics studies, and any previously recorded responses to treatment. The curated datasets will provide a valuable resource beyond their primary use and ProCan is committed to making its data accessible to collaborators and the wider scientific community.
Objectives
The objective of the project is to establish an efficient, reliable, secure and ethical data sharing and publication framework based on best-practice data sharing principles, such as the FAIR principles. The framework must address the challenges that stem from the scale and complexity of the program, and from ProCan’s focus on human-derived data, which must be shared while maintaining the privacy of research participants.
Method
The project adopted a requirements-driven methodology and engaged with a wide range of ProCan stakeholders nationally and internationally. Together with these stakeholders, the project explored a range of industrial-scale proteomics data management and sharing scenarios to identify how the data could be shared robustly and ethically.
Results
The project developed a data sharing framework based on the FAIR principles that currently forms the basis of ongoing implementation work within the ProCan program.
Added: Feb 18, 2019
Slides: 17 pages
Slide Content
A FAIR Data Sharing Framework for
Large-Scale Human Cancer Proteogenomics
Brett Tully
28 Nov 2018
@brett_tully
Islam M (1,2), Christiansen J (3), Mahboob S (4), Valova V (4), Baker M (4), Capes-Davis D (4), Hains P (4), Balleine R (1,4), Zhong Q (1,4), Reddel R (1,4), Robinson P (1,4), Tully B (4)
Big-Data Approach to Clinical Decision Making
Delivering molecular data to cancer clinicians,
in a clinically-relevant time frame,
to maximise the accuracy of treatment decisions
Complex Project; Many Moving Parts
Co-Directors: Roger Reddel, Phil Robinson
Software Engineering: Brett Tully
Cancer Pathology: Rosemary Balleine
Cancer Data Science: Qing Zhong
Cancer Proteomics: Peter Hains
ProCan Data Lake Aggregates Many Sources
FAIR Data Sharing in ProCan
Findable
Accessible
Interoperable
Reusable
Accelerates scientific discovery
Enhances integrity, transparency,
and reproducibility
FAIR Data Sharing in ProCan
Findable
by both humans and machines
•Discoverable, well-defined metadata
•Persistent unique URLs, or Digital Object Identifiers (DOIs)
•Machine-readable metadata
ProCan Challenges
•Unique IDs and machine readable metadata are easy to create
•Human discoverability is much more difficult and context-dependent
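The contrast between the two challenges above can be illustrated with a minimal sketch. It assumes a hypothetical `DatasetRecord` with illustrative field names and placeholder identifiers; none of these names come from ProCan's actual systems. Emitting machine-readable metadata is a one-line serialisation, whereas the human-facing search shown here is deliberately naive and would need domain context to be useful.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str  # persistent unique identifier (placeholder values below)
    title: str
    keywords: list = field(default_factory=list)

    def to_json(self) -> str:
        # Machine-readable form: plain JSON with sorted keys.
        return json.dumps(asdict(self), sort_keys=True)


def find_by_keyword(records, term):
    """Naive human-facing discovery: case-insensitive exact keyword match."""
    term = term.lower()
    return [r for r in records if any(term == k.lower() for k in r.keywords)]


# Placeholder records, invented purely for illustration.
records = [
    DatasetRecord("example-id-0001", "Example proteomic profile A", ["proteomics", "breast"]),
    DatasetRecord("example-id-0002", "Example proteomic profile B", ["proteomics", "colorectal"]),
]
```

Calling `records[0].to_json()` yields a JSON string that a machine can parse, while `find_by_keyword(records, "breast")` only succeeds if the searcher happens to use the same term as the curator, which is one way the context dependence shows up.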
FAIR Data Sharing in ProCan
Accessible
using standard protocols
•Data retrievable by their unique identifier
•Open, free, and implementable protocol
•Can be subject to constraints: ethical, privacy, security, commercial
ProCan Challenges
•Integration of many domains: pathology, LIMS, multi-omics, analytics
•Integration of 100s of collaborators: each with different agreements
•Sustainable funding: ongoing costs for long-term storage & access
FAIR Data Sharing in ProCan
Interoperable
with other systems and data resources
•Industry/community standard formats & vocabularies
•Where possible, data accessible in open formats
•Minimal intervention required to combine with 3rd-party data
ProCan Challenges
•Proteomics data largely produced in proprietary vendor formats
•Pan-cancer = cross-discipline vocabularies = complex ontology
FAIR Data Sharing in ProCan
Reusable
and reproducible via richly described metadata
•Clear and accessible usage license
•Fully described provenance to community-defined standard
•Completeness meeting community-defined expectations
ProCan Challenges
•Existing public repositories are context-specific and non-overlapping
•Not all data generated internally; dependent on 3rd-party processes
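As a rough illustration of the reusability criteria on this slide, the sketch below checks whether a metadata record carries a usage license and provenance before release. The field names `license` and `provenance` and the function `missing_fields` are assumptions made for this example, not part of any ProCan schema or community standard.

```python
# Hypothetical required fields, loosely following the slide's criteria.
REQUIRED_FIELDS = ("license", "provenance")


def missing_fields(metadata: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required fields that are absent or empty in `metadata`."""
    return [name for name in required if not metadata.get(name)]


# Placeholder record: provenance was never supplied by the third party.
record = {"license": "example-license", "provenance": ""}
gaps = missing_fields(record)
```

A check like this is only a starting point: for data generated by third parties, the missing values have to be obtained from the external process, which the code cannot do.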
Proposed FAIR Shared Responsibility Framework
Data Custodian (DC)
Sample Provider/Collaborator
ProCan
Data Management
Access Management
Publication
Interoperability & Reusability
Data Quality
(Meta)Data standardisation
Hosting Institutes & Data Steward (ProCan)
Hosting Institute
Children’s Medical Research Institute (ProCan)
Collaborator’s Institute
International Repositories
Fit-for-purpose Infrastructure
Authentication
Data Storage
Compute
Retention
Discovery Services
Data Submission
Transfer Protocols
Proposed Risk-based Access, Sharing and Governance Model
[Figure: access matrix]
Axes: Data (Non-Sensitive to Sensitive); User (Most Trusted to Least Trusted)
Governance levels: Low, Moderate, High, Highest
Access levels: Open, Registered, Controlled
Proposed Risk-based Access, Sharing and Governance Model
Governance: Low; Access: Open
Axes: Data (Non-Sensitive to Sensitive); User (Most Trusted to Least Trusted)
QA & QC data
Published research output
Analysed data + minimum metadata
Proposed Risk-based Access, Sharing and Governance Model
Governance: Low; Access: Registered
Axes: Data (Non-Sensitive to Sensitive); User (Most Trusted to Least Trusted)
Lightly aggregated data; de-identified analysed data; non-identifiable clinical data
Moderately aggregated data; de-identified analysed data; non-identifiable clinical data
Highly aggregated data; de-identified analysed data; non-identifiable clinical & demographic data
Proposed Risk-based Access, Sharing and Governance Model
Governance: Moderate; Access: Registered
Axes: Data (Non-Sensitive to Sensitive); User (Most Trusted to Least Trusted)
Lightly aggregated data; de-identified analysed data; re-identifiable clinical data for this dataset only
Moderately aggregated data; de-identified analysed data; re-identifiable minimum clinical data for analysis
Sample verification and quality information; de-identified tissue sample and related metadata; re-identifiable minimum clinical & demographic data for analysis
Proposed Risk-based Access, Sharing and Governance Model
Governance: High; Access: Controlled
Axes: Data (Non-Sensitive to Sensitive); User (Most Trusted to Least Trusted)
Lightly aggregated data; genetics & multi-omics data; re-identifiable clinical data
Moderately aggregated data; genomics & multi-omics data; re-identifiable clinical data
Highly aggregated genomics & multi-omics data; commercial and pharmaceutical data; re-identifiable clinical & demographic data
Proposed Risk-based Access, Sharing and Governance Model
Governance: Highest; Access: Controlled
Axes: Data (Non-Sensitive to Sensitive); User (Most Trusted to Least Trusted)
Clinical decision support system – non-personal aggregated data; re-identifiable aggregated clinical data
Clinical decision support system; re-identifiable clinical & demographic data
Clinical decision support system; clinical & demographic data
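The governance and access pairings shown on the slides above (Low/Open, Low/Registered, Moderate/Registered, High/Controlled, Highest/Controlled) can be sketched as a small lookup. This is a minimal illustration, not ProCan's implementation: the names `Access`, `TIERS` and `may_access` are hypothetical, and the rule that a higher access level implies all lower ones is an assumption the slides do not state.

```python
from enum import Enum


class Access(Enum):
    OPEN = "Open"
    REGISTERED = "Registered"
    CONTROLLED = "Controlled"


# (governance level, access level) pairs in the order the slides present them.
TIERS = [
    ("Low", Access.OPEN),
    ("Low", Access.REGISTERED),
    ("Moderate", Access.REGISTERED),
    ("High", Access.CONTROLLED),
    ("Highest", Access.CONTROLLED),
]

# Assumed ordering of access levels, least to most restrictive.
_RANK = {Access.OPEN: 0, Access.REGISTERED: 1, Access.CONTROLLED: 2}


def may_access(user_access: Access, tier_index: int) -> bool:
    """Return True if `user_access` meets the access level of the given tier.

    Assumption: holding a more restrictive access level also grants the
    less restrictive ones.
    """
    _, required = TIERS[tier_index]
    return _RANK[user_access] >= _RANK[required]
```

Note that this check cannot distinguish the High and Highest tiers, since both use Controlled access; in the model they differ in governance, which would need additional rules (for example, per-dataset agreements) beyond this sketch.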
Acknowledgements
Co-Directors: Prof. Roger Reddel & Prof. Phil Robinson
Pathology: Prof. Rosemary Balleine & team
Proteomics: Dr Peter Hains & team
Software Engineering: Dr Brett Tully & team
Data Science: Dr Qing Zhong & team
*Intersect: Dr Mohammad Islam, Dr Jeff Christiansen