Improving accuracy and efficiency leveraging a pangenome reference
jackdigiovanna
37 slides
Oct 15, 2024
About This Presentation
Overview of why pangenome references can make a difference to patients and drug developers, framed around Tumor Mutational Burden (TMB) in under-represented populations.
Key metrics we consider:
1. Enhanced Genomic Accuracy: Learn how better sequencing read alignment leads to more accurate variant calling, particularly in complex regions like the MHC.
2. Efficiency Gains: Understand how these technologies cut computational costs and time, removing the need for expensive joint calling.
3. Compatibility: These methods fully integrate with existing toolsets.
We highlight the nine different reference populations that are available along with our global pangenome reference.
Finally, we suggest a proof of concept to show how these methods perform with your data.
Accurately estimating Tumor Mutational Burden can extend patient lives, especially for under-represented populations.
"The immune checkpoint inhibitor (ICI)
pembrolizumab is US FDA approved for treatment of
solid tumors with high tumor mutational burden
(TMB-high; ≥10 variants/Mb)."
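A minimal sketch of how the quoted cutoff is applied, assuming TMB is computed as somatic variants per megabase of sequenced territory. The function names, the 35 Mb territory size, and the variant count are illustrative assumptions, not values from the FDA label or from any specific assay.

```python
TMB_HIGH_THRESHOLD = 10.0  # variants per megabase, from the quoted approval text


def tumor_mutational_burden(somatic_variant_count: int, covered_bases: int) -> float:
    """Return somatic variants per megabase of covered sequence."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return somatic_variant_count / (covered_bases / 1_000_000)


def is_tmb_high(tmb: float, threshold: float = TMB_HIGH_THRESHOLD) -> bool:
    """Apply the >= 10 variants/Mb cutoff."""
    return tmb >= threshold


# Hypothetical sample: 420 somatic calls over a 35 Mb territory.
example_tmb = tumor_mutational_burden(420, 35_000_000)
print(example_tmb, is_tmb_high(example_tmb))  # 12.0 True
```

Because the cutoff is a hard threshold, a modest number of germline variants miscounted as somatic can be enough to move a sample from TMB-low to TMB-high.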
TMB is often overestimated in non-European populations.
Image credit: Nassar AH, Adib E, Alaiwi SA, et al., Cancer Cell (2022)
Figure legend: true somatic variant; germline false positive; germline filtered.
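The mechanism behind the overestimate can be sketched as follows, assuming a tumor-only pipeline that removes germline variants by looking them up in a population allele-frequency database. The variant keys, the 0.001 frequency cutoff, and the helper name are hypothetical. The point is only that a germline variant missing from the database passes the filter and is counted as somatic.

```python
def filter_germline(variants, population_af, max_af=0.001):
    """Keep variants whose database allele frequency is at most max_af.

    Variants absent from the database are treated as frequency 0.0,
    so they are always kept (and counted toward TMB).
    """
    return [v for v in variants if population_af.get(v, 0.0) <= max_af]


# Hypothetical tumor-only calls with their (normally unknown) true origin.
true_origin = {
    "chr1:100:A>T": "somatic",
    "chr2:200:G>C": "germline",  # common variant, present in the database
    "chr3:300:C>T": "germline",  # population-specific, absent from the database
}
# Hypothetical database built mostly from one ancestry group.
database_af = {"chr2:200:G>C": 0.12}

kept = filter_germline(true_origin, database_af)
germline_false_positives = [v for v in kept if true_origin[v] == "germline"]
print(kept)
print(germline_false_positives)
```

The fewer individuals of a given ancestry the database contains, the more germline variants fall into the "absent" case, which is consistent with the inflated tumor-only TMB shown in the figure.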
Patients with over-estimated TMB have poor response to checkpoint inhibitors.
▪ Patients whose TMB is estimated high, but is low, are disproportionately African American or Asian.
Image credit: Nassar AH, Adib E, Alaiwi SA, et al., Cancer Cell (2022)
Figure labels: TMB is high; TMB is low; TMB estimated high, but is low.
Precision medicine is creating amazing outcomes for some patients.
Karczewski, Konrad J., et al. "The mutational constraint spectrum quantified from variation in 141,456 humans." Nature (2020)
Introducing pangenome-aware secondary NGS data analysis
1. Reference bias
▪ Canonical linear reference genomes are now 20+ years old and do not account for all genomic variation.
▪ Read mapping against linear references leads to reference bias.
▪ This systemic blindness is a profound barrier to accurate genomic data analysis, especially in non-European populations.
2. Pioneering graph genomes
▪ Velsera has pioneered graph-reference-based secondary analysis technology since 2014.
▪ The patented technology is now available as robust, scalable, and validated pipelines ready for deployment in drug development.
▪ The draft human pangenome reference presented by the HPRC¹ has generated more awareness of how to solve the reference bias problem.
▪ State-of-the-art graph- and pangenome-based sequence analysis is now becoming the standard.
¹ Human Pangenome Reference Consortium
Better alignment creates better outcomes.
GRAF excels in accuracy, efficiency, & compatibility.
1. Best-in-class accuracy
▪ State-of-the-art accuracy across variant types, across WGS, WES and targeted assays
▪ Increased variant yield across SNPs, InDels, and structural variants, while reducing false positives
2. Cost-effectiveness
▪ Use less compute than traditional linear reference pipelines and currently established standards
▪ Obviate the need for costly joint calling & post-processing steps such as VQSR
3. Compatibility and no-friction deployment
▪ Drop-in replacement for your existing pipelines
▪ Utilize only established file formats and standards (BAM, VCF, BED…)
▪ Hosted and on-prem deployed solutions
▪ Compatible with your existing workflows
GRAF is exceptional at detecting SVs.
▪ GRAF excels at calling structural variation, finding 10-20x longer insertions and deletions with >3x better recall.
▪ GRAF identifies short and long structural variants with one single algorithm, in one single pass, with best-in-class accuracy across the board.
Performance improves by using representative pangenome references.
[Figure: SV counts by aligner and reference for African, Brazilian, and Native American samples.]
GRAF performs well in challenging medically relevant genes.
▪ Evaluated in portions of Genome in a Bottle datasets with the CMRG (challenging medically relevant genes) benchmark.
▪ GRAF workflow discovers significantly more indels than other solutions.
▪ False positive rate is low, further optimizing variant yield and reducing downstream analytical work.
Best in class in calling the MHC.
The major histocompatibility complex (MHC) region is one of the most polymorphic regions of the human genome:
• Genes play a vital role in the immune system
• Complex structural variations and point variants
• Highly polymorphic and difficult to map
More precise TMB calling with GRAF from tumor-only samples.
Analysis of 8 random WES samples from Nassar, Adib, & Alaiwi, et al. study
[Figure: TMB scores (axis 0 to 40) for GRCh38 and GRAF; groups labeled "Tumor only available" and "Tumor + Normal".]
Normalized TMB significantly different for GRAF vs GATK.
All values normalized to GATK-TN TMB estimate, n=8, mean +/- standard error
[Figure: Normalized TMB (axis 0.8 to 1.8) for GRAF-T, GATK-T, GRAF-TN, and GATK-TN. All results normalized to each sample's GATK-TN TMB score; significance marked with *.]
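The normalization described on this slide can be sketched as below, assuming each sample's TMB estimate is divided by that same sample's GATK-TN estimate and the ratios are then summarized as mean and standard error. The function names and the example numbers are made up for illustration and are not the study's data.

```python
from math import sqrt
from statistics import mean, stdev


def normalize_to_baseline(estimates, baseline):
    """Divide each sample's estimate by the same sample's baseline estimate."""
    if len(estimates) != len(baseline):
        raise ValueError("estimates and baseline must have the same length")
    if any(b == 0 for b in baseline):
        raise ValueError("baseline values must be non-zero")
    return [e / b for e, b in zip(estimates, baseline)]


def mean_and_sem(values):
    """Return (mean, standard error of the mean) using the sample std. dev."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    return mean(values), stdev(values) / sqrt(n)


# Hypothetical per-sample TMB scores (variants/Mb); NOT taken from the slide.
baseline_tn = [10.0, 20.0, 5.0, 8.0]    # tumor + normal baseline
tumor_only = [12.0, 22.0, 6.0, 10.0]    # tumor-only estimate, same samples

ratios = normalize_to_baseline(tumor_only, baseline_tn)
avg, sem = mean_and_sem(ratios)
print(f"normalized TMB = {avg:.3f} +/- {sem:.3f}")
```

Normalizing per sample removes the large between-patient spread in absolute TMB, so the remaining difference reflects the pipeline rather than the tumor.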
GRAF is faster and cheaper than the field.
[Figure: comparison of linear reference and pangenome reference.]
GRAF leverages posterior probability without joint calling.
Rescued and filtered GATK joint-called variants in African cohort: ~12M joint-call rescued, ~2M VQSR filtered.
▪ Pangenome references store allele frequencies
  Enables calls based on posterior probability, without joint calling
▪ GRAF retains sensitivity of traditional joint calling
  ~80% of variants rescued by GATK joint genotyper are called by GRAF workflow
▪ Pangenome results retain specificity gain of complex filtering methods
  ~80% of variants filtered out by VQSR are not called by GRAF workflow
▪ Obviating joint calling and VQSR artifact filtering drastically reduces compute requirements
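One way to read "calls based on posterior probability" is as a textbook Bayesian genotype update, where a stored allele frequency supplies the prior for a single sample. The sketch below assumes a biallelic site and a Hardy-Weinberg prior; the function names and numbers are hypothetical. It is not a description of GRAF's actual algorithm, which the slides do not specify.

```python
def hardy_weinberg_prior(alt_allele_freq):
    """Genotype priors for a biallelic site under Hardy-Weinberg equilibrium."""
    if not 0.0 <= alt_allele_freq <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    p = alt_allele_freq
    return {"0/0": (1 - p) ** 2, "0/1": 2 * p * (1 - p), "1/1": p ** 2}


def genotype_posterior(likelihoods, alt_allele_freq):
    """Combine per-genotype read likelihoods with an allele-frequency prior."""
    prior = hardy_weinberg_prior(alt_allele_freq)
    unnormalized = {g: likelihoods[g] * prior[g] for g in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all genotypes have zero posterior mass")
    return {g: v / total for g, v in unnormalized.items()}


# Hypothetical weak read evidence that slightly favors a heterozygous call.
weak_evidence = {"0/0": 0.4, "0/1": 0.6, "1/1": 0.0}

rare = genotype_posterior(weak_evidence, alt_allele_freq=0.001)
common = genotype_posterior(weak_evidence, alt_allele_freq=0.3)
print(max(rare, key=rare.get), max(common, key=common.get))
```

With the same weak evidence, the heterozygous call is rejected when the allele is rare in the reference population and accepted when it is common. Joint calling borrows this kind of information from the cohort; a prior stored with the reference supplies it for one sample at a time.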
GRAF is efficient and accurate in trio analysis.
Pangenome-based workflow for accurate detection of congenital mutations in family trios.
Benchmarking shows comparable results to state-of-the-art workflows, but at one third of the cost.
Upstream and downstream compatibility.
Standard file types and data formats
✓ Standard data formats representing reads, variants, references, annotations
✓ Standard genome coordinate system: works from linear reference backbones such as GRCh38
✓ Standard file types you are already working with: FASTQ, CRAM, BAM, VCF
Workflow:
Inputs: linear reference (GRCh38 backbone), pangenome reference, sequencing data (FASTQ, CRAM, BAM)
Read alignment: alignments with GRAF & linear coordinates (BAM, CRAM)
Variant calling: variant calls with pangenome and backbone annotations (VCF)
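Since the output is described as standard VCF on linear coordinates, ordinary tooling should be able to read it. As a minimal sketch of that claim, the standard-library snippet below tallies variant types from VCF text lines using only the fixed REF and ALT columns. The helper names and sample records are hypothetical, and no GRAF-specific annotation fields are assumed.

```python
from collections import Counter


def classify_allele(ref, alt):
    """Classify one REF/ALT pair by allele length."""
    if alt.startswith("<") or "[" in alt or "]" in alt or alt == "*":
        return "OTHER"  # symbolic alleles, breakends, spanning deletions
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "MNP"


def count_variant_types(vcf_lines):
    """Count variant types over an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        for alt in alts.split(","):
            counts[classify_allele(ref, alt)] += 1
    return counts


# Hypothetical records for illustration only.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t.\tPASS\t.",
    "chr2\t300\t.\tG\tGA,C\t.\tPASS\t.",
    "chr3\t400\t.\tN\t<DEL>\t.\tPASS\t.",
]
example_counts = count_variant_types(example_lines)
print(dict(example_counts))
```

In practice an established VCF library would replace this hand-rolled parser; the sketch only shows that nothing beyond the standard columns is needed to consume the calls.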
User-friendly experience.
Example: pangenome analysis with pangenotype-based alignment and calling
On Velsera Seven Bridges: ready-to-use, optimized bioinformatics pipelines and tools
Bioinformatics-ready experience.
Example: pangenome alignment visualization
Bioinformatics-ready: programmatic interface and fast, feature-rich notebook environments for bioinformatics-centric use
What population pangenomes are available today?
Pangenome references have been created from databases (e.g. gnomAD), iterative construction from 1K Genomes Project, and Human Genome Structural Variation Consortium references.
African
American
(admixed)
East Asian
European
South Asian
Korean
Icelandic
Finnish
Native
American
Brazilian
…
Global (~800k samples, 47 WG from HPRC)
How are population references made?
Pangenome references have been created from databases (e.g. gnomAD), iterative construction from 1K Genomes Project, and Human Genome Structural Variation Consortium references.
SNPs & INDELs:
▪ gnomAD v3 (N=21,042)
▪ Iterative construction using 1KGP African samples (N=520)
Structural Variants:
▪ HGSVC (N=10)
▪ Iterative construction using 1KGP African samples (N=416)
Bring Your Own Data to prove GRAF's performance in your hands.
Step 1: Kick-off
  Expected effort: Kick-off meeting detailing PoC execution and collaboration
  Outputs (deliverables): Success criteria and work plan defined
Step 2: Test data on prem or on platform
  Time span: 1 week
  Expected effort: Data upload, data format and quality review, formatting and staging
  Outputs (deliverables): Cohort, data set ready for analysis
Step 3: Pipeline execution, reporting
  Time span: 2-3 weeks
  Expected effort: Run data analysis pipelines, documenting metrics and results, technical interactions and support meetings
  Outputs (deliverables): Result files; reports, analysis, metrics
Step 4: Evaluation meetings
  Time span: 1-2 weeks
  Expected effort: Review and discussion of results, analysis of success criteria, outlining next steps and collaboration
  Outputs (deliverables): Conclusions; next steps / follow-up actions
You should run a POC using pangenome references.
Best in class results. Improve efficiency. Avoid missed insights.
▪ Find more true variants: Address reference bias and increase variant yield, whether for panels, exomes, or whole genomes.
▪ Represent population diversity: For samples not well represented in classical linear references, significantly increase variant yield.
▪ Reduce false positive calls: Optimize number of variants to be assessed and reviewed.
▪ Identify previously unseen variants: Optimize investment in genotyping efforts by analyzing your raw sequencing data with a state-of-the-art, best-in-class secondary analysis pipeline.
▪ Improve structural variant calling: Best-in-class structural variant calling with capability to call far larger SVs.
▪ Collaborate with experienced science team: Collaborate with the team that pioneered this technology, build specific references, and deploy with the help of a team of experts.
▪ Lower your computational cost: Avoid compute-intensive post-processing quality checks because of inherent optimal reference representation.
▪ Improve time to result: Compute-optimized algorithms with hosted and on-prem deployments.
▪ Drop-in enhancement, no changeover: Standard file formats, compatible with your existing tools and workflows. A drop-in toolset makes for frictionless adoption.
Further reading for better understanding.
Nature Genetics (link)
We described our pangenome-aware algorithms for analysis of sequencing samples, and presented results showing improved read mapping sensitivity with a 0.5% increase in variant calling recall, at a lower computational cost than standard methods.
Nature Communications (link)
Pangenome references represent diverse genetic information from different human populations, with the goal of overcoming linear references' inability to maintain the same level of accuracy for non-European ancestries.
Cell Genomics (link)
The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. GRAF scored best for accurate variant calling in the MHC region from Illumina short-read samples, which is critical for HLA typing.
bioRxiv preprint (link)
GRAF pangenome-generated variant calls from proband-parent trios significantly improve the accuracy of a consensus method for detection of de novo mutations, boosting both the sensitivity and specificity. This fully automated consensus method will enable identification of rare-disease-associated mutations in large family cohorts.