Improving accuracy and efficiency leveraging a pangenome reference

jackdigiovanna 49 views 37 slides Oct 15, 2024

About This Presentation

Overview of why pangenome references can make a difference to patients and drug developers, framed around Tumor Mutational Burden (TMB) in under-represented populations.
Key metrics we consider:
1. Enhanced Genomic Accuracy: Learn how better sequencing read alignment leads to more accurate variant...


Slide Content

Improving accuracy and efficiency leveraging a pangenome reference
Jack DiGiovanna, PhD
Chief Science Officer
BioTechX EU, 9 Oct 2024
© 2024 Velsera. All rights reserved. | www.velsera.com

2
Accurately estimating Tumor Mutational Burden can extend patient lives, especially for under-represented populations.

3
"The immune checkpoint inhibitor (ICI)
pembrolizumab is US FDA approved for treatment of
solid tumors with high tumor mutational burden
(TMB-high; ≥10 variants/Mb)."
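
The threshold in the quote is a simple ratio: the number of somatic variants divided by the size of the sequenced region in megabases. The sketch below is a minimal illustration of that arithmetic. The function names, the 35 Mb region size, and the variant counts are assumptions chosen for the example, not values from this deck. It also shows how germline variants that escape filtering could push a TMB-low sample over the threshold.

```python
TMB_HIGH_THRESHOLD = 10.0  # variants per megabase, as in the quoted FDA approval


def tumor_mutational_burden(variant_count: int, region_size_bp: int) -> float:
    """Return TMB as variants per megabase of sequenced region."""
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    return variant_count / (region_size_bp / 1_000_000)


def is_tmb_high(tmb: float, threshold: float = TMB_HIGH_THRESHOLD) -> bool:
    """Apply the >= threshold rule for a TMB-high call."""
    return tmb >= threshold


# Hypothetical sample: a 35 Mb exome with 280 true somatic variants,
# plus 105 germline variants that were wrongly kept as somatic.
EXOME_SIZE_BP = 35_000_000
true_tmb = tumor_mutational_burden(280, EXOME_SIZE_BP)
inflated_tmb = tumor_mutational_burden(280 + 105, EXOME_SIZE_BP)

print(f"true TMB: {true_tmb:.1f}/Mb, TMB-high: {is_tmb_high(true_tmb)}")
print(f"inflated TMB: {inflated_tmb:.1f}/Mb, TMB-high: {is_tmb_high(inflated_tmb)}")
```

In this made-up case the same tumor is classified differently depending only on how well germline variants were removed.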

4
TMB is often overestimated in non-European populations.
Image credit: Nassar AH, Adib E, Alaiwi SA, et al., Cancer Cell (2022)
Figure legend: true somatic variant; germline false positive; germline filtered

5
Patients with over-estimated TMB have poor response to checkpoint inhibitors.
▪ These patients are disproportionately African American or Asian.
Figure labels: TMB is high; TMB is low; TMB estimated high, but is low
Image credit: Nassar AH, Adib E, Alaiwi SA, et al., Cancer Cell (2022)

Precision medicine is creating amazing outcomes for some patients.

Karczewski, Konrad J., et al. "The mutational constraint spectrum quantified from variation in 141,456 humans." Nature (2020)

9
Introducing pangenome-aware secondary NGS data analysis
1. Reference bias
▪ Canonical linear reference genomes are now 20+ years old and do not account for all genomic variation.
▪ Read mapping against linear references leads to reference bias.
▪ This systemic blindness is a profound barrier to accurate genomic data analysis, especially in non-European populations.
2. Pioneering graph genomes
▪ Velsera has pioneered graph-reference-based secondary analysis technology since 2014.
▪ The patented technology is now available as robust, scalable, and validated pipelines ready for deployment in drug development.
▪ The draft human pangenome reference presented by the Human Pangenome Reference Consortium (HPRC) has generated more awareness of how to solve the reference bias problem.
▪ State-of-the-art graph- and pangenome-based sequence analysis is now becoming the standard.
Better alignment creates better outcomes.

10
GRAF improves alignment for known variants.
Example: using data from the whole-genome sample NA12878, both the graph aligner and the linear aligner BWA identify a known 41 bp insertion (rs141252781), but only GRAF calls the correct zygosity.
KNOWN VARIANTS
The graph aligner aligns significantly more reads and accurately detects variants, e.g. correctly identifying a variant as a homozygous insertion, while BWA erroneously detects it as heterozygous.
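
To see why the number of correctly aligned reads matters for zygosity, consider a naive genotyper that decides zygosity from the fraction of reads supporting the insertion. This is an illustrative sketch only: the thresholds, the read counts, and the function name are assumptions, and neither GRAF nor BWA-based callers are claimed to work this way. The idea is that reads carrying a long insertion can be aligned as if they matched the linear reference, which lowers the apparent alternate-allele fraction.

```python
def call_zygosity(alt_reads: int, ref_reads: int,
                  het_min: float = 0.2, hom_min: float = 0.8) -> str:
    """Call zygosity from the fraction of reads that support the alternate allele.

    het_min and hom_min are illustrative cut-offs, not values from any real caller.
    """
    total = alt_reads + ref_reads
    if total == 0:
        return "no-call"
    alt_fraction = alt_reads / total
    if alt_fraction >= hom_min:
        return "homozygous-alt"
    if alt_fraction >= het_min:
        return "heterozygous"
    return "homozygous-ref"


# Hypothetical pileup at a truly homozygous 41 bp insertion, 30 reads deep.
# Linear alignment: 14 insertion-carrying reads look like reference reads.
linear_call = call_zygosity(alt_reads=16, ref_reads=14)
# Graph alignment: the insertion is a path in the graph, so reads match it directly.
graph_call = call_zygosity(alt_reads=29, ref_reads=1)

print("linear reference:", linear_call)
print("graph reference:", graph_call)
```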

11
GRAF improves alignment for complex variants.
IDENTIFYING COMPLEX VARIANTS
Graph alignment facilitates the discovery of in-phase variants and other types of complex events.
SNPs often occur near larger variants such as insertions and deletions. SNPs are thus often missed in these regions when reads contain large mismatches.

12
GRAF improves alignment for compound variants.
COMPOUND VARIANTS
A graph reference easily places variants within existing variants to
facilitate alignment.
Compound variants are challenging for traditional alignment
and calling pipelines. GRAF is able to accurately call and
represent complex, compound variation.

13
GRAF improves alignment for structural variants.
ACROSS STRUCTURAL AND POINT VARIANTS
Velsera’s GRAF solution provides data structures, algorithms and graph
references that effectively capture genomic diversity.

15
GRAF excels in accuracy, efficiency, & compatibility.
1. Best-in-class accuracy
▪ State-of-the-art accuracy across variant types, across WGS, WES and targeted assays
▪ Increased variant yield across SNPs, InDels, and structural variants, while reducing false positives
2. Cost-effectiveness
▪ Use less compute than traditional linear reference pipelines and currently established standards
▪ Obviate the need for costly joint calling & post-processing steps such as VQSR
3. Compatibility and no-friction deployment
▪ Drop-in replacement for your existing pipelines
▪ Utilize only established file formats and standards (BAM, VCF, BED…)
▪ Hosted and on-prem deployed solutions
▪ Compatible with your existing workflows

16
GRAF is exceptional at detecting SVs.
GRAF excels at calling structural variation, finding 10-20x longer insertions and deletions with >3x better recall.
GRAF identifies short and long structural variants with one single algorithm, in one single pass, with best-in-class accuracy across the board.

18
Performance improves by using representative pangenome references.
Figure: SV counts by aligner reference; groups: African, Brazilian, Native American.

20
GRAF performs well in challenging medically-relevant genes.
Evaluated in portions of Genome in a Bottle datasets with CMRG.
GRAF workflow discovers significantly more indels than other solutions.
False positive rate is low, further optimizing variant yield and reducing downstream analytical work.

21
Best in class in calling the MHC.
The major histocompatibility complex (MHC) region is one of the most polymorphic regions of the human genome:
• Genes play a vital role in the immune system
• Complex structural variations and point variants
• Highly polymorphic and difficult to map

22
More precise TMB calling with GRAF from tumor-only samples.
Analysis of 8 random WES samples from the Nassar, Adib, & Alaiwi, et al. study
Figure: TMB scores (axis 0 to 40) for GRCh38 vs. GRAF; groups: tumor only available; tumor + normal.

24
Normalized TMB significantly different for GRAF vs GATK.
All values normalized to each sample's GATK-TN TMB score, n=8, mean +/- standard error
Figure: normalized TMB (axis 0.8 to 1.8) for GRAF-T, GATK-T, GRAF-TN, and GATK-TN; an asterisk (*) marks a significant difference.
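
The normalization described on this slide can be sketched as follows: each sample's TMB score is divided by the same sample's GATK tumor-normal (GATK-TN) score, and the normalized values are then summarized as mean +/- standard error. The scores below are hypothetical placeholders, not the deck's data, and the helper names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def normalize_to_baseline(scores, baseline_scores):
    """Divide each sample's score by the same sample's baseline score."""
    if len(scores) != len(baseline_scores):
        raise ValueError("scores and baseline_scores must have equal length")
    return [score / base for score, base in zip(scores, baseline_scores)]


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) using the sample standard deviation."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical TMB scores for n=8 samples (placeholders, not measured values).
gatk_tn = [5.0, 8.0, 10.0, 4.0, 20.0, 12.5, 6.0, 16.0]  # baseline: tumor + normal
gatk_t = [9.0, 12.0, 13.0, 7.0, 26.0, 20.0, 9.0, 21.0]  # tumor only

normalized_gatk_t = normalize_to_baseline(gatk_t, gatk_tn)
avg, sem = mean_and_standard_error(normalized_gatk_t)
print(f"GATK-T normalized TMB: {avg:.2f} +/- {sem:.2f}")
```

By construction the baseline normalizes to exactly 1.0 for every sample, so a bar above 1.0 means that method reports a higher TMB than tumor-normal GATK on the same sample.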

26
GRAF is faster and cheaper than the field.
Figure labels: linear reference; pangenome reference

GRAF leverages posterior probability without joint-calling.
Figure: rescued and filtered GATK joint-called variants in an African cohort (~12M joint-call rescued; ~2M VQSR filtered)
▪ Pangenome references store allele frequencies: this enables calls based on posterior probability, without joint calling
▪ GRAF retains the sensitivity of traditional joint calling: ~80% of variants rescued by the GATK joint genotyper are called by the GRAF workflow
▪ Pangenome results retain the specificity gain of complex filtering methods: ~80% of variants filtered out by VQSR are not called by the GRAF workflow
▪ Obviating joint calling and VQSR artifact filtering drastically reduces compute requirements
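
The idea of calling from a posterior probability with stored allele frequencies can be illustrated with a textbook Bayesian genotyper. This sketch is not GRAF's model: it assumes a Hardy-Weinberg prior built from a population allele frequency and a binomial read likelihood with a fixed error rate, and all names and numbers are illustrative assumptions.

```python
from math import comb


def genotype_posteriors(alt_reads: int, total_reads: int,
                        alt_allele_freq: float, error_rate: float = 0.01) -> dict:
    """Posterior over genotypes 0/0, 0/1, 1/1 for one sample at one site.

    Prior: Hardy-Weinberg proportions from the population allele frequency.
    Likelihood: binomial, with a genotype-specific chance of seeing an alt read.
    """
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    if not 0.0 < alt_allele_freq < 1.0:
        raise ValueError("alt_allele_freq must be strictly between 0 and 1")
    p = alt_allele_freq
    priors = {"0/0": (1 - p) ** 2, "0/1": 2 * p * (1 - p), "1/1": p ** 2}
    alt_read_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    unnormalized = {}
    for genotype, prior in priors.items():
        q = alt_read_prob[genotype]
        likelihood = (comb(total_reads, alt_reads)
                      * q ** alt_reads * (1 - q) ** (total_reads - alt_reads))
        unnormalized[genotype] = prior * likelihood
    evidence = sum(unnormalized.values())
    return {genotype: value / evidence for genotype, value in unnormalized.items()}


# Same weak evidence (2 alt reads out of 10), two hypothetical allele frequencies.
rare = genotype_posteriors(alt_reads=2, total_reads=10, alt_allele_freq=0.0001)
common = genotype_posteriors(alt_reads=2, total_reads=10, alt_allele_freq=0.3)
print("rare allele   ->", max(rare, key=rare.get))
print("common allele ->", max(common, key=common.get))
```

With a population-specific prior, a single sample's weak evidence can be resolved without pooling reads across a cohort, which is the role joint calling usually plays.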

GRAF is efficient and accurate in trio analysis.
Pangenome-based workflow for accurate detection of congenital mutations in family trios.
Benchmarking shows results comparable to state-of-the-art workflows at one third of the cost.

Upstream and downstream compatibility.
Compatible: standard file types and data formats
✓ Standard data formats representing reads, variants, references, annotations
✓ Standard genome coordinate system: works from linear reference backbones such as GRCh38
✓ Standard file types you are already working with: FASTQ, CRAM, BAM, VCF
Workflow diagram:
Inputs: linear reference (GRCh38 backbone); pangenome reference; sequencing data (FASTQ, CRAM, BAM)
Read alignment output: alignments with GRAF & linear coordinates (BAM, CRAM)
Variant calling output: variant calls with pangenome and backbone annotations (VCF)

31
User-friendly experience.
Example: pangenome analysis, with pangenotype-based alignment and calling
On Velsera Seven Bridges: ready-to-use, optimized bioinformatics pipelines and tools

32
Bioinformatics-ready experience.
Example: pangenome alignment visualization
Bioinformatics-ready: programmatic interface and fast, feature-rich notebook environments for bioinformatics-centric use

33
What population pangenomes are available today?
Pangenome references have been created from databases (e.g. gnomAD), iterative construction from the 1K Genomes Project, and Human Genome Structural Variation Consortium references.
▪ African
▪ American (admixed)
▪ East Asian
▪ European
▪ South Asian
▪ Korean
▪ Icelandic
▪ Finnish
▪ Native American
▪ Brazilian
▪ Global (~800k samples, 47 WG from HPRC)

34
How are population references made?
SNPs & INDELs
▪ gnomAD v3 (N=21,042)
▪ Iterative construction using 1KGP African samples (N=520)
Structural Variants
▪ HGSVC (N=10)
▪ Iterative construction using 1KGP African samples (N=416)

35
Bring Your Own Data to prove GRAF's performance in your hands.
Step 1: Kick-off
  Expected effort: Kick-off meeting detailing PoC execution and collaboration
  Deliverables: Success criteria and work plan defined
Step 2: Test data on prem or on platform
  Time span: 1 week
  Expected effort: Data upload, data format and quality review, formatting and staging
  Deliverables: Cohort, data set ready for analysis
Step 3: Pipeline execution, reporting
  Time span: 2-3 weeks
  Expected effort: Run data analysis pipelines, documenting metrics and results, technical interactions and support meetings
  Deliverables: Result files; reports, analysis, metrics
Step 4: Evaluation meetings
  Time span: 1-2 weeks
  Expected effort: Review and discussion of results, analysis of success criteria, outlining next steps and collaboration
  Deliverables: Conclusions; next steps / follow-up actions

You should run a POC using pangenome references.
Themes: best in class results; improve efficiency; avoid missed insights
▪ Find more true variants: address reference bias and increase variant yield, whether for panels, exomes, or whole genomes.
▪ Represent population diversity: for samples not well represented in classical linear references, significantly increase variant yield.
▪ Reduce false positive calls: optimize the number of variants to be assessed and reviewed.
▪ Identify previously unseen variants: optimize investment in genotyping efforts by analyzing your raw sequencing data with a state-of-the-art, best-in-class secondary analysis pipeline.
▪ Improve structural variant calling: best in class structural variant calling with capability to call far larger SVs.
▪ Collaborate with experienced science team: collaborate with the team that pioneered this technology, build specific references, and deploy with the help of a team of experts.
▪ Lower your computational cost: avoid compute-intensive post-processing quality checks because of inherent optimal reference representation.
▪ Improve time to result: compute-optimized algorithms with hosted and on-prem deployments.
▪ Drop-in enhancement, no changeover: standard file formats, compatible with your existing tools and workflows; a drop-in toolset makes for frictionless adoption.

37
Further reading for better understanding.
Nature Genetics (link)
We described our pangenome-aware algorithms for analysis of sequencing samples, and presented results showing improved read mapping sensitivity with a 0.5% increase in variant calling recall, at a lower computational cost than standard methods.
Nature Communications (link)
Pangenome references represent diverse genetic information from different human populations, with the goal of overcoming linear references' inability to maintain the same level of accuracy for non-European ancestries.
Cell Genomics (link)
The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. GRAF scored best for accurate variant calling in the MHC region from Illumina short-read samples, which is critical for HLA typing.
bioRxiv preprint (link)
GRAF pangenome-generated variant calls from proband-parent trios significantly improve the accuracy of a consensus method for detection of de novo mutations, boosting both the sensitivity and specificity. This fully automated consensus method will enable identification of rare-disease-associated mutations in large family cohorts.

Thank you.
[email protected]
[email protected]
Booth #1038
BioTechX EU