Snippy - Rapid bacterial variant calling - UK - Tue 5 May 2015

torstenseemann 3,183 views 38 slides May 23, 2015

About This Presentation

Using Snippy to call variants in bacterial short read datasets via alignment to reference, and then using these alignments to produce core SNP alignments for phylogenomics.

Added: May 23, 2015

Slides: 38 pages

Slide Content

Snippy
Torsten Seemann
Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015
Rapid bacterial variant calling
& core genome alignments

Background

(Far) south east England

Phyloflagomics
UK / Birmingham
Australia / Victoria
Canada / British Columbia

A new home
Centre for Applied Microbial Genomics

Microbiological Diagnostic Unit
∷Oldest public health lab in Australia
:established 1897 in Melbourne
:large historical isolate collection back to 1950s
∷National reference laboratory
:Salmonella, Listeria, EHEC
∷WHO regional reference lab
:vaccine preventable invasive bacterial pathogens

New director
∷Professor Ben Howden
:clinician, microbiologist, pathologist
:early adopter of genomics and bioinformatics
:long term collaborator on MRSA/VRE w/ Tim Stinear

∷Mandate
:modernise service delivery
:enhance research output and collaboration
:nationally lead the conversion to WGS

Hardware
∷Sequencers
:NextSeq 500
:3 x MiSeq
:PacBio RS II (arriving 22 May)
∷Robots
:Perkin Elmer (does not have a Twitter account)
:Colony picker
∷Compute
:240 TB, 10 GigE, 3 x 72 core boxes

Variant calling

Variant calling
∷Find DNA differences between genomes
:variants to explain phenotype
:validate your complemented mutant

∷Two approaches
:reference based (read alignment)
:reference-free (de novo assembly / k-mer based)

Types of variants
∷Substitutions
:single nucleotide polymorphism (snp) A➝C
:multiple nucleotide polymorphism (mnp) AG➝TC
∷Indels
:insertion (ins) A➝AC
:deletion (del) ACCG➝AG
∷Complex
:compound events AC➝T
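
The five categories above can be told apart from the REF and ALT strings alone. The following is a minimal sketch of one way to do that, not Snippy's own logic (Snippy takes the type from its variant caller); the function name `classify_variant` and the rule that an equal-length change with some matching bases counts as "complex" are assumptions made for illustration.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a REF -> ALT change as snp, mnp, ins, del or complex.

    Illustrative rules (an assumption, not Snippy's implementation):
      - same length, every base differs : snp (length 1) or mnp (length > 1)
      - same length, some bases match   : complex
      - different length, and the longer allele is the shorter one with a
        single contiguous block added   : ins or del
      - anything else                   : complex
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical, so there is no variant")

    if len(ref) == len(alt):
        if all(r != a for r, a in zip(ref, alt)):
            return "snp" if len(ref) == 1 else "mnp"
        return "complex"

    shorter, longer = sorted((ref, alt), key=len)
    # Try every split of the shorter allele into a prefix and a suffix that
    # flank one inserted (or deleted) block in the longer allele.
    for i in range(len(shorter) + 1):
        if longer.startswith(shorter[:i]) and longer.endswith(shorter[i:]):
            return "ins" if len(alt) > len(ref) else "del"
    return "complex"


if __name__ == "__main__":
    # The examples from the slide.
    for ref, alt in [("A", "C"), ("AG", "TC"), ("A", "AC"),
                     ("ACCG", "AG"), ("AC", "T")]:
        print(f"{ref} -> {alt}: {classify_variant(ref, alt)}")
```

Run on the slide's examples, this labels them snp, mnp, ins, del and complex respectively.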

My solution

Snippy
∷Fast → snappy
∷Finds variants → SNPs
∷Australian → Skippy the bush kangaroo

Input
∷FASTQ files
:paired-end, interleaved, or single-end

∷Reference
:FASTA or Genbank

∷Output folder
:self-contained bundle of results
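
The three inputs above map onto a single command line. The sketch below only assembles an argument list and does not run anything. The flag names (`--ref`, `--outdir`, `--cpus`, `--R1`, `--R2`, `--se`, `--peil`) are taken from the current Snippy README and are an assumption for the version presented in this talk; the helper `snippy_command` and the file names are hypothetical.

```python
from typing import List, Optional


def snippy_command(ref: str, outdir: str,
                   r1: Optional[str] = None, r2: Optional[str] = None,
                   single: Optional[str] = None,
                   interleaved: Optional[str] = None,
                   cpus: int = 8) -> List[str]:
    """Build (but do not execute) a Snippy invocation.

    Flag names are assumed from the current Snippy README; check
    `snippy --help` for the installed version before relying on them.
    """
    modes = [bool(r1 and r2), single is not None, interleaved is not None]
    if sum(modes) != 1:
        raise ValueError("give exactly one of: r1+r2, single, interleaved")

    cmd = ["snippy", "--cpus", str(cpus), "--outdir", outdir, "--ref", ref]
    if r1 and r2:
        cmd += ["--R1", r1, "--R2", r2]      # paired-end reads
    elif single is not None:
        cmd += ["--se", single]              # single-end reads
    else:
        cmd += ["--peil", interleaved]       # interleaved paired-end reads
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(snippy_command("ref.gbk", "isolate1",
                                  r1="isolate1_R1.fastq.gz",
                                  r2="isolate1_R2.fastq.gz")))
```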

Inside the black box
∷bwa mem - no clipping needed
∷samtools - sorted, filtered BAM
∷freebayes - split / GNU parallel / merge
∷vcflib/vcftools - VCF filtering
∷perl - glue
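
The "split / GNU parallel / merge" step amounts to cutting the reference into regions, calling variants on each region independently, and concatenating the results. Below is a minimal sketch of the split step only. The function name `make_regions`, the chunk size, and the `name:start-end` region strings (0-based start, end-exclusive) are assumptions for illustration; the exact convention Snippy passes to freebayes is not stated in the slides.

```python
from typing import Dict, List


def make_regions(contig_lengths: Dict[str, int], chunk_size: int) -> List[str]:
    """Split each contig into consecutive chunks of at most chunk_size bases.

    Returns region strings of the form "name:start-end" with a 0-based start
    and an exclusive end (an assumed convention for this sketch).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            end = min(start + chunk_size, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


if __name__ == "__main__":
    # Hypothetical contig lengths: a chromosome and a plasmid.
    for region in make_regions({"chr": 2500, "plas": 800}, chunk_size=1000):
        print(region)
```

Each region can then be handed to a separate worker, which is what lets the calling step scale with the number of cores.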

Outputs
∷Read alignments
:.bam / .bai
∷Variants
:.vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html
∷Consensus
:reference with all variants applied to it
∷Genome alignment
:reference with “-” (missing) and “N” (low depth)
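
The consensus output is described as the reference with all variants applied to it. A minimal sketch of that idea follows, assuming 1-based positions and non-overlapping variants; `apply_variants` and the toy sequence are hypothetical and this is not Snippy's own code.

```python
from typing import Iterable, Tuple


def apply_variants(reference: str,
                   variants: Iterable[Tuple[int, str, str]]) -> str:
    """Return the reference with each (pos, ref, alt) variant applied.

    pos is 1-based. Variants are assumed not to overlap. They are applied
    from right to left so that indels do not shift the coordinates of the
    variants still to be applied.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF {ref!r} does not match reference at {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


if __name__ == "__main__":
    toy_reference = "GATTACCAGC"
    toy_variants = [(2, "A", "C"),    # snp
                    (5, "AC", "A"),   # deletion of one base
                    (9, "G", "GT")]   # insertion of one base
    print(apply_variants(toy_reference, toy_variants))  # GCTTACAGTC
```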

TAB output
CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein DnaA
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  GATC:28 AATA:0
plas   3221    mnp      GA    CT    CT:39 CT:0      CDS    +                       ECO_p012   rep   hypothetical protein
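
Because the TAB output is a plain tab-separated file with a header row, it can be read with a standard CSV reader. The sketch below assumes the column names shown above and that EVIDENCE holds space-separated `allele:count` pairs; `read_tab` and `parse_evidence` are hypothetical helper names, and the sample text reuses two rows from the slide.

```python
import csv
import io
from typing import Dict, List, TextIO

HEADER = ["CHROM", "POS", "TYPE", "REF", "ALT", "EVIDENCE", "FTYPE",
          "STRAND", "NT_POS", "AA_POS", "LOCUS_TAG", "GENE", "PRODUCT"]
ROWS = [
    ["chr", "5958", "snp", "A", "G", "G:44 A:0", "CDS", "+",
     "41/600", "13/200", "ECO_0001", "dnaA", "replication protein DnaA"],
    ["chr", "35524", "snp", "G", "T", "T:73 G:1 C:1", "tRNA", "-",
     "", "", "", "", ""],
]
SAMPLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"


def read_tab(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated variant table into a list of row dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def parse_evidence(evidence: str) -> Dict[str, int]:
    """Turn 'G:44 A:0' into {'G': 44, 'A': 0}."""
    counts = {}
    for item in evidence.split():
        allele, count = item.rsplit(":", 1)
        counts[allele] = int(count)
    return counts


if __name__ == "__main__":
    for row in read_tab(io.StringIO(SAMPLE)):
        print(row["CHROM"], row["POS"], row["TYPE"],
              parse_evidence(row["EVIDENCE"]))
```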

Phylogenomics

Phylogenetics 101
∷Choose some genes
∷Sequence each gene from each isolate
∷Align the protein sequences of each gene
∷Back-align to nucleotide space
∷Concatenate all the alignments
∷Construct a distance matrix (many ways)
∷Draw a tree (many ways)
∷Make wild inferences from little data

Phylogenomics 101
∷Assemble each genome

∷Perform whole genome alignment
:in nucleotide space, as we don’t know what is coding
:very computationally expensive
:can’t parallelize as with individual genes

∷Continue as for phylogenetics

bug1 GATTACCAGCATTAAGG-TTCTCCAATC
bug2 GAT---CTGCATTATGGATTCRNCATTC
bug3 G-TTACCAGCACTAA-------CCAGTC

∷Ideally, feed this directly to a tree builder
∷Properly model gaps, codons and ambiguity
∷Hard!
Whole genome alignment

Core genome SNPs

bug1 GATTACCAGCATTAAGG-TTCTCCAATC
bug2 GAT---CTGCATTATGGATTCRNCATTC
bug3 G-TTACCAGCACTAA-------CCAGTC
core | |   |||||||||       ||||||

Core sites are present in all genomes.
Core genome
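
The "core" row can be computed directly from the alignment: a column is core when no genome has a gap in it. The sketch below uses the three-genome example from the slide; `core_mask` is a hypothetical helper, and treating only "-" as missing (so that "N" still counts as present) is an assumption chosen to match the slide's core row.

```python
from typing import Dict

ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_mask(alignment: Dict[str, str], gap: str = "-") -> str:
    """Return a string with '|' under every column that has no gap."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = zip(*alignment.values())
    return "".join("|" if gap not in column else " " for column in columns)


if __name__ == "__main__":
    for name, seq in ALIGNMENT.items():
        print(f"{name} {seq}")
    mask = core_mask(ALIGNMENT)
    print(f"core {mask}")
    print(f"{mask.count('|')} core sites")  # 17 core sites
```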

bug1 GATTACCAGCATTAAGG-TTCTCCAATC
bug2 GAT---CTGCATTATGGATTCRNCATTC
bug3 G-TTACCAGCACTAA-------CCAGTC
core | |   |||||||||       ||||||
SNPs        |   |  |       |  |

Core SNPs = polymorphic sites in core genome
Core SNPs
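
Core SNPs are then the core columns in which the genomes do not all carry the same character. A minimal sketch on the same example alignment follows; `core_snp_positions` is a hypothetical name and positions are reported 1-based.

```python
from typing import Dict, List

ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_snp_positions(alignment: Dict[str, str], gap: str = "-") -> List[int]:
    """Return 1-based positions of gap-free columns with more than one state."""
    positions = []
    for index, column in enumerate(zip(*alignment.values()), start=1):
        is_core = gap not in column
        is_polymorphic = len(set(column)) > 1
        if is_core and is_polymorphic:
            positions.append(index)
    return positions


if __name__ == "__main__":
    print(core_snp_positions(ALIGNMENT))  # [8, 12, 15, 23, 26]
```

Position 23 is reported only because bug2 has an "N" there, which is why the next slide narrows the set to unambiguous sites.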

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
SNPs’        |   |  |          |

Unambiguous core SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
SNPs’        |   |  |          |
site         1   2  3          4
alleles (bug1, bug2, bug3): 1 = ata, 2 = ttc, 3 = ata, 4 = atg

Allele sites
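
Pulling out the unambiguous core SNP columns gives one short sequence per genome. The sketch below keeps only columns where every genome has A, C, G or T and at least two different bases occur, then prints the result as FASTA; `core_snp_alleles` is a hypothetical helper, not a Snippy function.

```python
from typing import Dict, List, Tuple

ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_snp_alleles(alignment: Dict[str, str]) -> Tuple[List[int], Dict[str, str]]:
    """Return (1-based positions, per-genome allele strings).

    A column is kept when every genome has an unambiguous base (A, C, G, T)
    and the column is polymorphic. Gaps, N and IUPAC codes exclude a column.
    """
    names = list(alignment)
    positions = []
    alleles = {name: [] for name in names}
    for index, column in enumerate(zip(*alignment.values()), start=1):
        unambiguous = all(base in "ACGT" for base in column)
        if unambiguous and len(set(column)) > 1:
            positions.append(index)
            for name, base in zip(names, column):
                alleles[name].append(base)
    return positions, {name: "".join(bases) for name, bases in alleles.items()}


if __name__ == "__main__":
    positions, sequences = core_snp_alleles(ALIGNMENT)
    print(positions)  # [8, 12, 15, 26]
    for name, seq in sequences.items():
        print(f">{name}\n{seq}")
```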

>bug1
ATAA
>bug2
TTTT
>bug3
ACAG
Alignment ⇢ Tree
+------ bug3
|
---+--- bug1
|
+--------- bug2

--- 1 SNP
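
One simple input for a tree builder is the pairwise SNP distance between the allele sequences. The sketch below counts differing sites for each pair; it shows only the distance step, not any particular tree-building method, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple

ALLELES = {"bug1": "ATAA", "bug2": "TTTT", "bug3": "ACAG"}


def snp_distance(a: str, b: str) -> int:
    """Count the sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(sequences: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return the SNP distance for every unordered pair of genomes."""
    return {(p, q): snp_distance(sequences[p], sequences[q])
            for p, q in combinations(sequences, 2)}


if __name__ == "__main__":
    for (p, q), d in distance_matrix(ALLELES).items():
        print(f"{p} vs {q}: {d} SNPs")
```

On this example bug1 and bug3 are the closest pair (2 SNPs) and bug2 and bug3 the most distant (4 SNPs), which agrees with bug2 sitting on the longest branch in the tree above.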

The N±1 problem

Aligning to reference
∷Why is whole genome alignment not used?
:involves genome (mis)assembly
:computationally difficult
:expensive to add or remove isolates

∷Short-cut
:choose a single reference
:align each isolate’s reads to the reference
:core, by definition, must include the reference

Read mapping considerations
∷Choice of reference

∷Too divergent?
:reads may not align well
:will get too many core genome SNPs

∷One solution
:Assemble one isolate and use as the reference

SNPs         |   |  |       |  |
core   | |   |||||||||       ||||||
bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core1  |||   ||||||||||| ||||||||||
SNPs1         |      |      ||  |

Remove taxon, different core (1)

SNPs         |   |  |       |  |
core   | |   |||||||||       ||||||
bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core2  | |   |||||||||       ||||||
SNPs2         |   |  |       |  |

Remove taxon, different core (2)

SNPs         |   |  |       |  |
core   | |   |||||||||       ||||||
bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core3  | |||||||||||||       ||||||
SNPs3             |             |

Remove taxon, different core (3)

Core genome alignments
∷Core SNP alignments
:can shift dramatically with taxa content
:we are only using globally conserved sites
:remember variation still exists outside “core”

∷Snippy will keep the full alignments
:quickly derive subsets on the fly
:adding isolates can be done quickly too
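
Keeping the full per-isolate alignments means the core can be recomputed for any subset of taxa without realigning. The sketch below illustrates this on the three-genome example by counting core sites and core SNPs for each choice of taxa; `core_stats` is a hypothetical helper and the counts follow the same gap-only definition of "core" used in the slides.

```python
from typing import Dict, Iterable, Tuple

ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_stats(alignment: Dict[str, str], taxa: Iterable[str],
               gap: str = "-") -> Tuple[int, int]:
    """Return (core sites, core SNPs) for the chosen subset of taxa."""
    rows = [alignment[name] for name in taxa]
    n_core = n_snps = 0
    for column in zip(*rows):
        if gap in column:
            continue                  # not present in every chosen genome
        n_core += 1
        if len(set(column)) > 1:
            n_snps += 1               # polymorphic core site
    return n_core, n_snps


if __name__ == "__main__":
    subsets = [("bug1", "bug2", "bug3"),
               ("bug1", "bug2"),      # bug3 removed
               ("bug2", "bug3"),      # bug1 removed
               ("bug1", "bug3")]      # bug2 removed
    for taxa in subsets:
        core, snps = core_stats(ALIGNMENT, taxa)
        print(f"{'+'.join(taxa)}: {core} core sites, {snps} core SNPs")
```

Dropping bug3 grows the core from 17 to 24 sites, while dropping bug2 leaves only 2 core SNPs, so the SNP set depends strongly on which taxa are included.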

Conclusion

Snippy summary
∷The good
:Fast, scales to 100 cores
:Simple, clean interface and output

∷The bad
:Doesn’t yet report full variant consequences using snpEff

∷The ugly?
:Written in Perl

Contact
∷tseemann.github.io

∷github.com/tseemann/snippy

∷@torstenseemann
