Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

torstenseemann 3,183 views 38 slides May 23, 2015

About This Presentation

Using Snippy to call variants in bacterial short-read datasets via alignment to a reference, and then using these alignments to produce core SNP alignments for phylogenomics.


Slide Content

Snippy
Torsten Seemann
Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015
Rapid bacterial variant calling
& core genome alignments

Background

(Far) south east England

Phyloflagomics
UK / Birmingham
Australia / Victoria
Canada / British Columbia

A new home
Centre for Applied
Microbial Genomics

Microbiological Diagnostic Unit
∷Oldest public health lab in Australia
:established 1897 in Melbourne
:large historical isolate collection back to 1950s
∷National reference laboratory
:Salmonella, Listeria, EHEC
∷WHO regional reference lab
:vaccine preventable invasive bacterial pathogens

New director
∷Professor Ben Howden
:clinician, microbiologist, pathologist
:early adopter of genomics and bioinformatics
:long term collaborator on MRSA/VRE w/ Tim Stinear

∷Mandate
:modernise service delivery
:enhance research output and collaboration
:nationally lead the conversion to WGS

Hardware
∷Sequencers
:NextSeq 500
:3 x MiSeq
:PacBio RS II (arriving 22 May)
∷Robots
:Perkin Elmer (does not have a Twitter account)
:Colony picker
∷Compute
:240 TB, 10 GigE, 3 x 72 core boxes

Variant calling

Variant calling
∷Find DNA differences between genomes
:variants to explain phenotype
:validate your complemented mutant

∷Two approaches
:reference based (read alignment)
:reference-free (de novo assembly / k-mer based)

Types of variants
∷Substitutions
:single nucleotide polymorphism (snp) A ➝ C
:multiple nucleotide polymorphism (mnp) AG ➝ TC
∷Indels
:insertion (ins) A ➝ AC
:deletion (del) ACCG ➝ AG
∷Complex
:compound events AC ➝ T
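
The five labels above can be told apart from the REF and ALT strings alone. The following is a minimal sketch of one way to do that; the function name and the rules (an mnp requires every position to differ, an indel requires the shorter allele to be fully explained by shared flanking bases) are assumptions for illustration, not the logic used by freebayes or Snippy.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a REF -> ALT change as snp, mnp, ins, del or complex (heuristic)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical; not a variant")

    if len(ref) == len(alt):
        if len(ref) == 1:
            return "snp"
        # Assumption: call it an mnp only when every position differs.
        if all(r != a for r, a in zip(ref, alt)):
            return "mnp"
        return "complex"

    # Length of the shared prefix.
    shortest = min(len(ref), len(alt))
    prefix = 0
    while prefix < shortest and ref[prefix] == alt[prefix]:
        prefix += 1

    # Length of the shared suffix of whatever remains after the prefix.
    ref_rest, alt_rest = ref[prefix:], alt[prefix:]
    suffix = 0
    while (suffix < min(len(ref_rest), len(alt_rest))
           and ref_rest[-1 - suffix] == alt_rest[-1 - suffix]):
        suffix += 1

    # Pure indel: the shorter allele consists only of shared flanking bases.
    if prefix + suffix == shortest:
        return "ins" if len(alt) > len(ref) else "del"
    return "complex"


for ref, alt in [("A", "C"), ("AG", "TC"), ("A", "AC"), ("ACCG", "AG"), ("AC", "T")]:
    print(ref, alt, classify_variant(ref, alt))
```

Run on the five examples from the slide, this gives snp, mnp, ins, del and complex in that order.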

My solution

Snippy
∷Fast → snappy
∷Finds variants → SNPs
∷Australian → Skippy the bush kangaroo

Input
∷FASTQ files
:paired end, interleaved, or single-end

∷Reference
:FASTA or GenBank

∷Output folder
:self contained bundle of results

Inside the black box
∷bwa mem - no clipping needed
∷samtools - sorted, filtered BAM
∷freebayes - split / GNU parallel / merge
∷vcflib/vcftools - VCF filtering
∷perl - glue
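
The "split / GNU parallel / merge" step amounts to cutting the reference into regions and running one freebayes job per region. Below is a hedged sketch of the plumbing that only builds command lines and does not run anything. The function names, chunk size, file names and the exact options chosen are assumptions for illustration and are not Snippy's actual invocations; the flags shown (`bwa mem -t`, `samtools sort -@ -o`, `freebayes -f -p -r`) are standard options of those tools.

```python
from typing import Dict, List


def split_regions(contig_lengths: Dict[str, int], chunk: int) -> List[str]:
    """Cut each contig into chunks written as 'chrom:start-end' (0-based, end exclusive)."""
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk):
            end = min(start + chunk, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


def alignment_commands(ref: str, r1: str, r2: str, bam: str, cpus: int = 8) -> List[List[str]]:
    """argv lists for index -> align -> sort -> index; bwa mem pipes into samtools sort."""
    return [
        ["bwa", "index", ref],
        ["bwa", "mem", "-t", str(cpus), ref, r1, r2],            # writes SAM to stdout
        ["samtools", "sort", "-@", str(cpus), "-o", bam, "-"],   # reads SAM from stdin
        ["samtools", "index", bam],
    ]


def freebayes_command(ref: str, bam: str, region: str) -> List[str]:
    """One freebayes call per region; ploidy 1 because bacteria are haploid."""
    return ["freebayes", "-f", ref, "-p", "1", "-r", region, bam]


if __name__ == "__main__":
    # Hypothetical contig names and lengths.
    for region in split_regions({"chr": 250_000, "plas": 4_000}, chunk=100_000):
        print(" ".join(freebayes_command("ref.fa", "snps.bam", region)))
```

Each per-region command is independent, which is what lets a job runner such as GNU parallel spread them across cores before the per-region VCFs are merged.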

Outputs
∷Read alignments
:.bam / .bai
∷Variants
:.vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html
∷Consensus
:reference with all variants applied to it
∷Genome alignment
:reference with “-” (missing) and “N” (low depth)
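
The masking rule for the genome alignment output can be sketched as follows. This is an illustration under stated assumptions: the function name, the per-base depth list as input, and the threshold of 10 reads are all hypothetical choices, not Snippy's implementation.

```python
from typing import Sequence


def mask_by_depth(seq: str, depths: Sequence[int], min_depth: int = 10) -> str:
    """Replace uncovered bases with '-' and poorly covered bases with 'N'."""
    if len(seq) != len(depths):
        raise ValueError("sequence and depth vector must have the same length")
    masked = []
    for base, depth in zip(seq, depths):
        if depth == 0:
            masked.append("-")   # no reads at all: treated as missing
        elif depth < min_depth:
            masked.append("N")   # some reads, but too few to trust the base
        else:
            masked.append(base)
    return "".join(masked)


print(mask_by_depth("GATTACA", [30, 0, 4, 12, 0, 25, 9]))
```

Because every isolate is masked against the same reference coordinates, the resulting sequences all have the reference's length and can be stacked directly into a multiple alignment.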

TAB output
CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein DnaA
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  AATA:28 GATC:0
plas   3221    mnp      GA    CT    CT:39 GA:0      CDS    +                       ECO_p012   rep   hypothetical protein
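
The EVIDENCE column packs per-allele read counts into one field as `ALLELE:COUNT` pairs separated by spaces. A small sketch of unpacking it is given here; the helper names are made up for illustration, and the only assumption about the format is what the example rows themselves show.

```python
from typing import Dict


def parse_evidence(evidence: str) -> Dict[str, int]:
    """Turn 'T:73 G:1 C:1' into {'T': 73, 'G': 1, 'C': 1}."""
    counts = {}
    for item in evidence.split():
        allele, count = item.rsplit(":", 1)
        counts[allele] = int(count)
    return counts


def alt_fraction(evidence: str, alt: str) -> float:
    """Fraction of the counted reads that support the ALT allele."""
    counts = parse_evidence(evidence)
    total = sum(counts.values())
    return counts.get(alt, 0) / total if total else 0.0


print(parse_evidence("T:73 G:1 C:1"))
print(round(alt_fraction("T:73 G:1 C:1", "T"), 3))
```

For the tRNA SNP at position 35524, 73 of 75 reads support T, so the call is well supported despite two discordant reads.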

Phylogenomics

Phylogenetics 101
∷Choose some genes
∷Sequence each gene from each isolate
∷Align the protein sequences of each gene
∷Back-align to nucleotide space
∷Concatenate all the alignments
∷Construct a distance matrix (many ways)
∷Draw a tree (many ways)
∷Make wild inferences from little data
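
The "back-align to nucleotide space" step threads each gene's coding sequence through its aligned protein, so that every residue becomes its codon and every protein gap becomes three nucleotide gaps. A minimal sketch, assuming the coding sequence has no trailing stop codon and that the function name is hypothetical:

```python
def back_align(protein_aln: str, cds: str) -> str:
    """Project a gapped protein alignment row onto its ungapped coding sequence."""
    residues = len(protein_aln.replace("-", ""))
    if len(cds) != 3 * residues:
        raise ValueError("CDS length must be 3 x the number of residues")
    codons = []
    pos = 0
    for aa in protein_aln:
        if aa == "-":
            codons.append("---")            # one protein gap = three nucleotide gaps
        else:
            codons.append(cds[pos:pos + 3])  # next codon for this residue
            pos += 3
    return "".join(codons)


print(back_align("M-KL", "ATGAAACTG"))
```

Aligning in protein space and then back-aligning keeps gaps on codon boundaries, which a direct nucleotide alignment does not guarantee.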

Phylogenomics 101
∷Assemble each genome

∷Perform whole genome alignment
:in nucleotide space, as we don’t know what is coding
:very computationally expensive
:can’t parallelize as with individual genes

∷Continue as for phylogenetics

Whole genome alignment

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC

∷Ideally, feed this directly to a tree builder
∷Properly model gaps, codons and ambiguity
∷Hard!

Core genome SNPs

Core genome

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||

Core sites are present in all genomes.

Core SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |

Core SNPs = polymorphic sites in core genome

Unambiguous core SNPs

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
SNPs’        |   |  |          |

Allele sites

bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
SNPs’        |   |  |          |
      ata ttc ata atg
       1   2   3   4
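
The definitions built up over the last few slides (core sites, core SNPs, unambiguous core SNPs, allele sites) can be computed column by column. The sketch below uses the three-genome toy alignment from the slides; the function names are illustrative, and "unambiguous" is assumed to mean that every base in the column is one of A, C, G or T.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def columns(aln):
    """Alignment columns as tuples, one character per genome."""
    return list(zip(*aln.values()))


def core_sites(aln):
    """0-based columns where no genome has a gap."""
    return [i for i, col in enumerate(columns(aln)) if "-" not in col]


def core_snps(aln):
    """Core columns holding more than one distinct character."""
    cols = columns(aln)
    return [i for i in core_sites(aln) if len(set(cols[i])) > 1]


def unambiguous_core_snps(aln):
    """Core SNP columns made up only of A, C, G and T."""
    cols = columns(aln)
    return [i for i in core_snps(aln) if set(cols[i]) <= set("ACGT")]


def allele_alignment(aln):
    """Keep only the unambiguous core SNP columns of each genome."""
    sites = unambiguous_core_snps(aln)
    return {name: "".join(seq[i] for i in sites) for name, seq in aln.items()}


print(len(core_sites(ALIGNMENT)), "core sites")
print([i + 1 for i in core_snps(ALIGNMENT)], "core SNP positions (1-based)")
print(allele_alignment(ALIGNMENT))
```

On the toy data this finds 17 core sites, five core SNPs (one of which is dropped because bug2 has an N there), and the four-column allele alignment ATAA / TTTT / ACAG.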

Alignment ⇢ Tree

>bug1
ATAA
>bug2
TTTT
>bug3
ACAG

   +------ bug3
   |
---+--- bug1
   |
   +--------- bug2

--- 1 SNP
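
One of the "many ways" to get from an allele alignment to a tree is to start with a pairwise SNP distance matrix. The following sketch counts differing sites between each pair of the three allele strings; it is a plain Hamming distance with hypothetical function names, not the method used to draw the tree on the slide.

```python
from itertools import combinations

ALLELES = {"bug1": "ATAA", "bug2": "TTTT", "bug3": "ACAG"}


def snp_distance(a: str, b: str) -> int:
    """Number of sites at which two equal-length allele strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Pairwise SNP distances keyed by (name, name), names in sorted order."""
    return {
        (m, n): snp_distance(seqs[m], seqs[n])
        for m, n in combinations(sorted(seqs), 2)
    }


for pair, dist in distance_matrix(ALLELES).items():
    print(pair, dist)
```

The distances are 3 (bug1 vs bug2), 2 (bug1 vs bug3) and 4 (bug2 vs bug3); a distance-based tree builder such as neighbour joining would take a matrix like this as input.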

The N±1 problem

Aligning to reference
∷Why is whole genome alignment not used?
:involves genome (mis)assembly
:computationally difficult
:expensive to add or remove isolates

∷Short-cut
:choose a single reference
:align each isolate’s reads to the reference
:core, by definition, must include the reference

Read mapping considerations
∷Choice of reference

∷Too divergent?
:reads may not align well
:will get too many core genome SNPs

∷One solution
:Assemble one isolate and use as the reference

Remove taxon, different core (1)

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core1 |||   ||||||||||| ||||||||||
SNPs1        |      |      ||  |

Remove taxon, different core (2)

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core2 | |   |||||||||       ||||||
SNPs2        |   |  |       |  |

Remove taxon, different core (3)

SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
core3 | |||||||||||||       ||||||
SNPs3            |             |
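
The three "remove taxon" slides can be reproduced by recomputing the core after dropping each genome in turn. This is a small sketch on the same toy alignment; the function names are illustrative, and a SNP is counted here as any core column with more than one distinct character (so R and N count as differences, as they do in the diagrams).

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_and_snp_counts(aln):
    """Return (number of gap-free columns, number of those that are polymorphic)."""
    cols = list(zip(*aln.values()))
    core = [col for col in cols if "-" not in col]
    snps = [col for col in core if len(set(col)) > 1]
    return len(core), len(snps)


def leave_one_out(aln):
    """Recompute the core with each taxon removed in turn."""
    results = {}
    for dropped in aln:
        subset = {name: seq for name, seq in aln.items() if name != dropped}
        results[dropped] = core_and_snp_counts(subset)
    return results


print("all taxa:", core_and_snp_counts(ALIGNMENT))
for dropped, counts in leave_one_out(ALIGNMENT).items():
    print("without", dropped, counts)
```

With all three genomes the core is 17 sites with 5 SNPs. Dropping bug3 grows the core to 24 sites (5 SNPs), dropping bug1 leaves it at 17 sites (5 SNPs), and dropping bug2 gives 20 sites but only 2 SNPs, which matches the core1, core2 and core3 rows respectively.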

Core genome alignments
∷Core SNP alignments
:can shift dramatically with taxa content
:we are only using globally conserved sites
:remember variation still exists outside “core”

∷Snippy will keep the full alignments
:quickly derive subsets on the fly
:adding isolates can be done quickly too

Conclusion

Snippy summary
∷The good
:Fast, scales to 100 cores
:Simple, clean interface and output

∷The bad
:Doesn’t yet report full variant consequences using snpEff

∷The ugly?
:Written in Perl

Contact
∷tseemann.github.io

∷github.com/tseemann/snippy

∷@torstenseemann