Snippy - Rapid bacterial variant calling - UK - Tue 5 May 2015
torstenseemann
May 23, 2015
About This Presentation
Using Snippy to call variants in bacterial short read datasets via alignment to reference, and then using these alignments to produce core SNP alignments for phylogenomics.
Added: May 23, 2015
Slide Content
Snippy
Torsten Seemann
Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015
Rapid bacterial variant calling
& core genome alignments
Background
(Far) south east England
Phyloflagomics
UK / Birmingham
Australia / Victoria
Canada / British Columbia
A new home
Centre for Applied Microbial Genomics
Microbiological Diagnostic Unit
∷Oldest public health lab in Australia
:established 1897 in Melbourne
:large historical isolate collection back to 1950s
∷National reference laboratory
:Salmonella, Listeria, EHEC
∷WHO regional reference lab
:vaccine preventable invasive bacterial pathogens
New director
∷Professor Ben Howden
:clinician, microbiologist, pathologist
:early adopter of genomics and bioinformatics
:long term collaborator on MRSA/VRE w/ Tim Stinear
∷Mandate
:modernise service delivery
:enhance research output and collaboration
:nationally lead the conversion to WGS
Hardware
∷Sequencers
:NextSeq 500
:3 x MiSeq
:PacBio RS II (arriving 22 May)
∷Robots
:Perkin Elmer (does not have a Twitter account)
:Colony picker
∷Compute
:240 TB, 10 GigE, 3 x 72 core boxes
Variant calling
Variant calling
∷Find DNA differences between genomes
:variants to explain phenotype
:validate your complemented mutant
∷Two approaches
:reference based (read alignment)
:reference-free (de novo assembly / k-mer based)
Snippy
∷Fast → snappy
∷Finds variants → SNPs
∷Australian → Skippy the bush kangaroo
Input
∷FASTQ files
:paired end, interleaved, or single-end
∷Reference
:FASTA or Genbank
∷Output folder
:self contained bundle of results
Inside the black box
∷bwa mem - no clipping needed
∷samtools - sorted, filtered BAM
∷freebayes - split / GNU parallel / merge
∷vcflib/vcftools - VCF filtering
∷perl - glue
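The "split / parallel / merge" pattern listed for the variant-calling step can be sketched as follows. This is a minimal illustration, not Snippy's actual code: `split_regions`, `call_region` and `call_in_parallel` are hypothetical names, the caller is a stub that returns no variants, and Python threads stand in for GNU parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def split_regions(contig_lengths, chunk_size):
    """Split each contig into 1-based, inclusive (name, start, end) windows."""
    regions = []
    for name, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            regions.append((name, start, min(start + chunk_size - 1, length)))
    return regions


def call_region(region):
    """Hypothetical stand-in for running a variant caller on one region.

    A real pipeline would invoke the caller on this window; the stub
    simply reports no variants.
    """
    return []


def call_in_parallel(contig_lengths, chunk_size, caller=call_region, workers=4):
    """Call variants per region in parallel, then merge the results."""
    regions = split_regions(contig_lengths, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_region = list(pool.map(caller, regions))
    # Merge: flatten the per-region lists and sort by (contig, position, ...)
    merged = [variant for chunk in per_region for variant in chunk]
    return sorted(merged)
```

Variants are assumed to be tuples beginning with (contig, position), so a plain sort restores genome order after the merge.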
Outputs
∷Read alignments
:.bam / .bai
∷Variants
:.vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html
∷Consensus
:reference with all variants applied to it
∷Genome alignment
:reference with “-” (missing) and “N” (low depth)
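As a rough illustration of the genome alignment output described above, the sketch below builds one isolate's aligned sequence from a reference, per-base read depth, and a set of SNPs. The function name, the `min_depth` threshold of 10 and the input structures are assumptions made for this example, and only SNPs are handled so the result stays the same length as the reference.

```python
def aligned_sequence(reference, depth, snps, min_depth=10):
    """Return the reference with SNPs applied and poorly covered sites masked.

    reference : reference sequence as a string
    depth     : read depth at each reference position (same length as reference)
    snps      : dict mapping 0-based position -> alternate base
    min_depth : assumed threshold below which a base is reported as "N"
    """
    out = []
    for i, ref_base in enumerate(reference):
        if depth[i] == 0:
            out.append("-")          # no reads at all: missing
        elif depth[i] < min_depth:
            out.append("N")          # some reads, but too few to trust
        else:
            out.append(snps.get(i, ref_base))
    return "".join(out)


print(aligned_sequence("ACGTAC", [20, 20, 0, 5, 20, 20], {1: "T"}))
# prints AT-NAC
```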
TAB output
CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein DnaA
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  GATC:28 AATA:0
plas   3221    mnp      GA    CT    CT:39 GA:0      CDS    +                       ECO_p012   rep   hypothetical protein
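The EVIDENCE column packs per-allele read counts into space-separated `allele:count` tokens. A small sketch of how one might unpack it, assuming exactly that format (the helper names are made up for this example):

```python
def parse_evidence(field):
    """Turn an EVIDENCE string such as 'T:73 G:1 C:1' into {allele: count}."""
    counts = {}
    for token in field.split():
        allele, count = token.rsplit(":", 1)
        counts[allele] = int(count)
    return counts


def alt_fraction(alt, field):
    """Fraction of the counted reads that support the ALT allele."""
    counts = parse_evidence(field)
    total = sum(counts.values())
    return counts.get(alt, 0) / total if total else 0.0


print(parse_evidence("T:73 G:1 C:1"))
print(alt_fraction("ATTT", "ATTT:43 ATT:1"))
```

A fraction close to 1.0 means nearly all reads at the site agree with the called allele.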
Phylogenomics
Phylogenetics 101
∷Choose some genes
∷Sequence each gene from each isolate
∷Align the protein sequences of each gene
∷Back-align to nucleotide space
∷Concatenate all the alignments
∷Construct a distance matrix (many ways)
∷Draw a tree (many ways)
∷Make wild inferences from little data
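Of the many ways to construct a distance matrix, one of the simplest is to count pairwise differences between aligned sequences. The sketch below assumes an uppercase nucleotide alignment held in a dict and skips any column where either sequence has a gap or ambiguous base; the function names are illustrative only.

```python
def snp_distance(a, b):
    """Count positions where two aligned sequences differ, ignoring non-ACGT sites."""
    valid = set("ACGT")
    return sum(1 for x, y in zip(a, b) if x in valid and y in valid and x != y)


def distance_matrix(alignment):
    """Build a symmetric {name: {name: distance}} matrix from {name: sequence}."""
    names = list(alignment)
    return {p: {q: snp_distance(alignment[p], alignment[q]) for q in names}
            for p in names}


matrix = distance_matrix({"a": "ACGT", "b": "ACGA", "c": "TCNA"})
print(matrix["a"]["c"])
# prints 2
```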
Phylogenomics 101
∷Assemble each genome
∷Perform whole genome alignment
:in nucleotide space, as we don’t know what is coding
:very computationally expensive
:can’t parallelize as with individual genes
Aligning to reference
∷Why is whole genome alignment not used?
:involves genome (mis)assembly
:computationally difficult
:expensive to add or remove isolates
∷Short-cut
:choose a single reference
:align each isolate’s reads to the reference
:core, by definition, must include the reference
Read mapping considerations
∷Choice of reference
∷Too divergent?
:reads may not align well
:will get too many core genome SNPs
∷One solution
:Assemble one isolate and use as the reference
Core genome alignments
∷Core SNP alignments
:can shift dramatically with taxa content
:we are only using globally conserved sites
:remember variation still exists outside “core”
∷Snippy will keep the full alignments
:quickly derive subsets on the fly
:adding isolates can be done quickly too
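To make the point about taxa content concrete, here is a minimal sketch of deriving a core SNP alignment from full per-isolate alignments. It assumes uppercase sequences of equal length, treats a site as core only if every selected taxon has A, C, G or T there, and keeps only variable core sites. The function name and data layout are assumptions for this example, not Snippy's implementation.

```python
def core_snp_alignment(alignment, taxa=None):
    """Keep columns that are A/C/G/T in every chosen taxon and not invariant."""
    taxa = list(alignment) if taxa is None else list(taxa)
    seqs = [alignment[t] for t in taxa]
    valid = set("ACGT")
    keep = []
    for i, column in enumerate(zip(*seqs)):
        bases = set(column)
        if bases <= valid and len(bases) > 1:
            keep.append(i)
    return {t: "".join(alignment[t][i] for i in keep) for t in taxa}


full = {
    "ref":  "ACGTAC",
    "iso1": "ATGTAC",
    "iso2": "ATG-AT",
    "iso3": "ACGNAC",
}
print(core_snp_alignment(full))
# prints {'ref': 'CC', 'iso1': 'TC', 'iso2': 'TT', 'iso3': 'CC'}
print(core_snp_alignment(full, taxa=["ref", "iso1", "iso3"]))
# prints {'ref': 'C', 'iso1': 'T', 'iso3': 'C'}
```

Dropping `iso2` removes a core SNP site, because the last column is no longer variable among the remaining taxa. Keeping the full alignments means such subsets can be recomputed without realigning any reads.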
Conclusion
Snippy summary
∷The good
:Fast, scales to 100 cores
:Simple, clean interface and output
∷The bad
:Doesn’t yet report full variant consequences via snpEff