Metagenomic analysis

ANIMESH911 4,471 views 53 slides Aug 11, 2017


Animesh kumar
M. Sc. (Bioinformatics)
Roll no. 20406
IASRI, New Delhi
November 15th, 2014
METAGENOMIC ANALYSIS: An Overview
1

Overview
❶ Introduction
❷ Metagenomic data collection
●Sampling and processing
●Sequencing technology
❸ Bioinformatics approaches
●Assembly
●Binning
●Annotation
●Experimental design and statistical analysis
❹ Sharing and storage of data
❺ Data analysis using MEGAN
❻ Application
❼ Conclusion
❽ References
2

●Metagenomics (also Environmental Genomics, Ecogenomics or
Community Genomics) is the study of genetic material recovered
directly from environmental samples.
●The term "metagenomics" was first used by Jo Handelsman in 1998.
Introduction
●Current definition:
“The application of modern genomics techniques to the
study of communities of microbial organisms directly
in their natural environments, bypassing the need for
isolation and lab cultivation of individual species.”
Chen, K., Pachter, L. (2005).
3/54

•Majority of microorganisms have not been cultivated in the laboratory
•Metagenomics entails extraction of DNA from a community so that all
of the genomes of organisms in the community are pooled
•A genome is the entire genetic information of a single organism
•A metagenome is the entire genetic information of an assemblage of
organisms
4/54
Introduction conti..

Why Do METAGENOMICS?
•Understanding metabolism
•Defining the minimal gene set
•Genome engineering
•Understanding cell structure & function
•Understanding host interactions
•Understanding protein-protein interactions
•Understanding expression (RNA/protein)
•Discover DNA variation, genotyping
•Forensics
•Drug/vaccine development
5/54

•A typical sequence-based metagenome project goes through the following steps:
Experimental design → Sampling → Sample fractionation → DNA extraction → DNA sequencing → Assembly → Binning → Annotation → Statistical analysis → Data storage → Data sharing
Figure 1. Flow diagram of a typical metagenomic project.
(Dashed arrows indicate steps that can be omitted)
6/54

Sampling and Processing
•The DNA extracted should represent all cells present in the sample
•Sufficient amounts of high-quality nucleic acids must be obtained for
subsequent library production and sequencing
•If the target community is associated with a host (e.g. an invertebrate
or plant), fractionation or selective lysis is done to ensure that minimal
host DNA is obtained (important when the host genome is large)
•When only a certain part of the community is the target of analysis (e.g.
viruses in seawater samples), physical separation is applied: a range of
selective filtration or centrifugation steps, or flow cytometry, is used to
enrich the target fraction
Metagenomic data collection
7/54

Sampling and Processing
•Certain types of samples often yield only very small amounts of
DNA (such as biopsies or groundwater)
•Amplification of starting material might be required
•Multiple displacement amplification (MDA) using random hexamers
and phage phi29 polymerase is used to increase DNA yields.
•amplify femtograms of DNA to produce micrograms of product
•Problems associated with amplification method are
•reagent contaminations
•chimera formation
•sequence bias in the amplification
Metagenomic data collection
8/54

NGS Technology
Metagenomic data collection
9/54

Sequences need quality filtering
Removal of eukaryotic genomic DNA sequences
•Eu-Detect
(http://metagenomics.atc.tcs.com/Eu-Detect/)
•DeConseq
(http://deconseq.sourceforge.net/)
Quality plots and read trimming
•FastQC
(http://www.bioinformatics.babraham.ac.uk)
•FASTX
(http://hannonlab.cshl.edu/fastx_toolkit/)
Bioinformatics approaches
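As a minimal sketch of what read trimming does (this is not how FastQC or the FASTX-Toolkit are implemented), the Python below cuts low-quality bases from the 3' end of a read. It assumes Phred+33 quality encoding and an illustrative threshold of Q20; the function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(sequence, quality_line, min_quality=20):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    scores = phred_scores(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]


# Toy read: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIII##")
```

Real trimmers typically use sliding windows and minimum-length filters as well; the single 3' scan here is only the simplest variant.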
10/54

Bioinformatics approaches
•Assembly- putting sequence fragments of DNA into their
correct chromosomal position
•Binning- Binning is the process of grouping reads or contigs
and assigning them to operational taxonomic units (OTUs)
•Annotation- annotation is the process of marking the genes and
other biological features in a DNA sequence
11

Metagenomic data are
•Highly non-redundant
•Short in read length, so error prone
•Prone to mis-assembly of contigs due to repetitive DNA
•Prone to chimera formation
Two strategies can be employed for metagenomic samples
i.Reference-based assembly (co-assembly)
ii.De novo assembly
Bioinformatics approaches
12

Reference-based assembly (co-assembly)
•Metagenomic dataset contains sequences closely related to
reference genomes
•Large insertion, deletion, or polymorphisms can mean that the
assembly is fragmented or divergent regions are not covered
•Software packages used are Newbler (Roche), MIRA or AMOS
(http://sourceforge.net/projects/amos/ )
Assembly
13/54

De novo assembly
•Requires larger computational resources and time
•Tools based on de Bruijn graphs were specifically created to
handle very large amounts of data
•De Bruijn-type assemblers include MetaVelvet, Meta-IDBA, Velvet and SOAP
•They are used to identify, within the entire de Bruijn graph, a subgraph that
represents related genomes
•These subgraphs are then resolved to build a consensus
sequence of the genomes
Assembly
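To make the de Bruijn graph idea concrete, here is a minimal Python sketch that builds such a graph from a handful of reads. It is an illustration only, not the data structure used by Velvet, MetaVelvet, Meta-IDBA or SOAP; the function name, the toy reads and the choice k = 3 are assumptions, and sequencing errors and reverse complements are ignored.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


# Two overlapping toy reads share the k-mer CGT, so they join into one path.
toy_graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
```

Walking the path AC → CG → GT → TA spells out the merged sequence ACGTA, which is the sense in which an assembler resolves a subgraph into a consensus sequence.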
14/54

Factors to consider during assembly
Length of sequencing reads
•The longer the sequence information, the better the ability to
obtain accurate information.
•MG-RAST requires only 75 bp or longer for gene prediction or
similarity analysis
•IMG/M prefers assembled contigs
Number of sequencing reads
•Single reads generally have lower quality and hence lower
confidence in accuracy
•Merging reads increases the quality of information
Assembly
15/54

•Binning is the process of sorting DNA sequences (reads or contigs) into groups
and assigning them to operational taxonomic units (OTUs)
Developed algorithms
i.Compositional-based binning algorithms: genomes have
conserved nucleotide composition
ii.Similarity-based binning algorithms: the similarity of an unknown
gene to known genes in a reference database can be used
to classify it
Binning
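The compositional idea can be sketched in a few lines of Python: describe each sequence by its k-mer frequencies and compare the resulting profiles. This is a hedged illustration, not the method of any specific binning tool; the choice of tetranucleotides (k = 4), the Euclidean distance, the function names and the toy sequences are all assumptions, and reverse complements are not merged.

```python
from collections import Counter
import math


def kmer_profile(sequence, k=4):
    """Relative frequency of every k-mer observed in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


# Hypothetical fragments: two AT-rich, one GC-rich.
at_rich = kmer_profile("ATATATATTTAAATAT")
gc_rich = kmer_profile("GCGCGGCCGCGCCGGC")
query = kmer_profile("ATTTATATAAATATTA")
closer_to_at = profile_distance(query, at_rich) < profile_distance(query, gc_rich)
```

A compositional binner would place the query with the AT-rich fragment because their profiles are closer; a similarity-based binner would instead search the query against a reference database.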
16/54

●Compositional-based binning software includes PhyloPythia, S-GSOM,
PCAHIER, and TACOA
●Similarity-based binning software includes IMG/M, MG-RAST,
MEGAN, CARMA and MetaPhyler
●Software combining composition- and similarity-based binning includes
PhymmBL and MetaCluster
Binning
17/54

Wu and Ye (2011) developed a novel approach (AbundanceBin) for
metagenomic binning that utilizes the different abundances of
species living in the same environment.
The fundamental assumption of this method is that the sampling of reads
from a genome follows a Poisson distribution
(Lander and Waterman, 1988).
For metagenomics, the sequencing reads are considered a mixture
of Poisson distributions.
AbundanceBin
18/54
Binning

 In random shotgun sequencing of a genome,
probability of a read starting from a certain position = N/(G - L + 1),
where N = number of reads,
G = genome size,
L = read length
So, N/(G - L + 1) ≈ N/G (as G >> L)
[Figure: shotgun coverage of a DNA sequence of length G by clones of equal length L, with an l-mer w marked]
19/54
Binning
AbundanceBin

Assume x is a read and an l-tuple w belongs to x.
The number of occurrences of w in the set of reads follows a Poisson
distribution with parameter λ, where L is the read length:
λ = N(L - l + 1)/(G - L + 1) ≈ NL/G (as G >> L and L >> l)
The same holds for a metagenome dataset (G = total length of the genomic sequences).
Here, reads come from species with different abundances.
If the abundance of a species i is n, the total number of occurrences
of w in the whole set of reads coming from this species should
follow a Poisson distribution with parameter $\lambda_i = n\lambda$.
The l-tuple counts of the whole dataset therefore follow a mixture of Poisson distributions.
20/54
Binning
AbundanceBin
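A short numeric sketch may help here. The Python below evaluates the Poisson parameter for made-up values (N = 10,000 reads, L = 1,000 bp, l = 20, G = 1,000,000 bp; none of these come from the source) and defines the probability of an l-tuple count under a mixture of Poisson distributions. The function names and the equal mixture weights are assumptions.

```python
import math


def tuple_rate(num_reads, read_len, tuple_len, genome_len):
    """Exact Poisson parameter: N (L - l + 1) / (G - L + 1)."""
    return num_reads * (read_len - tuple_len + 1) / (genome_len - read_len + 1)


def poisson_pmf(count, rate):
    """Poisson probability, computed in log space to avoid overflow."""
    return math.exp(count * math.log(rate) - rate - math.lgamma(count + 1))


def mixture_pmf(count, rates, weights):
    """Probability of a count under a weighted mixture of Poisson distributions."""
    return sum(w * poisson_pmf(count, r) for r, w in zip(rates, weights))


base_rate = tuple_rate(10_000, 1_000, 20, 1_000_000)  # about 9.82
approx_rate = 10_000 * 1_000 / 1_000_000               # NL/G = 10
# Two species, the second four times as abundant as the first.
rates = [base_rate, 4 * base_rate]
p_count_10 = mixture_pmf(10, rates, [0.5, 0.5])
```

With these numbers the exact parameter is about 9.82 against the approximation NL/G = 10, which shows how good the approximation is when L is much larger than l.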

For a set of metagenomic sequences, count l-tuples in all reads:
$x = \{n(w_i)\}$, $i \in [1, W]$
•where $n(w_i)$ is the observed count of l-tuple $w_i$ and
•$W$ is the total number of possible l-tuples.
21/54
Figure 2. (a) Illustration of AbundanceBin pipeline. (b) The recursive binning approach used
to automatically determine the number of bins. (Yu-Wei Wu and Yuzhen Ye (2011))
Binning
AbundanceBin (Algorithm)

$S$ = total number of bins.
$g = \{g_i\}$ and $\lambda = \{\lambda_i\}$, $i \in [1, S]$
•where $g_i$ and $\lambda_i$ are the (collective) genome size and abundance level of bin $i$
$\theta = \{S, g, \lambda\}$
Objective: to optimize the log of the joint probability using the EM algorithm
22/54
Binning
AbundanceBin (Algorithm)

The EM steps are as follows
1. Initialize $S$, $g_i$ and $\lambda_i$ for $i = 1, 2, \ldots, S$.
2. Calculate the probability that the l-tuple $w_j$ ($j = 1, 2, \ldots, W$; $W$ is the
total number of possible l-tuples) comes from the $i$-th species given its
count $n(w_j)$:
$$P(w_j \in s_i \mid n(w_j)) = \frac{g_i}{\sum_{m=1}^{S} g_m \left( \frac{\lambda_m}{\lambda_i} \right)^{n(w_j)} e^{\lambda_i - \lambda_m}}$$
23/54
Binning
AbundanceBin

3. Similarly, $g_i$ and $\lambda_i$ are calculated as
$$g_i = \sum_{j=1}^{W} P(w_j \in s_i \mid n(w_j)), \qquad \lambda_i = \frac{\sum_{j=1}^{W} n(w_j)\, P(w_j \in s_i \mid n(w_j))}{g_i}$$
4. Iterate steps 2 and 3 until the parameters converge or the number of
runs exceeds a maximum number of runs.
The convergence of parameters is defined as
$$\left\{ \forall i,\ \frac{\left| \lambda_i^{(t+1)} - \lambda_i^{(t)} \right|}{\lambda_i^{(t)}} < 10^{-5} \right\} \quad \text{and} \quad \left\{ \forall i,\ \frac{\left| g_i^{(t+1)} - g_i^{(t)} \right|}{g_i^{(t)}} < 10^{-5} \right\}$$
where $\lambda_i^{(t)}$ and $g_i^{(t)}$ represent the abundance level and genome
length of bin $i$ at iteration $t$, respectively.
24/54
Binning
AbundanceBin
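The EM steps 1 to 4 can be sketched in Python as follows. This is an illustrative reading of the slides, not the AbundanceBin source code: the function name, the initial values, the toy counts, the log-space arithmetic and the relative-change convergence test with threshold 1e-5 are all assumptions.

```python
import math


def em_abundance_bins(counts, init_lambdas, init_sizes, max_runs=100, tol=1e-5):
    """Fit bin abundance levels (lambdas) and genome sizes to l-tuple counts by EM."""
    lambdas, sizes = list(init_lambdas), list(init_sizes)
    n_bins = len(lambdas)
    posteriors = []
    for _ in range(max_runs):
        # Step 2: P(w_j in s_i | n(w_j)), proportional to g_i * Poisson(n(w_j); lambda_i).
        posteriors = []
        for n in counts:
            logs = [math.log(sizes[i]) + n * math.log(lambdas[i]) - lambdas[i]
                    for i in range(n_bins)]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # Step 3: re-estimate g_i and lambda_i (floored to avoid division by zero).
        new_sizes = [max(sum(p[i] for p in posteriors), 1e-12) for i in range(n_bins)]
        new_lambdas = [
            max(sum(n * p[i] for n, p in zip(counts, posteriors)) / new_sizes[i], 1e-12)
            for i in range(n_bins)
        ]
        # Step 4: stop once every parameter changes by less than tol (relative).
        converged = all(
            abs(new_lambdas[i] - lambdas[i]) / lambdas[i] < tol
            and abs(new_sizes[i] - sizes[i]) / sizes[i] < tol
            for i in range(n_bins)
        )
        lambdas, sizes = new_lambdas, new_sizes
        if converged:
            break
    return lambdas, sizes, posteriors


# Toy data: 50 l-tuples seen twice (rare species) and 50 seen 20 times (abundant species).
toy_counts = [2] * 50 + [20] * 50
fit_lambdas, fit_sizes, fit_posteriors = em_abundance_bins(
    toy_counts, init_lambdas=[1.0, 10.0], init_sizes=[50.0, 50.0]
)
```

On this toy input the estimates settle near abundance levels 2 and 20 with about 50 l-tuples per bin. The number of bins is fixed at two here, whereas the slides describe a recursive approach for choosing it.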

•The probability of a read being assigned to a bin is calculated from the binning
results of its l-tuples as
$$P(r_k \in s_i) = \frac{\prod_{w_j \in r_k} P(w_j \in s_i \mid n(w_j))}{\sum_{s_m \in S} \prod_{w_j \in r_k} P(w_j \in s_m \mid n(w_j))}$$
where $r_k$ is a given read, $w_j$ are the l-tuples that belong to $r_k$, and $s_i$ is any
bin.
A read is assigned to the bin with the highest probability among
all bins.
25/54
Binning
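The read-assignment rule can be sketched in Python as below. The per-tuple probabilities are invented for illustration, the function name is hypothetical, and the products are accumulated as sums of logarithms (with a small floor) purely for numerical stability.

```python
import math


def assign_read(read_tuples, tuple_posteriors):
    """Return (best bin index, per-bin probabilities) for one read.

    read_tuples: the l-tuples w_j contained in the read r_k.
    tuple_posteriors: dict mapping each l-tuple to [P(w_j in s_1), P(w_j in s_2), ...].
    """
    n_bins = len(next(iter(tuple_posteriors.values())))
    log_scores = [0.0] * n_bins
    for w in read_tuples:
        for i, p in enumerate(tuple_posteriors[w]):
            log_scores[i] += math.log(max(p, 1e-300))  # product as a sum of logs
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    return probabilities.index(max(probabilities)), probabilities


# Made-up posteriors for three l-tuples and two bins.
toy_posteriors = {"ACG": [0.9, 0.1], "CGT": [0.8, 0.2], "GTA": [0.3, 0.7]}
best_bin, bin_probs = assign_read(["ACG", "CGT", "GTA"], toy_posteriors)
```

Here the products are 0.216 for the first bin and 0.014 for the second, so the read goes to the first bin with probability 0.216/0.230.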

•Given a metagenomic dataset, AbundanceBin is first used to classify
reads into different bins (abundance bins)
•Then MetaCluster is used to further classify the reads in each abundance
bin into species bins, each containing reads sampled from one species
•Such a two-step approach may achieve higher binning accuracy than
using composition-based methods alone
E.g. binning of the acid mine drainage (AMD) dataset:
it consists of two species of high abundance and three other less
abundant species (Tyson et al., 2004).
Binning
26/54
Binning

•Because this environment contains two different abundance levels,
the algorithm is expected to classify the AMD dataset into
two bins.
•Wu and Ye (2011) applied AbundanceBin to reads from the actual
AMD dataset (downloaded from NCBI trace archive;
13696_environmental_sequence.007).
•AbundanceBin successfully classified these reads into exactly two
bins (one of high abundance and one of low abundance) using the
recursive binning approach.
Binning
27/54
Binning

•Fig. The recursive binning of a read dataset into six bins of different
abundances. Each box represents a bin, with the numbers indicating the
abundance of the reads classified to that bin (e.g., the bin on the top
has all the reads, which will be divided into two bins, one with reads
of abundances 1.5, 4, 8, and 16, and the other bin with reads of
abundances 32 and 64).
Binning
28/54
1.5+4+8+16+32+64
1.5+4+8+16 | 32+64
1.5+4 | 8+16 | 32 | 64
1.5 | 4 | 8 | 16
Binning

Annotation
●Genome annotation is the process of attaching biological information
to sequences
●Tools designed to handle metagenomic prediction of coding sequences (CDS)
•FragGeneScan (FGS)
•MetaGeneMark
•MetaGeneAnnotator (MGA)/ Metagene
•Orphelia
•All these tools use internal information (e.g. codon usage) to
classify sequence stretches as coding or non-coding.
29/54
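As a small illustration of the kind of internal information these gene finders rely on, the Python below tabulates codon usage for a sequence read in frame. It is not the model used by FragGeneScan, MetaGeneMark, MGA or Orphelia; the function name and the toy sequence are assumptions, and only a single forward reading frame is considered.

```python
from collections import Counter


def codon_usage(sequence, frame=0):
    """Relative frequency of each codon when reading the sequence in the given frame."""
    codons = [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: c / total for codon, c in counts.items()}


# Toy open reading frame: ATG GCC ATG TAA
toy_usage = codon_usage("ATGGCCATGTAA")
```

A gene predictor compares statistics like these against what is typical of coding and non-coding DNA; the actual tools use trained probabilistic models rather than a raw frequency table.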

•Metagenomic datasets are typically very large, so manual
annotation is not possible
•Many reference databases are available to give functional context
to metagenomic datasets, such as KEGG, eggNOG, COG/KOG,
PFAM, and TIGRFAM
•MG-RAST and IMG/M merge the interpretations of all database
searches in a single framework
Annotation
30/54

•MG-RAST, IMG/M and CAMERA are three prominent large-scale
databases that process and deposit metagenomic datasets
•MG-RAST is a data repository whose web interface allows comparison of
datasets using a number of statistical techniques
•IMG/M and MG-RAST provide the ability to use stored computational
results for comparison and reanalysis
• CAMERA integrates a growing list of tools and viewers for querying,
analyzing, annotating and comparing metagenome and genomic data
Annotation
31/54

•Metagenomic data, however, often contain many more species or gene
functions than the number of samples taken
•Microbial systems are highly dynamic, so temporal aspects of
sampling can have a substantial impact on data analysis and
interpretation
•Taking multiple samples and then pooling them will lose all
information on variability and hence will be of little use for statistical
purposes
Experimental Design and Statistical Analysis
32/54

Experimental Design and Statistical Analysis
•Metagenomic shotgun-sequencing projects are costly, so they are often not
replicated
•These design and statistical aspects are often not properly
implemented in the field of microbial ecology
•In multiple metagenomic shotgun-sequencing projects,
data can be reduced to tables in which
•columns represent samples
•rows indicate either a taxonomic group or a gene function (or
groups thereof)
•fields contain abundance or presence/absence data
33/54
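The table layout described above (rows for taxa or functions, columns for samples, fields for abundances) can be sketched in Python. The sample names and per-read labels are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def abundance_table(sample_assignments):
    """Build a rows-by-samples count table from per-read labels.

    sample_assignments: dict mapping sample name -> list of taxon or function labels,
    one label per read.
    """
    counts = {sample: Counter(labels) for sample, labels in sample_assignments.items()}
    samples = sorted(counts)
    rows = sorted({label for c in counts.values() for label in c})
    table = {row: [counts[s][row] for s in samples] for row in rows}
    return samples, table


toy_assignments = {
    "sample_A": ["Bacteria", "Bacteria", "Archaea"],
    "sample_B": ["Archaea", "Viruses"],
}
columns, table = abundance_table(toy_assignments)
# Presence/absence version of the same table.
presence = {row: [int(v > 0) for v in values] for row, values in table.items()}
```

Each row of `table` lists the counts in column order, so `table["Bacteria"]` holds the number of reads assigned to Bacteria in sample_A and sample_B respectively.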

•Appropriate corrections for multiple hypothesis testing have to be
implemented if necessary (e.g. Bonferroni correction for t-test based analyses)
•The Primer-E package and the ShotgunFunctionalizeR R package
provide several statistical procedures for assessing functional
differences between samples
•Recently, multivariate statistics were also incorporated into a
web-based tool called Metastats
Experimental Design and Statistical Analysis
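The Bonferroni correction mentioned above is simple enough to show directly: with m tests, each p-value is compared against alpha/m (equivalently, multiplied by m and capped at 1). The Python below is a generic sketch with made-up p-values; it does not reproduce Primer-E, ShotgunFunctionalizeR or Metastats.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and which hypotheses are rejected."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    rejected = [p <= alpha / m for p in p_values]
    return adjusted, rejected


# Hypothetical p-values from testing three gene functions between two sample groups.
adjusted_p, is_significant = bonferroni([0.01, 0.04, 0.001])
```

With three tests the per-test threshold becomes 0.05/3, so the p-value of 0.04 is no longer significant although it would be on its own.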
34/54

Sharing and Storage of Data
•Sharing of both data and computational results helps users to access
metadata and centralized services (e.g., IMG/M, CAMERA and MG-RAST)
•The US National Center for Biotechnology Information (NCBI) is
mandated to store all metagenomic data
•The cost of sequencing continues to drop, while the cost of
analysis and storage remains more or less constant
•Storage of data in either biological (i.e. the sample that was
sequenced) or digital form in (de-)centralized archives is required
35/54

•MEGAN is another tool used for visualizing
annotation results derived from BLAST
searches in a functional or taxonomic
dendrogram
•The dendrogram helps interpretation by
making the analysis of particular
functional or taxonomic
groups visually easy
http://ab.inf.uni-tuebingen.de/software/megan5/
36/54
Analysis using MEGAN

How does MEGAN work?
Firstly, reads are collected from the sample using any random shotgun
protocol.
Secondly, a sequence comparison of all reads against one or more
databases of known sequences is performed, using BLAST or a similar
comparison tool.
Thirdly, MEGAN processes the results of the comparison to collect
all hits of reads against known sequences and assigns a taxon ID to each
sequence based on the NCBI taxonomy.
Analysis using MEGAN
37/54

•This produces a MEGAN file that contains all information needed
for analyzing and generating graphical and statistical output.
•Fourthly, the user interacts with the program to run the lowest
common ancestor (LCA) algorithm
•to analyze the data,
•to inspect the assignment of individual reads to taxa based on
their hits, and
•to produce summaries of the results at different levels of the
NCBI taxonomy
Analysis using MEGAN
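The core of the lowest common ancestor (LCA) idea can be sketched in Python: a read whose hits fall on several taxa is placed at the deepest node shared by all of them. This is a naive illustration, not MEGAN's implementation, which also applies score and support thresholds; the simplified taxonomy, the function names and the example hits are assumptions.

```python
def lineage(taxon, parent):
    """Path from the root down to the taxon, following child -> parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    lca = None
    for level in zip(*(lineage(t, parent) for t in taxa)):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# Simplified toy taxonomy (child -> parent), for illustration only.
toy_parent = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}
narrow = lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"], toy_parent)
broad = lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica", "Bacillus subtilis"], toy_parent
)
```

A read hitting only the two proteobacterial species is assigned to Proteobacteria, while adding a hit to a Firmicutes species pushes the assignment up to Bacteria, which is why conserved genes end up at higher taxonomic levels.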
38/54

Analysis using MEGAN
•Huson et al. (2011) compared a number of different data sets from
two published marine studies
•A metagenome (called DNA-Time1-Bag1, with 209,073 reads) and a
metatranscriptome (called cDNA-Time1-Bag1, with 131,089 reads)
from Gilbert et al. (2008)
•A 16S rRNA data set (849 reads) and a metaproteome (8073
sequences) from Morris et al. (2010)
39/54

•The metagenome, metatranscriptome, and metaproteome data sets
were blasted against NCBI-NR, whereas the 16S rRNA data set was
blasted against the SILVA database
•In addition, they imported the result of the analysis of the
metaproteome data set that was presented in Morris et al. 2010
•All five data sets were processed by MEGAN4, and the resulting
taxonomic analysis is shown in Figure M1
Analysis using MEGAN
40/54

Figure M1. MEGAN4 integrative taxonomic analysis
Taxonomic analysis using MEGAN
Analysis using MEGAN
41/54

Figure M2.a MEGAN4’s integrative functional analysis (using SEED)
Analysis using MEGAN
Functional analysis using SEED classification
42/54

Figure M2.b The classification tree has been partially expanded to show some details of the
subsystems below the Carbohydrates node.
Analysis using MEGAN
Functional analysis using SEED classification
43/54

Figure M3. A MEGAN4 integrative functional analysis (using KEGG)
•The classification tree has been expanded down to the second level of
the KEGG classification.
Analysis using MEGAN
Functional analysis using KEGG classification
44/54

Figure M4. A MEGAN4 integrative functional analysis (using KEGG)
Analysis using MEGAN
Functional analysis using KEGG classification
45/54

Community metabolism
•Division of labour in metabolism (Syntrophy) is seen
•Waste products of some organisms are metabolites for others. eg.
methanogenic bioreactor
Metatranscriptomics
•Metagenomics helps to access the functional and metabolic diversity of microbial
communities, but it cannot show which of these processes are active.
•Metatranscriptomics provides information on the regulation and expression profiles of
complex communities
Application
46/54

Viruses
•Viruses lack a shared universal phylogenetic marker (such as 16S rRNA
for bacteria and archaea, and 18S rRNA for eukarya)
•The only way to access the genetic diversity of the viral community
from an environmental sample is therefore through metagenomics
•Viral metagenomes (also called viromes) provide more and more
information about viral diversity and evolution
Application
47/54

Conclusion
The potential for application of metagenomics to human benefit
seems endless
The scientific community should aim to share, compare, and
critically evaluate the outcomes of metagenomic studies
As datasets become increasingly more complex and comprehensive,
novel tools for analysis, storage, and visualization will be required
These will ensure the best use of metagenomics as a tool to
address fundamental questions of microbial ecology, evolution and
diversity and to derive and test new hypotheses
48/54

Conclusion
With improved methods for analysis, funding stimulated by recent
triumphs in the field, and attraction of diverse scientists to identify
new problems and solve old ones, metagenomics will expand and
continue to enrich our understanding of microorganisms
It is therefore also important that metagenomics be taught to
students and young scientists in the same way that other techniques
and approaches have been in the past
49/54

References
Chen, K., Pachter, L. (2005). Bioinformatics for Whole-Genome
Shotgun Sequencing of Microbial Communities. PLoS
Computational Biology, 1(2),106-12.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F,
(2010). Using the metagenomics RAST server (MG-RAST) for
analyzing shotgun metagenomes. Cold Spring Harb Protocol,
1,5368.
Huson DH, Auch AF, Qi J, Schuster SC, (2007). MEGAN analysis
of metagenomic data. Genome Research, 17(3), 377-386.
Li X. Waterman M.S. (2003). Estimating the repeat structure and
length of DNA sequences using l-tuples. Genome
Research, 13,1916–1922.
50/54

Markowitz VM, Ivanova NN, et al. ,(2008). IMG/M: a data
management and analysis system for metagenomes. Nucleic
Acids Research, 36, D534-538.
Miller JR, Koren S, Sutton G (2010). Assembly algorithms for
next-generation sequencing data. Genomics, 95(6),315-327.
Thomas et al., (2012). Metagenomics - a guide from sampling to
data analysis. Microbial Informatics and Experimentation 2,3.
Yu-Wei Wu and Yuzhen Ye, (2011). A novel abundance-based
algorithm for binning metagenomic sequences using l-tuples.
Journal of Computational Biology, 18(3),523-534.
References
51/54

Yok NG, Rosen GL (2011). Combining gene prediction methods to
improve metagenomic gene annotation. BMC Bioinformatics,
12,20.
Z L Sabree, M R Rondon, and J Handelsman, (2009) .
Metagenomics. Elsevier, 10,622-632.
References
52/54

53/54