Genome annotation 2013

50,025 views 48 slides Jan 27, 2014

About This Presentation

Genome annotation, NGS sequence data, decoding sequence information, The genome contains all the biological information required to build and maintain any given living organism.


Slide Content

Genome Annotation
Karan Veer Singh,
Scientist.
NBAGR, Karnal,
India

•The genome contains all the biological information required to
build and maintain any given living organism
•The genome contains the organism's molecular history
•Decoding the biological information encoded in these molecules
will have an enormous impact on our understanding of biology
The Genome

1. Structural genomics: genetic and physical mapping of genomes.
2. Functional genomics: analysis of gene function (and non-genes).
3. Comparative genomics: comparison of genomes across species.
Includes structural and functional genomics.
Evolutionary genomics.
Genomics

The Human genome project promised to
revolutionise medicine and explain every
base of our DNA.
Large MEDICAL GENETICS focus
Identify variation in
the genome that is
disease causing
Determine how individual
genes play a role in health
and disease
Human Genome Project

Human Genome Project & Functional
Genome
It cost about 3 billion dollars and took 13 years to complete (1990 to 2003,
2 fewer than initially predicted).
•Approx 200 Mb still in progress
–Heterochromatin
–Repetitive

Genomics & Genome
annotation
The first genome annotation software system was designed in 1995 by Dr.
Owen White at The Institute for Genomic Research, which sequenced
and analyzed the first genome of a free-living organism to be decoded,
the bacterium Haemophilus influenzae
Annotation follows assembly: the reads are assembled into contigs, which are
then aligned to a reference genome (reference assembly) or assembled de novo to
obtain the complete genome
Variations such as mutations, SNPs and InDels can be identified
The genome is then annotated by structural and functional annotation
The annotated genome is finally displayed as a map in a genome viewer, in an easily understandable manner.

Sequence to Annotation

Input1 to Genome Viewer- Variant
Annotation

Input2 to Genome Viewer- Structural
Annotation
Structural Annotation- AUGUSTUS (version
2.5.5)

Input3 to Genome Viewer-Functional
Annotation

Genome Annotation
The process of identifying the locations of
genes and the coding regions in a genome and
determining what those genes do
Finding the structural elements and attaching
their related functions to each genome
location

Genome Annotation
Gene structure prediction
Identifying elements
(introns/exons, CDS, start, stop)
in the genome
Gene function prediction
Attaching biological information
to these elements, e.g. which
protein an exon codes for

Structural annotation
Structural annotation - identification of genomic elements
Open reading frame and their localisation
gene structure
coding regions
location of regulatory motifs

Functional annotation
Functional annotation- attaching biological
information to genomic elements
biochemical function
biological function
involved regulations

Genome annotation - workflow
Genome sequence
Repeats
Structural annotation-Gene finding
Protein-coding genes, nc-RNAs (tRNA, rRNA),
Introns
Functional annotation
View in Genome viewer
Masked or un-masked genome sequence

Genome Repeats & features
Percentage of repetitive sequences in different organisms
Genome               Genome size (Mb)   % Repeat
Aedes aegypti        1,300              ~70
Anopheles gambiae    260                ~30
Culex pipiens        540                ~50
Microsatellite
Minisatellite
Tandem repeat
Short tandem repeat
SSR
Polymorphic between individuals/populations
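
The microsatellite / SSR idea above can be illustrated with a small sketch. It is not how RepeatMasker or any dedicated SSR tool works; it only looks for perfect tandem copies of a short unit with a regular expression. The function name and the default thresholds (units of 1 to 6 bases, at least 4 copies) are assumptions chosen for illustration.

```python
import re

def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats of a short unit.

    Returns a list of (start, end, unit, copies) with 0-based start and
    exclusive end. The thresholds are illustrative assumptions only.
    """
    # Lazy unit match so the shortest repeating unit is reported,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits

# Made-up sequence containing five copies of the dinucleotide CA
print(find_ssrs("GGCACACACACATT"))
```

Because the copy number of such repeats often differs between individuals, the reported `copies` value is the quantity that would be compared across samples.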

Finding repeats as a preliminary to gene prediction
 Repeat discovery
Homology based approaches
Use RepeatMasker to search the genome and mask the sequence

Masked sequence
Repeat-masked sequence is an artificial construct in which regions
thought to be repetitive are marked with X's
Masking is widely used to reduce the overhead of subsequent computational analyses and
to reduce the impact of TEs (transposable elements) in the final annotation set
>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
Positions/locations are not affected by masking
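
A minimal sketch of why masking leaves coordinates untouched: each repetitive base is replaced one-for-one, so the sequence length and every downstream position stay the same. The function name and the interval format (0-based start, exclusive end) are assumptions for illustration; in practice the intervals would come from a repeat finder such as RepeatMasker.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Replace every base inside the given intervals with mask_char.

    intervals: iterable of (start, end), 0-based, end exclusive
    (an assumed convention for this sketch).
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

seq = "atgagcttcg"            # made-up sequence
masked = hard_mask(seq, [(3, 6)])
print(masked)                 # three bases replaced, nothing deleted
print(len(seq) == len(masked))
```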

Types of Masking- Hard or Soft?
Sometimes we want to mark up repetitive sequence but not exclude it from
downstream analyses. This is achieved using a format known as soft-masking,
in which repeats are written in lower case instead of being replaced by X's
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
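
The relation between the two formats can be sketched in a few lines: a soft-masked sequence still contains the repeat bases (in lower case), so it can be turned into a hard-masked one or queried for the masked coordinates, whereas the reverse is not possible. Function names are assumptions for illustration, and the sketch expects upper-case unmasked bases as in the soft-masked example above.

```python
import re

def soft_to_hard(seq, mask_char="X"):
    """Convert a soft-masked sequence (repeats in lower case) to hard-masked."""
    return re.sub(r"[a-z]", mask_char, seq)

def soft_masked_intervals(seq):
    """Return (start, end) of each lower-case run, 0-based, end exclusive."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]

soft = "ACGTacgtACGT"         # made-up soft-masked sequence
print(soft_to_hard(soft))
print(soft_masked_intervals(soft))
```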

Genome annotation - workflow
Genome sequence
Map repeats
Gene finding- structural annotation
Protein-coding genes nc-RNAs, Introns
Functional annotation
View in Genome viewer
Masked or un-masked

Structural annotation
Identification of genomic elements
Open reading frame and their localization
Coding regions
Location of regulatory motifs
Start/Stop
Splice Sites
Non coding Regions/RNA’s
Introns

Methods
Similarity
•Similarity between sequences, which does not necessarily imply any
evolutionary linkage
Ab initio prediction
•Prediction of gene structure from first principles using only the genome
sequence

Genefinding
[Figure: two gene-finding approaches, ab initio and similarity]

ab initio prediction
[Figure: signals read from the genome: coding potential, ATG & stop codons, splice sites]
Examples:
Genefinder, Augustus,
Glimmer, SNAP, fgenesh
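
As a toy illustration of the "ATG & stop codons" signal only, the sketch below scans the three forward reading frames for open reading frames. It is far simpler than the gene finders listed above, which combine such signals with statistical models of coding potential and splice sites; it ignores introns and the reverse strand. The function name and the minimum length are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based with exclusive end; the ORF runs from the
    first ATG in a frame to the next in-frame stop codon (stop included).
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return sorted(orfs)

dna = "CCATGAAATTTTAGGG"       # made-up sequence
for start, end, frame in find_orfs(dna):
    print(start, end, frame, dna[start:end])
```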

Genefinding - similarity
 Use known coding sequence to define coding regions
 EST sequences
 Peptide sequences
Problem: alignments are often fuzzy around splice sites, so exact exon boundaries are hard to place
Examples: EST2Genome, exonerate, genewise, Augustus,
Prodigal
Gene-finding - comparative
Use two or more genomic sequences to predict genes based on
conservation of exon sequences
 Examples: Twinscan and SLAM

Genome annotation - workflow
Genome sequence
Map repeats
Gene finding- structural annotation
Protein-coding genes nc-RNAs, Introns
Functional annotation
View in Genome viewer
Masked or un-masked

Genefinding - non-coding RNA genes
 Non-coding RNA genes can be predicted using knowledge of their
structure or by similarity with known examples
 tRNAscan - uses an HMM and co-variance model for prediction of
tRNA genes
 Rfam - a suite of HMM’s trained against a large number of different
RNA genes

Gene-finding omissions
Alternative isoforms
Currently there is no good method for predicting alternative isoforms
Only created where supporting transcript evidence is present
Pseudogenes
Each genome project has a fuzzy definition of pseudogenes
Badly curated/described across the board
Promoters
Rarely a priority for a genome project
Some algorithms exist but usually not integrated into an annotation set

Practical- structural annotation
Eukaryotes- AUGUSTUS (gene model)
~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial \
  --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true \
  --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea \
  our_genome.fasta > structural_annotation.gff

Prokaryotes – PRODIGAL (Codon Usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa \
  -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt

Structural Annotation-output
Structural Annotation conducted using AUGUSTUS (version 2.5.5),
Magnaporthe_grisea as genome model
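
A structural annotation file in GFF3 format is a tab-separated table with nine columns (seqid, source, type, start, end, score, strand, phase, attributes) and 1-based inclusive coordinates. The sketch below counts feature types and gene lengths from such lines. The function name and the two example records are made up for illustration and are not real AUGUSTUS output.

```python
from collections import Counter

def summarize_gff3(lines):
    """Count feature types and record gene lengths from GFF3 lines."""
    counts = Counter()
    gene_lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blank lines, comments, directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue                      # not a feature line
        seqid, _source, ftype, start, end, _score, _strand, _phase, attrs = cols
        counts[ftype] += 1
        if ftype == "gene":
            attr = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
            gene_id = attr.get("ID", f"{seqid}:{start}")
            # GFF3 coordinates are 1-based and inclusive
            gene_lengths[gene_id] = int(end) - int(start) + 1
    return counts, gene_lengths

# Hypothetical records, for illustration only
example = [
    "##gff-version 3",
    "contig1\tAUGUSTUS\tgene\t101\t400\t0.9\t+\t.\tID=g1",
    "contig1\tAUGUSTUS\tCDS\t101\t250\t0.9\t+\t0\tID=g1.t1.cds;Parent=g1.t1",
]
counts, lengths = summarize_gff3(example)
print(dict(counts), lengths)
```

With a real file, the same function could be called as `summarize_gff3(open("structural_annotation.gff"))`.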

Functional
annotation

Genome annotation - workflow
Genome sequence
Map repeats
Gene finding- structural annotation
Protein-coding genes nc-RNAs, Introns
Functional annotation
View in Genome viewer
Masked or un-masked

Functional annotation
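[Figure: from genome to function. Genome (ATG to STOP) -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity -> functional activity. Functional annotation aims to find the function.]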

Functional annotation
Attaching biological information to genomic elements
Biochemical function
Biological function
Involved regulation and interactions
Expression
• Use the known structural annotation to predict the protein sequence
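
Going from a predicted coding sequence to a protein sequence is a codon-by-codon lookup. The sketch below uses the standard genetic code (NCBI translation table 1); a bacterial genome would normally use table 11, which has the same codon assignments but allows additional start codons, and that is not modelled here. The function name is an assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(cds):
    """Translate a coding sequence; unknown codons become 'X', stops '*'."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)

print(translate("ATGGCCTAA"))   # made-up CDS: Met, Ala, stop
```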

Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against a non-redundant
protein database (NCBI, SwissProt) for similar sequences
The top 5-10 hits are assessed visually to see whether they have
been assigned a function
Functions are then assigned to the predictions

Functional annotation - Other features
Other features which can be determined
Signal peptides
Transmembrane domains
Low complexity regions
Various binding sites, glycosylation sites etc.
Protein Domain
Secretome
See http://expasy.org/tools/ for a good list of possible prediction algorithms

Functional annotation - Other features
(Ontologies)
Use of ontologies to annotate gene products
Gene Ontology (GO)
Cellular component
Molecular function
Biological process

Practical - FUNCTIONAL
ANNOTATION
Homology Based Method
Set up a BLAST database (nucleotide/protein)
BLAST the sequences in genome.fasta against it for annotations (nucleotide/protein)
Filter the BLAST hits by E-value (<= 0.01) for nucleotide/protein
Assign functions
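
The filtering step can be sketched as follows, assuming the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the cut-off of 0.01 and the example hits are assumptions for illustration; the surviving hits would still need to be inspected before a function is assigned.

```python
def best_hits(blast_lines, max_evalue=0.01, top_n=5):
    """Keep hits with E-value <= max_evalue and return the top_n per query.

    Hits are ranked by ascending E-value, then by descending bit score.
    Returns {query: [(subject, evalue), ...]}.
    """
    hits = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue <= max_evalue:
            hits.setdefault(query, []).append((evalue, -bitscore, subject))
    return {
        query: [(subject, evalue) for evalue, _, subject in sorted(found)[:top_n]]
        for query, found in hits.items()
    }

# Hypothetical hits, for illustration only
example = [
    "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "g1\tP2\t40.0\t100\t60\t0\t1\t100\t1\t100\t0.5\t30",
    "g1\tP3\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t150",
]
print(best_hits(example))
```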

Functional annotation- output
August 2008 Bioinformatics tools for Comparative Genomics
of Vectors

Conclusion
Annotation accuracy depends on the supporting data available at the
time of annotation; the annotation must be updated as information changes
Gene predictions will change over time as new data become
available (e.g. in NCBI) that are more similar than the previous ones
Functional assignments will change over time as new data become
available (characterization of hypothetical proteins)

Genome annotation - workflow
Genome sequence
Map repeats
Gene finding- structural annotation
Protein-coding genes nc-RNAs, Introns
Functional annotation
View in Genome viewer
Masked or un-masked

Genome Viewer
The Files that can be visualised
Annotation files
Indel files
Consensus sequence
Comparative Genomics

Genome View

Short Read track

Thank You