Genome annotation, NGS sequence data, decoding sequence information, The genome contains all the biological information required to build and maintain any given living organism.
Added: Jan 27, 2014
Genome Annotation
Karan Veer Singh,
Scientist.
NBAGR, Karnal,
India
• The genome contains all the biological information required to
build and maintain any given living organism
• The genome contains the organism's molecular history
• Decoding the biological information encoded in these molecules
will have an enormous impact on our understanding of biology
The Genome
1. Structural genomics - genetic and physical mapping of genomes.
2. Functional genomics - analysis of gene function (and non-genes).
3. Comparative genomics - comparison of genomes across species.
Includes structural and functional genomics.
Evolutionary genomics.
Genomics
The Human Genome Project promised to
revolutionise medicine and explain every
base of our DNA.
Large MEDICAL GENETICS focus
Identify variation in
the genome that is
disease causing
Determine how individual
genes play a role in health
and disease
Human Genome Project
Human Genome Project & Functional
Genome
It cost about 3 billion dollars and took 13 years to complete (1990-2003),
2 years fewer than the 15 initially planned.
•Approx 200 Mb still in progress
–Heterochromatin
–Repetitive
Genomics & Genome
annotation
The first genome annotation software system was designed in 1995 by Dr.
Owen White at The Institute for Genomic Research, which sequenced
and analyzed the first genome of a free-living organism to be decoded,
the bacterium Haemophilus influenzae.
Annotation follows assembly: the reads are assembled into contigs, which are
then aligned to a reference genome (reference assembly) or assembled de novo
to obtain the complete genome.
Variations such as mutations, SNPs and InDels can then be identified.
The genome is then annotated by structural and functional annotation.
The annotated whole genome is displayed as a map that is easy to interpret.
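As a minimal sketch of the variant identification step, the function below lists SNPs and InDels between a reference and a sample that have already been aligned to each other (gaps written as "-"). The function name, the dictionary layout and the example sequences are illustrative assumptions, not part of any particular variant caller.

```python
def call_variants(ref_aln, sample_aln):
    """List SNPs and indels between two gapped, equal-length aligned sequences.

    Positions are 1-based coordinates on the ungapped reference.
    An insertion is reported at the reference base it follows.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    ref_pos = 0
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == s:
            continue
        last = variants[-1] if variants else None
        if r == "-":
            # Extra base(s) in the sample: extend a running insertion if any.
            if last and last["type"] == "insertion" and last["pos"] == ref_pos:
                last["alt"] += s
            else:
                variants.append({"type": "insertion", "pos": ref_pos, "ref": "", "alt": s})
        elif s == "-":
            # Base(s) missing from the sample: extend a running deletion if any.
            if last and last["type"] == "deletion" and last["pos"] + len(last["ref"]) == ref_pos:
                last["ref"] += r
            else:
                variants.append({"type": "deletion", "pos": ref_pos, "ref": r, "alt": ""})
        else:
            variants.append({"type": "SNP", "pos": ref_pos, "ref": r, "alt": s})
    return variants


# Hypothetical aligned pair: one SNP, one 1-base insertion, one 2-base deletion.
for variant in call_variants("ACGT-ACGTA", "ACCTGAC--A"):
    print(variant)
```

Real pipelines call variants from read alignments with quality filtering; this only illustrates what a SNP and an InDel look like in an alignment.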
Sequence to Annotation
Input 1 to Genome Viewer - Variant Annotation
Input 2 to Genome Viewer - Structural Annotation
Structural Annotation - AUGUSTUS (version 2.5.5)
Input 3 to Genome Viewer - Functional Annotation
Genome Annotation
The process of identifying the locations of
genes and coding regions in a genome and
determining what those genes do
Finding the structural elements and attaching
their related functions to each genome
location
Genome Annotation
Gene structure prediction
Identifying elements
(introns/exons, CDS, start, stop)
in the genome
Gene function prediction
Attaching biological information
to these elements, e.g. which
protein an exon codes for
Structural annotation
Structural annotation - identification of genomic elements
Open reading frames and their localisation
gene structure
coding regions
location of regulatory motifs
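To make "open reading frames and their localisation" concrete, here is a minimal sketch of an ORF scan over all six reading frames. It assumes the standard stop codons and ATG as the only start codon; the function names and the `min_codons` threshold are illustrative choices, and real gene finders such as AUGUSTUS use far richer models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Find ATG...stop open reading frames in all six frames.

    Returns tuples (strand, frame, start, end, orf_sequence); start and end
    are 0-based, end-exclusive, and relative to the strand being scanned.
    The codon count includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence containing one short ORF on the forward strand.
print(find_orfs("CCATGAAATTTTAGCC"))
```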
Functional annotation
Functional annotation- attaching biological
information to genomic elements
biochemical function
biological function
regulation involved
Genome Repeats & features
Percentage of repetitive sequences in different organisms
Genome               Genome size (Mb)   % Repeat
Aedes aegypti        1,300              ~70
Anopheles gambiae    260                ~30
Culex pipiens        540                ~50
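The percentages translate into absolute amounts of repetitive sequence with simple arithmetic (size x percentage / 100). The short sketch below uses the approximate figures from the table; the variable and function names are illustrative.

```python
# Approximate values from the table above: (genome size in Mb, % repeat).
genomes = {
    "Aedes aegypti": (1300, 70),
    "Anopheles gambiae": (260, 30),
    "Culex pipiens": (540, 50),
}


def repeat_content_mb(size_mb, repeat_pct):
    """Approximate amount of repetitive sequence in Mb."""
    return size_mb * repeat_pct / 100


repeat_mb = {name: repeat_content_mb(size, pct) for name, (size, pct) in genomes.items()}
for name, mb in repeat_mb.items():
    print(f"{name}: ~{mb:.0f} Mb repetitive")
```

On these figures, the repeats in Aedes aegypti alone (roughly 910 Mb) exceed the whole genomes of the other two species.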
Microsatellite
Minisatellite
Tandem repeat
Short tandem repeat
SSR
Polymorphic between individuals/populations
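A minimal sketch of how short tandem repeats (SSRs / microsatellites) can be located with a regular expression. The unit-length limit of 6 bp and the minimum of 4 repeats are assumptions chosen for illustration; dedicated repeat finders use more tolerant models that allow imperfect repeats.

```python
import re


def find_ssrs(seq, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats of short units.

    Returns tuples (start, end, unit, copies) with 0-based, end-exclusive
    coordinates. The lazy quantifier makes the shortest repeating unit win.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


# Hypothetical sequence with a (CA)5 microsatellite.
print(find_ssrs("GGT" + "CA" * 5 + "TTG"))
```

Because the number of copies varies between individuals, the `copies` value at a given locus is what makes SSRs useful as polymorphic markers.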
Finding repeats as a preliminary to gene prediction
Repeat discovery
Homology based approaches
Use RepeatMasker to search the genome and mask the sequence
Masked sequence
A repeat-masked sequence is an artificial construct in which the regions
thought to be repetitive are replaced with X's
Widely used to reduce the overhead of subsequent computational analyses and
to reduce the impact of transposable elements (TEs) in the final annotation set
>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
Positions/locations are not affected by masking
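The reason positions are unaffected is that masking replaces each base one-for-one rather than deleting it. A minimal sketch, assuming the repeat intervals are already known (for example from a repeat-finding run); the function name and the toy intervals are illustrative.

```python
def hard_mask(seq, repeats, mask_char="x"):
    """Replace each repeat interval (0-based, end-exclusive) with mask_char.

    Bases are substituted one-for-one, so the sequence length and the
    coordinates of all unmasked bases stay the same.
    """
    chars = list(seq)
    for start, end in repeats:
        if not 0 <= start <= end <= len(seq):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


seq = "atgagcttcgatagc"
masked = hard_mask(seq, [(3, 7)])  # hypothetical repeat at positions 3-6
print(masked)
print(len(seq) == len(masked))
```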
Types of masking - hard or soft?
Sometimes we want to mark up repetitive sequence without excluding it from
downstream analyses. This is achieved with soft-masking, in which repeats are
written in lowercase rather than replaced
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (hardmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTXXXXXXXXXXXXXXXXXXXTAGCGATC
AGGCTACTATTXXXXXXXXXXXXXXXXXXXTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
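Since a soft-masked sequence keeps the original bases, it can always be converted to a hard-masked one, but not the other way round. The sketch below shows that conversion and how to recover the repeat coordinates from the lowercase runs; the function names are illustrative, and the choice of "X" as mask character follows the slides.

```python
import re


def soft_to_hard(seq, mask_char="X"):
    """Convert a soft-masked sequence (repeats in lowercase) to hard-masked."""
    return re.sub(r"[a-z]", mask_char, seq)


def repeat_intervals(seq):
    """Return (start, end) of each lowercase run, 0-based and end-exclusive."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


soft = "TACTATTggcttcATCTCG"  # short hypothetical soft-masked fragment
print(soft_to_hard(soft))
print(repeat_intervals(soft))
```

This is why soft-masking is the more flexible format: a downstream tool can decide for itself whether to ignore the lowercase regions.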
Structural annotation
Identification of genomic elements
Open reading frames and their localization
Coding regions
Location of regulatory motifs
Start/Stop
Splice Sites
Non-coding regions/RNAs
Introns
Methods
Similarity
• Similarity between sequences, which does not necessarily imply any
evolutionary linkage
Ab initio prediction
• Prediction of gene structure from first principles using only the genome
sequence
Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Difficulty: handling fuzzy alignment regions around splice sites
Examples: EST2Genome, exonerate, genewise, Augustus, Prodigal
Gene-finding - comparative
Use two or more genomic sequences to predict genes based on
conservation of exon sequences
Examples: Twinscan and SLAM
Genome annotation - workflow
Genome sequence
Map repeats (masked or un-masked sequence)
Gene finding - structural annotation (protein-coding genes, ncRNAs, introns)
Functional annotation
View in genome viewer
Genefinding - non-coding RNA genes
Non-coding RNA genes can be predicted using knowledge of their
structure or by similarity with known examples
tRNAscan - uses an HMM and covariance model for prediction of
tRNA genes
Rfam - a collection of RNA families, each represented by a covariance model
built from a large number of known RNA genes
Gene-finding omissions
Alternative isoforms
Currently there is no good method for predicting alternative isoforms
Only created where supporting transcript evidence is present
Pseudogenes
Each genome project has a fuzzy definition of pseudogenes
Badly curated/described across the board
Promoters
Rarely a priority for a genome project
Some algorithms exist but usually not integrated into an annotation set
Functional annotation
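[Figure: from genome to function. Genome (gene with ATG start and STOP, regions labelled A and B) → transcription → primary transcript → RNA processing → processed mRNA (m7G cap, AAAn tail) → translation → polypeptide → protein folding → folded protein → enzyme activity → functional activity. Goal: find function.]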
Functional annotation
Attaching biological information to genomic elements
Biochemical function
Biological function
Regulation and interactions involved
Expression
• Use the known structural annotation to predict the protein sequence
Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against non-redundant
protein databases (NCBI nr, Swiss-Prot) to find similar sequences
Visually assess the top 5-10 hits to check whether they have
been assigned a function
Functions are assigned to the predicted genes accordingly
Functional annotation - Other features
Other features which can be determined
Signal peptides
Transmembrane domains
Low complexity regions
Various binding sites, glycosylation sites etc.
Protein Domain
Secretome
See http://expasy.org/tools/ for a good list of possible prediction algorithms
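As one illustration of how such a feature can be predicted from sequence alone, the sketch below flags low complexity regions in a protein by the Shannon entropy of a sliding window. The window size of 8 and the entropy cut-off of 1.0 bit are arbitrary assumptions for the example, and the function names are hypothetical; the dedicated tools listed at the link above use more refined statistics.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(protein, window=8, max_entropy=1.0):
    """Return start positions (0-based) of windows with entropy <= max_entropy."""
    return [
        i
        for i in range(len(protein) - window + 1)
        if shannon_entropy(protein[i:i + window]) <= max_entropy
    ]


# Hypothetical protein: a poly-Q stretch followed by a diverse region.
print(low_complexity_windows("QQQQQQQQACDEFGHIKL"))
```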
Functional annotation - Other features
(Ontologies)
Use of ontologies to annotate gene products
Gene Ontology (GO)
Cellular component
Molecular function
Biological process
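A minimal sketch of how GO annotations for one gene product might be organised by the three aspects. The GO identifiers and term names are real GO terms; their assignment to a single example gene product is purely hypothetical, as are the function and variable names.

```python
GO_ASPECTS = ("cellular_component", "molecular_function", "biological_process")


def group_by_aspect(annotations):
    """Group (go_id, term_name, aspect) records by GO aspect."""
    grouped = {aspect: [] for aspect in GO_ASPECTS}
    for go_id, name, aspect in annotations:
        if aspect not in grouped:
            raise ValueError(f"unknown GO aspect: {aspect}")
        grouped[aspect].append((go_id, name))
    return grouped


# Hypothetical annotation set for an example transcription factor.
example_annotations = [
    ("GO:0005634", "nucleus", "cellular_component"),
    ("GO:0003677", "DNA binding", "molecular_function"),
    ("GO:0006355", "regulation of DNA-templated transcription", "biological_process"),
]

grouped = group_by_aspect(example_annotations)
for aspect in GO_ASPECTS:
    print(aspect, grouped[aspect])
```

Read together, the three aspects answer where the product acts, what it does at the molecular level, and which larger process it contributes to.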
Practical - FUNCTIONAL
ANNOTATION
Homology Based Method
Set up a BLAST database (nucleotide/protein)
BLAST genome.fasta against the database for annotations (nucleotide/protein)
Filter the BLAST hits by E-value (<= 0.01) and sort by lowest E-value (nucleotide/protein)
Assign functions
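The filtering and assignment steps can be sketched as follows, assuming the BLAST results were saved in the standard 12-column tabular format (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name, the query and subject identifiers, and the example rows are made up for illustration.

```python
def best_hits(blast_lines, max_evalue=0.01):
    """Pick the best hit per query from BLAST tabular output.

    Hits with an E-value above max_evalue are discarded; among the rest the
    lowest E-value wins, with the higher bit score breaking ties.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular rows (identifiers and values are invented).
rows = [
    "gene1\tprotein_A\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t380",
    "gene1\tprotein_B\t60.0\t150\t60\t2\t1\t150\t5\t154\t1e-5\t90",
    "gene2\tprotein_C\t30.0\t50\t35\t1\t1\t50\t1\t50\t0.5\t25",
]
print(best_hits(rows))
```

Here gene2 receives no assignment because its only hit (E-value 0.5) fails the threshold; the description of the best subject sequence would then be transferred to gene1 as its putative function.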
Functional annotation- output
August 2008 Bioinformatics tools for Comparative Genomics
of Vectors
Conclusion
Annotation accuracy depends on the supporting data available at the
time of annotation; regular updates are necessary
Gene predictions will change over time as new data become
available (e.g. at NCBI) that match more closely than previous ones
Functional assignments will change over time as new data become
available (e.g. characterization of hypothetical proteins)