GENE PREDICTION STRATEGIES IN EUKARYOTES
Presented By:
Kamakshi Maheshwari (MTB/20/1004)
Bhagyashri Kesarwani (MTB/20/1012)
Vedika Rai (MTB/20/1024)
INTRODUCTION
■Computational gene prediction is increasingly essential for the automatic analysis and
annotation of large, uncharacterized genomic sequences.
■The goal of genome annotation is to predict all gene structures in a given genomic sequence.
■There are two basic problems in gene prediction:-
➢ prediction of protein coding regions
➢ prediction of the functional sites of genes
■In eukaryotes, gene prediction is quite a different problem from that encountered in prokaryotes:
➢intron sequences are present in the genomic DNA of eukaryotes and interrupt the coding regions
■Many gene prediction programs have been developed; they can be classified into four generations.
■First generation: identified approximate locations of coding regions in genomic DNA
➢TestCode
➢GRAIL
■Second generation: combined splice signal and coding region identification to predict potential
exons, but did not attempt to assemble the predicted exons into complete genes.
➢SORFIND
➢Xpound
■Third generation: predicted complete gene structures.
➢GeneID
➢GeneParser
➢GenLang
➢FGENEH
■However, the performance of those programs remained rather poor.
■Those programs all assumed that the input sequence contains exactly one
complete gene, which is often not the case.
■To solve this problem and further improve accuracy and applicability, several programs were
developed, which can be classified as the fourth generation:
➢GENSCAN
➢AUGUSTUS
➢GENEID
■There are two main classes of methods for computational gene prediction. One is based on
sequence similarity searches; the other is based on gene structure and signals and is
also referred to as ab initio gene finding.
EUKARYOTIC GENE STRUCTURE PREDICTION
•Low gene density
•Space between genes is very large, with many repeated sequences and transposable elements
•Eukaryotic genes are split (introns/exons)
•Transcript is capped (methylation of the 5’ residue)
•Splicing takes place in the spliceosome
•Alternative splicing
•Polyadenylation (~250 As added) downstream of the CAATAAA(T/C) consensus box
•Major issue: identification of splice sites. GT-AG rule: GTAAGT / Y12NCAG at the 5’/3’ intron splice junctions
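The GT-AG rule can be illustrated with a minimal Python sketch that lists every span starting with GT and ending with AG. This is only an illustration: the function name and the `min_len` / `max_len` limits are assumptions made for the example, and real gene finders score candidate sites (for example with weight matrices) rather than accepting every GT...AG pair.

```python
import re

def candidate_introns(seq, min_len=60, max_len=2000):
    """Return (start, end) spans, 0-based and end-exclusive, that begin
    with GT (5' donor) and end with AG (3' acceptor).

    min_len and max_len are illustrative limits, not values from the slides.
    """
    seq = seq.upper()
    # Lookahead patterns so that overlapping occurrences are all reported.
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptor_ends = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if min_len <= a - d <= max_len
    ]

# Toy sequence; min_len is lowered so the short example yields candidates.
toy = "ATGGTAAGTCCCCAGTT"
for start, end in candidate_introns(toy, min_len=4):
    print(start, end, toy[start:end])
```

Even on this short toy sequence several overlapping candidates are returned, which is why identifying the true splice sites is the major issue.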
METHODS OF GENE PREDICTION
■Sequence Similarity Based
–An extrinsic approach that identifies
genes through homology searches against
known databases (genomic DNA, dbEST
or protein)
–Comparing two homologous
genomic sequences facilitates the
identification of conserved exons.
–When combined with signal sensors
(signals relating to transcription,
translation and splicing), it can refine
region boundaries and allow accurate
modelling of gene structure and
organisation.
■Ab Initio Based
–Intrinsic method that predicts gene
structure from the given sequence
alone, using signal sensors (e.g.
poly-A sites, intron splice sites) and
content sensors, without prior
information about the gene
–They may also use nucleotide
composition based methods built on
Markov models (as used in Hidden
Markov Models), in which the
probability of a nucleotide at a given
position depends on the previous k
nucleotides (conditional probabilities).
–The generalised HMM (GHMM) is the
most commonly used; it allows a state
to emit a string rather than a single symbol.
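The idea that a nucleotide depends on the previous k nucleotides can be sketched as a k-th order Markov content sensor. The Python below is a minimal illustration under stated assumptions: the function names, the add-one pseudocount and the toy training sequences are all invented for the example and do not come from any particular gene finder.

```python
import math
from collections import defaultdict

def train_markov(seqs, k, alphabet="ACGT", pseudo=1.0):
    """Estimate P(base | previous k bases) from training sequences.

    Returns a function prob(context, base). The pseudocount is an
    illustrative smoothing choice so unseen contexts get non-zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudo * len(alphabet)
        return (c.get(base, 0.0) + pseudo) / total

    return prob

def log_odds(seq, coding, noncoding, k):
    """Log-likelihood ratio of seq under the coding vs. non-coding model."""
    seq = seq.upper()
    return sum(
        math.log(coding(seq[i - k:i], seq[i]) / noncoding(seq[i - k:i], seq[i]))
        for i in range(k, len(seq))
    )

# Toy training data (hypothetical): a GC-rich "coding" set, an AT-repeat "non-coding" set.
k = 2
coding = train_markov(["ATGGCCGCCGCCGCC"], k)
noncoding = train_markov(["ATATATATATATAT"], k)
print(log_odds("GCCGCC", coding, noncoding, k))  # positive: looks more like the coding set
print(log_odds("ATATAT", coding, noncoding, k))  # negative: looks more like the non-coding set
```

A positive score means the window resembles the coding training set more than the non-coding one; in practice the models would be trained on large sets of annotated sequence.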
AUGUSTUS
■AUGUSTUS is based on a
generalized Hidden
Markov Model (GHMM).
■AUGUSTUS is a program
that predicts genes in
eukaryotic genomic
sequences.
■It can be run through a web interface: http://augustus.gobics.de/
■http://bioinf.uni-greifswald.de/augustus/submission.php
GENSCAN
GENSCAN was developed
by Chris Burge, Department of
Mathematics, Stanford
University.
It is a general-purpose gene
identification program which
analyses genomic DNA
sequences from a variety of
organisms including human,
other vertebrates, invertebrates
and plants.
Based on a GHMM (generalized
Hidden Markov Model).
The sequence file may be in
either FASTA or minimal
GenBank format.
Used to predict the location of
genes and their exon-intron
boundaries in genomic
sequences.
http://hollywood.mit.edu/GENSCAN.html
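Since the input is a FASTA file, a minimal FASTA reader in Python is sketched here for preparing sequences before submission. It is an assumption-laden illustration: `read_fasta` is a hypothetical helper, not part of GENSCAN, and it keeps only the first word of each header line as the record name.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into {name: sequence}.

    Hypothetical helper for illustration: the record name is the first
    whitespace-delimited word after '>', and sequence lines are joined
    and upper-cased.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}

example = [">seq1 human", "acgt", "GGTA", "", ">seq2", "TTAA"]
print(read_fasta(example))
```

The function accepts any iterable of lines, so an open file handle can be passed directly.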
GENEID
Geneid is a program to predict genes in
anonymous genomic sequences
designed with a hierarchical structure.
Step 1: Splice sites, start and stop codons
are predicted and scored along the
sequence using Position Weight Arrays
(PWAs).
Step 2: Exons are built from the sites and
are scored as the sum of the scores of
the defining sites, plus the
log-likelihood ratio of a Markov Model for
coding DNA.
Step 3: From the set of predicted exons,
the gene structure is assembled,
maximising the sum of the scores of the
assembled exons.
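Step 3 can be pictured as choosing a set of non-overlapping exons with the highest total score. The Python sketch below does this with a standard dynamic program (weighted interval scheduling). It is a simplified illustration, not geneid's actual algorithm: the function name and the toy exon scores are assumptions, and constraints a real assembler must respect (reading frame compatibility, exon order within a gene, strand) are ignored.

```python
from bisect import bisect_right

def assemble_exons(exons):
    """Pick non-overlapping exons maximising the sum of their scores.

    exons: list of (start, end, score) with inclusive integer coordinates.
    Returns (best_total_score, chosen_exons_in_genomic_order).
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)    # best[i]: best score using the first i exons
    take = [False] * (n + 1)  # whether exon i is used in best[i]
    prev = [0] * (n + 1)      # how many earlier exons end before exon i starts
    for i, (start, end, score) in enumerate(exons, 1):
        p = bisect_right(ends, start - 1, 0, i - 1)
        prev[i] = p
        if best[p] + score > best[i - 1]:
            best[i] = best[p] + score
            take[i] = True
        else:
            best[i] = best[i - 1]
    # Trace back the chosen exons.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(exons[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[n], chosen[::-1]

# Toy candidate exons with made-up scores.
candidates = [(1, 100, 5.0), (50, 150, 8.0), (160, 300, 4.0), (120, 200, 3.0)]
print(assemble_exons(candidates))
```

In this toy set the exons (50, 150) and (160, 300) are selected for a total score of 12.0, because the higher-scoring middle exon excludes both candidates that overlap it.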
Geneid is very efficient in terms of speed and memory usage. Currently, geneid v1.2 analyses 1 Gbp/hour
(the whole human genome in 3 hours).
https://genome.crg.cat/software/geneid/geneid.html
GENIE
■The Genie system is a generalized hidden
Markov model (GHMM) that incorporates
signal and content sensors.
■The most studied content sensor is the one
that predicts coding regions, referred to as
coding exons or simply exons.
■In Genie, these content sensors are mostly
based on codon usage and codon
preferences, together with a length
distribution.
■The initial version was trained and optimized
for human genes. The current Genie system is
a newly trained version of the original work
that also includes training for Drosophila
melanogaster.
EUGENE
■EuGene is an open integrative gene
finder for eukaryotic and prokaryotic
genomes.
■Compared to most existing gene
finders, EuGene is characterized by its
ability to simply integrate arbitrary
sources of information in its prediction
process, including RNA-Seq, protein
similarities, homologies and various
statistical sources of information.
■EuGene-EP (Eukaryote Pipeline) can
exploit probabilistic models such as Markov
models to discriminate coding from
non-coding sequences, or to
discriminate effective splice sites from
false splice sites (using various
mathematical models).