Finding genes

SabahatAli9 4,169 views 36 slides Feb 13, 2019

About This Presentation

An Open Reading Frame (ORF) is a DNA sequence that may potentially encode a protein; in intron-containing genes, the translated sequence lies in the exons


Slide Content

An Introduction to Bioinformatics
Finding genes in prokaryotes

Usually the primary challenge that follows the sequencing of
anything from a small segment of DNA to a complete genome
is to establish the location of functional elements such as:
genes (including intron/exon boundaries),
promoters,
terminators, etc.
DNA sequences that may potentially encode proteins are called
Open Reading Frames (ORFs)
The situation in prokaryotes is relatively straightforward since
scarcely any eubacterial and archaeal genes contain introns

FINDING ORFs
The simplest method in prokaryotes is to scan the DNA for
start and stop codons
The DNA is double stranded and each strand has three
potential reading frames (codons are groups of 3 bases)
THE CAT ATE THE RAT Frame 1
T HEC ATA TET HER AT Frame 2
TH ECA TAT ETH ERA T Frame 3
The scan must look at all 6 reading frames
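The six reading frames can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool; the function names (reverse_complement, codons, six_frames) are made up for this example, and the input is assumed to contain only A, C, G and T.

```python
# Complement table for the four DNA bases (assumes no ambiguity codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at offset 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(dna):
    """Return the codons of all six reading frames, keyed +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames


# The slide's mnemonic, split with the same helper (incomplete triplets dropped):
for offset in range(3):
    print(offset + 1, " ".join(codons("THECATATETHERAT", offset)))
```

Only one of the three offsets gives readable "words", which is the point of the mnemonic: the same letters mean something different in each frame.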

Any region of DNA between a start codon and a stop codon in
the same reading frame could potentially code for a polypeptide
and is therefore an ORF
Start: AUG (methionine). Stop: UAA, UAG, UGA
Small potential coding sequences like this will occur frequently
by chance, so the longer an ORF is, the more likely
it is to represent a real coding region (a gene)
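Why length matters can be seen with a rough calculation. The sketch below assumes, purely for illustration, that all 64 codons are equally likely and independent (real genomes are not like this); the function name prob_no_stop is made up for this example.

```python
def prob_no_stop(n_codons, stop_codons=3, total_codons=64):
    """Chance that n_codons consecutive random codons contain no stop codon.

    Assumes equal, independent codon frequencies (an idealisation).
    """
    return (1 - stop_codons / total_codons) ** n_codons


for n in (10, 50, 100, 300):
    print(n, prob_no_stop(n))
```

Under this assumption a random stretch of 100 codons stays open less than 1% of the time, so long ORFs are unlikely to be chance events, while short ones are common.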
Problems
Small genes may be missed
The actual start codon may be internal to the ORF
There may be overlapping genes
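A simple start-to-stop scan over all six frames might look like the sketch below. It is an illustration under stated assumptions, not the algorithm of any named program: the name find_orfs and the min_codons parameter are hypothetical, the standard stop codons are used, and coordinates for the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # DNA form of UAA, UAG, UGA
START_CODON = "ATG"                   # DNA form of AUG
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=30):
    """Return (strand, frame, start, end, n_codons) for each ORF found.

    An ORF here runs from the first ATG after a stop to the next in-frame
    stop codon; n_codons excludes the stop. end is exclusive.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:     # drop short chance ORFs
                        orfs.append((strand, frame + 1, start, i + 3, n_codons))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAACCCTAGGG", min_codons=3))
```

The sketch shows the problems listed above directly: raising min_codons discards small genes, the first ATG is taken as the start even if the real start codon is internal, and ORFs in different frames are reported independently, so overlaps are not resolved.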

The simplest tool for finding ORFs is ORF Finder at NCBI
It simply scans all 6 reading frames and shows the position of
the ORFs which are greater than a user defined minimum size
The genetic code used for the analysis can be altered by the
user
This would be important if e.g. mitochondrial or ciliate nuclear
DNA were being analysed
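The effect of changing the genetic code can be illustrated by changing which codons count as stops. The table numbers below follow the NCBI genetic code numbering (1 standard, 2 vertebrate mitochondrial, 6 ciliate nuclear), which is an addition here rather than something stated on the slide; the function name codons_before_stop is hypothetical.

```python
# Stop codons under three genetic codes (DNA alphabet).
STOP_CODONS_BY_TABLE = {
    1: {"TAA", "TAG", "TGA"},          # standard code
    2: {"TAA", "TAG", "AGA", "AGG"},   # vertebrate mitochondrial: TGA codes for Trp
    6: {"TGA"},                        # ciliate nuclear: TAA and TAG code for Gln
}


def codons_before_stop(dna, table=1):
    """Count codons read in frame 1 before the first stop under the given code."""
    stops = STOP_CODONS_BY_TABLE[table]
    dna = dna.upper()
    count = 0
    for i in range(0, len(dna) - 2, 3):
        if dna[i:i + 3] in stops:
            break
        count += 1
    return count


seq = "ATGTGAAAATAAGGG"   # ATG TGA AAA TAA GGG
for table in (1, 2, 6):
    print(table, codons_before_stop(seq, table))
```

The same sequence is cut short at TGA under the standard code but reads through it under the mitochondrial code, so an ORF scan with the wrong code would truncate or extend genes.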

To overcome the limitations of ORF finder, more sophisticated
programmes detect compositional biases and increase the
reliability of gene detection
These compositional biases are regular, though very diffuse,
and arise for a variety of reasons:
in many organisms there is a detectable preference for G or C
over A and T in the third ("wobble") position in a codon
organisms do not use synonymous codons with equal
frequency - consequently there is a codon bias
there is an unequal usage of amino acids in proteins, sufficient to
cause a bias in all three positions of codons and increase the
overall codon bias
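Both kinds of bias can be measured directly from a coding sequence. The sketch below is illustrative only: the function names codon_usage and gc_by_codon_position are made up, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame sequence."""
    cds = cds.upper()
    triplets = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(triplets)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def gc_by_codon_position(cds):
    """Return the G+C fraction at codon positions 1, 2 and 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = cds[pos:usable:3]
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


example = "ATGGCCAAGTTC"                       # ATG GCC AAG TTC
print(codon_usage(example))
print(gc_by_codon_position(example))
```

In a real coding region of a GC-rich organism the third value would be expected to stand out from the first two; in non-coding DNA the three positions should look alike, which is what makes the signal usable for gene detection.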

The %GC content of the first two codon positions of the
universal genetic code is approximately 50%; therefore,
organisms which have a low or high %GC content will exhibit
a marked bias at the third position of codons to achieve their
overall %GC content
The most recent approaches to using compositional features
to distinguish coding from non-coding regions employ ‘Markov
models’
Such approaches include the popular GENEMARK and
GLIMMER programs
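The idea behind a Markov model can be sketched with a toy first-order chain: estimate how often each base follows each other base in known coding and known non-coding DNA, then score a new sequence by the log-odds of the two models. This is a deliberately simplified illustration; GENEMARK and GLIMMER use more elaborate models, and the function names, the pseudocount and the training strings below are all invented for the example.

```python
import math

BASES = "ACGT"


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | previous base)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {nxt: n / total for nxt, n in row.items()}
    return model


def log_odds(seq, coding_model, noncoding_model):
    """Sum of log(P_coding / P_noncoding) over all adjacent base pairs.

    Positive scores favour the coding model, negative the non-coding one.
    """
    seq = seq.upper()
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        if prev in coding_model and nxt in coding_model[prev]:
            score += math.log(coding_model[prev][nxt] / noncoding_model[prev][nxt])
    return score


# Invented training data, chosen only to make the two models differ.
coding = train_markov(["GCGCGCGCGC"])
noncoding = train_markov(["ATATATATAT"])
print(log_odds("GCGCGC", coding, noncoding))
print(log_odds("ATATAT", coding, noncoding))
```

In practice the models would be trained on many verified genes and intergenic regions from the organism of interest, and higher-order chains (conditioning on several preceding bases) capture codon structure better than this first-order toy.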

Finding Genes in Eukaryotes
An Introduction to Bioinformatics

AIMS
To establish the concept of ORFs and their relationship to genes
To describe the features used by software to find ORFs/genes
To become familiar with Web-based programmes used to find
ORFs/genes
OBJECTIVES
To be able to distinguish between the concepts of ORF and gene
Use ORF Finder to find ORFs in prokaryotic nucleotide sequences
To describe the complications of the eukaryote “signals”
To be aware of the Web-based programmes
To be able to use the eukaryote programmes for a number of
organisms

Eukaryotes are organisms whose cells have a membrane-bound
nucleus and many specialised structures located within
their cell boundary.
In these organisms, genetic material is organized into
chromosomes that reside in the nucleus.

Principles
• Content - codon usage
– often species or class specific
• Signals - position weight matrices (PWMs)
– principle is the same, signals are different
– Complication of introns/exons
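A position weight matrix can be sketched as follows: count the bases seen at each position of a set of aligned example sites, convert the counts to log-odds against a background, and slide the matrix along a sequence. This is a minimal illustration; the function names, the pseudocount, the uniform background and the four example sites are assumptions made for the example, and only A, C, G and T are handled.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix (one dict per position) from aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_site(pwm, candidate):
    """Sum the per-position scores for a candidate of the same width as the PWM."""
    return sum(col[base] for col, base in zip(pwm, candidate.upper()))


def best_hit(pwm, dna):
    """Return (score, position) of the best-scoring window in dna."""
    width = len(pwm)
    windows = [(score_site(pwm, dna[i:i + width]), i)
               for i in range(len(dna) - width + 1)]
    return max(windows)


# Invented example sites resembling a TATA-like signal.
pwm = build_pwm(["TATAAA", "TATAAA", "TATATA", "TATAAA"])
print(best_hit(pwm, "GGGCTATAAAGGC"))
```

The same code works for any signal (promoter elements, splice sites, polyadenylation signals): only the training sites change, which is what "principle is the same, signals are different" means in practice.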

Eukaryotic promoter
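[Diagram: promoter region drawn 5’ to 3’, showing the GC box, CAAT box and TATA box upstream of the mRNA start (+1); positions marked at -110, -40 and -25]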
In addition - transcription factor binding sites
Genes can be enormous!
Controlled by “distant” enhancers

Signals on the mRNA
Kozak sequence at the translational start (AUG)
STOP codon
Polyadenylation sequence (AAUAAA), followed ~12 bp downstream by the polyA tail (AAAAA…...)

Introns and Exons
The chicken 1a2 collagen gene is 38 kb long and has > 50 introns
The Muscular Dystrophy gene is 2.5 Mb and has
? exons!

Splicing signals
5’ exon ...(C/A)AG | GT(A/G)AGT ... (T/C)>11 N (C/T)AG | G... 3’ exon
(| marks an exon/intron boundary; X/Y means either base; N means any base)
GT-AG rule
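The GT-AG rule on its own is a very weak filter, as a naive scan shows. The sketch below lists every stretch that starts with GT and ends with AG; the function name candidate_introns and the min_length parameter are hypothetical, and the test sequence is invented.

```python
def candidate_introns(dna, min_length=20):
    """Return (start, end) for every stretch beginning with GT and ending with AG.

    end is exclusive, so dna[start:end] is the candidate intron.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a + 2)
            for d in donors
            for a in acceptors
            if a + 2 - d >= min_length]


# Invented sequence: exon end, donor consensus, pyrimidine tract, acceptor, exon start.
dna = "AAG" + "GTAAGT" + "C" * 13 + "AG" + "GA"
for start, end in candidate_introns(dna, min_length=10):
    print(start, end, dna[start:end])
```

Even this short sequence yields more than one candidate, because GT and AG dinucleotides are common. Scoring the surrounding consensus with a weight matrix, and checking the coding content of the flanking exons, is what narrows the list down.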

Exon finding
• Initial exons, from the initiation codon to the first
splice site;
• Internal exons from splice site to splice site;
• Terminal exons from splice site to stop codon;
• Single exons corresponding to uninterrupted,
intronless genes, i.e., running from initiation codon to
stop codon.
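The four exon classes differ only in what bounds them on each side, which can be written as a small lookup. This is an illustrative sketch; the function name classify_exon and the boundary labels are invented for the example.

```python
# Map (left boundary, right boundary) to the exon class described above.
EXON_CLASSES = {
    ("start_codon", "splice_site"): "initial",
    ("splice_site", "splice_site"): "internal",
    ("splice_site", "stop_codon"): "terminal",
    ("start_codon", "stop_codon"): "single",
}


def classify_exon(left_boundary, right_boundary):
    """Return the exon class, or 'invalid' for an impossible combination."""
    return EXON_CLASSES.get((left_boundary, right_boundary), "invalid")


print(classify_exon("start_codon", "splice_site"))
print(classify_exon("start_codon", "stop_codon"))
```

A gene finder has to consider all four classes at every candidate position, since it does not know in advance whether a start codon opens a single-exon gene or only the first exon of a longer one.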

Integrated Gene Parsing
• Search for signals
• Perform a content analysis
• Define the intron/exon boundaries

Gene finding web sites
http://www.tigr.org/~salzberg/appendixa.html
>25 listed sites
GENSCAN
FGENES