Lecture bioinformatics Part2.next generation

MohamedHasan816582 38 views 33 slides Apr 28, 2024

About This Presentation

Genome


Slide Content

Genes, Genomes, and
Genomics
Bioinformatics in the Classroom
plagiarized from:
http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt
June, 2003

2
Two. Again …
Francis Collins, HGP
Craig Venter, Celera Inc.

3
What’s in a chromosome?

4
Hierarchical vs. Whole Genome

5
The value of genome sequences lies in
their annotation
Annotation – Characterizing genomic
features using computational and
experimental methods
Genes: Four levels of annotation
Gene Prediction – Where are genes?
What do they look like?
Domains – What do the proteins do?
Role – What pathway(s) are they involved in?

6
How many genes?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on
GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding
sequences?
UniGene: > 89,000 clusters of unique
ESTs?

7
Current consensus (in flux …)
15,000 known genes (similarity to
previously isolated genes and expressed
sequences from a large variety of different
organisms)
17,000 predicted (GenScan, GeneFinder,
GRAIL)
Based on and limited to previous
knowledge

8
How do we get from here …

9
to here,

10
What are genes? - 1
Complete DNA segments responsible for
making functional products
Products
Proteins
Functional RNA molecules
RNAi (interfering RNA)
rRNA (ribosomal RNA)
snRNA (small nuclear)
snoRNA (small nucleolar)
tRNA (transfer RNA)

11
What are genes? - 2
Definition vs. dynamic concept
Consider
Prokaryotic vs. eukaryotic gene models
Introns/exons
Posttranscriptional modifications
Alternative splicing
Differential expression
Genes-in-genes
Genes-ad-genes
Posttranslational modifications
Multi-subunit proteins

12
Prokaryotic gene model: ORF-genes
“Small” genomes, high gene density
Haemophilus influenzae genome: 85% genic
Operons
One transcript, many genes
No introns.
One gene, one protein
Open reading frames
One ORF per gene
ORFs begin with start,
end with stop codon (def.)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html

13
Eukaryotic gene model: spliced genes
Posttranscriptional modification
5’-CAP, polyA tail, splicing
Open reading frames
Mature mRNA contains ORF
All internal exons contain open “read-through”
Pre-start and post-stop sequences are UTRs
Multiple translates
One gene – many proteins via alternative splicing

14
Expansions and Clarifications
ORFs
Start – triplets – stop
Prokaryotes: gene = ORF
Eukaryotes: spliced genes or ORF genes
Exons
Remain after introns have been removed
Flanking parts contain non-coding
sequence (5’- and 3’-UTRs)
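
The "start – triplets – stop" definition of an ORF can be illustrated with a minimal sketch. It assumes the standard start codon (ATG) and stop codons (TAA, TAG, TGA), scans only the forward strand, and reports only the outermost start in each frame; the function name find_orfs and the min_codons parameter are hypothetical names chosen for this example, not part of any tool named in the slides.

```python
# Minimal forward-strand ORF scan: start codon, triplets, stop codon.
# Assumes the standard genetic code; names here are illustrative only.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A in ATG, end is the index just past the
    stop codon, and frame is 0, 1, or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATGACC"
    for start, end, frame in find_orfs(example):
        print(frame, start, end, example[start:end])
```

A real ORF detector would also scan the reverse complement and handle alternative start codons; for eukaryotic spliced genes this scan only makes sense on the mature mRNA, not on genomic sequence.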

15
Where do genes live?
In genomes
Example: human genome
Ca. 3,200,000,000 base pairs
25 chromosomes : 1-22, X, Y, mt
28,000-45,000 genes (current estimate)
Gene length: 128 nucleotides (RNA gene) to 2,800 kb (DMD)
Ca. 25% of the genome is genes (introns, exons)
Ca. 1% of genome codes for amino acids (CDS)
30 kb gene length (average)
1.4 kb ORF length (average)
3 transcripts per gene (average)

16
Sample genomes
Species          Size (Mb)   Genes    Genes/Mb
H. sapiens       3,200       35,000   11
D. melanogaster  137         13,338   97
C. elegans       85.5        18,266   214
A. thaliana      115         25,800   224
S. cerevisiae    15          6,144    410
E. coli          4.6         4,300    934
List of 68 eukaryotes, 141 bacteria, and 17 archaea at
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
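
The Genes/Mb column is simply gene count divided by genome size. The short sketch below recomputes it from the sizes and gene counts listed on this slide, reading the D. melanogaster entry "13.338" as 13,338 genes (an assumption); the function name genes_per_mb is hypothetical.

```python
# Recompute gene density (genes per megabase) from the slide's figures.
# Genome sizes are in Mb; the D. melanogaster gene count is read as 13,338.
GENOMES = {
    "H. sapiens": (3200.0, 35000),
    "D. melanogaster": (137.0, 13338),
    "C. elegans": (85.5, 18266),
    "A. thaliana": (115.0, 25800),
    "S. cerevisiae": (15.0, 6144),
    "E. coli": (4.6, 4300),
}


def genes_per_mb(size_mb, gene_count):
    """Gene density: number of genes divided by genome size in Mb."""
    return gene_count / size_mb


if __name__ == "__main__":
    for species, (size_mb, gene_count) in GENOMES.items():
        print(f"{species:16s} {genes_per_mb(size_mb, gene_count):7.1f}")
```

The results agree with the slide's column to within rounding, and they make the trend explicit: gene density drops by roughly two orders of magnitude from E. coli to H. sapiens.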

17
So much DNA – so “few” genes …
[Figure: genic vs. intergenic sequence]

18
Genomic sequence features
Repeats (“Junk DNA”)
Transposable elements, simple repeats
RepeatMasker
Genes
Vary in density, length, structure
Identification depends on evidence and methods and
may require concerted application of bioinformatics
methods and lab research
Pseudogenes
Look-alikes of genes; they obstruct gene-finding efforts.
Non-coding RNAs (ncRNA)
tRNA, rRNA, snRNA, snoRNA, miRNA
tRNAscan-SE, COVE

19
Gene identification
Homology-based gene prediction
Similarity Searches (e.g. BLAST, BLAT)
Genome Browsers
RNA evidence (ESTs)
Ab initio gene prediction
Gene prediction programs
Prokaryotes
ORF identification
Eukaryotes
Promoter prediction
PolyA-signal prediction
Splice site, start/stop-codon predictions

20
Gene prediction through comparative genomics
Highly similar (conserved) regions
between two genomes are probably functional;
otherwise they would have diverged
If genomes are too closely related all
regions are similar, not just genes
If genomes are too far apart, analogous
regions may be too dissimilar to be found

21
Genome Browsers
Generic Genome Browser (CSHL)
www.wormbase.org/db/seq/gbrowse
NCBI Map Viewer
www.ncbi.nlm.nih.gov/mapview/
Ensembl Genome Browser
www.ensembl.org/
Apollo Genome Browser
www.bdgp.org/annot/apollo/
UCSC Genome Browser
genome.ucsc.edu/cgi-bin/hgGateway?org=human

22
Gene discovery using ESTs
Expressed Sequence Tags (ESTs)
represent sequences from expressed
genes.
If a region matches an EST with high
stringency, then the region is probably a
gene or pseudogene.
EST overlapping exon boundary gives
an accurate prediction of exon boundary.

23
Ab initio gene prediction
Prokaryotes
ORF-Detectors
Eukaryotes
Position, extent & direction: through promoter
and polyA-signal predictors
Structure: through splice site predictors
Exact location of coding sequences: through
determination of relationships between
potential start codons, splice sites, ORFs,
and stop codons

24
Tools
ORF detectors
NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
Promoter predictors
CSHL: http://rulai.cshl.org/software/index1.htm
BDGP: fruitfly.org/seq_tools/promoter.html
ICG: TATA-Box predictor
PolyA signal predictors
CSHL: argon.cshl.org/tabaska/polyadq_form.html
Splice site predictors
BDGP: http://www.fruitfly.org/seq_tools/splice.html
Start-/stop-codon identifiers
DNALC: Translator/ORF-Finder
BCM: Searchlauncher

25
How it works I – Motif identification
Exon-Intron Borders = Splice Sites
Exon Intron Exon
~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~
~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~
~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~
~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~
Splice site Splice site
Exon Intron Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
Splice site Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
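
The motif in the alignment above can be recovered mechanically. The sketch below takes the intron starts and intron ends of the four example sequences on this slide and extracts the bases that are identical in every sequence; the helper names common_prefix and common_suffix are hypothetical, and a real motif-extraction program would use position weight matrices rather than exact identity.

```python
# Extract the invariant bases at intron boundaries from the four
# example sequences on the slide (intron side of each splice site).
INTRON_STARTS = ["gtttgtagac", "gtgagccgtg", "gtaagaggtt", "gtgagtgcca"]
INTRON_ENDS = ["tgtgtttcag", "tctattctag", "atatctccag", "ttatttccag"]


def common_prefix(seqs):
    """Longest prefix shared by every sequence in seqs."""
    prefix = []
    for column in zip(*seqs):
        # Stop at the first alignment column that is not fully conserved.
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return "".join(prefix)


def common_suffix(seqs):
    """Longest suffix shared by every sequence in seqs."""
    reversed_seqs = [s[::-1] for s in seqs]
    return common_prefix(reversed_seqs)[::-1]


if __name__ == "__main__":
    print("donor motif:   ", common_prefix(INTRON_STARTS).upper())
    print("acceptor motif:", common_suffix(INTRON_ENDS).upper())
```

With only four sequences the exact-match approach is enough to show the GT...AG pattern; with real data, some introns deviate from it, which is one reason predictors score motifs probabilistically.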

26
How it works II – Movies
Pribnow-Box Finder 0/1
Pribnow-Box Finder all

27
How it works III – The (ugly) truth

28
Gene prediction programs
Rule-based programs
Use explicit set of rules to make decisions.
Example: GeneFinder
Neural Network-based programs
Use data set to build rules.
Examples: Grail, GrailEXP
Hidden Markov Model-based programs
Use probabilities of states and transitions
between these states to predict features.
Examples: Genscan, GenomeScan
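
To make "probabilities of states and transitions" concrete, here is a toy sketch of the Viterbi algorithm on a two-state model (coding vs. noncoding). Every probability below is invented for illustration, and the model is far simpler than what Genscan or GenomeScan actually use; it only shows how state, transition, and emission probabilities combine to label each base.

```python
import math

# Toy two-state HMM. All probabilities are made-up illustration values:
# the "coding" state is GC-rich, the "noncoding" state is AT-rich.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    # Best log-probability of any path ending in each state so far.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores = {}
        pointers = {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    sequence = "ATATATATGCGCGCGC"
    for base, state in zip(sequence, viterbi(sequence)):
        print(base, state)
```

The transition probabilities act as a penalty for switching state, so the model only changes its label when the base composition shifts for long enough to pay for the switch.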

29
Evaluating prediction programs
Sensitivity vs. Specificity
Sensitivity
How many genes were found out of all
present?
Sn = TP/(TP+FN)
Specificity
How many predicted genes are indeed genes?
Sp = TP/(TP+FP)
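
A small worked example of the two formulas, using made-up gene identifiers. Note that Sp as defined on this slide, TP/(TP+FP), is the convention used in gene prediction; in other fields the same quantity is called precision or positive predictive value. The function names and the example sets are assumptions for illustration.

```python
# Sensitivity and specificity as defined on the slide.
# Gene identifiers below are invented for the example.
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): fraction of real genes that were found."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = TP / (TP + FP): fraction of predictions that are real genes."""
    return tp / (tp + fp)


def evaluate(true_genes, predicted_genes):
    """Return (Sn, Sp) for two sets of gene identifiers."""
    tp = len(true_genes & predicted_genes)   # found and real
    fn = len(true_genes - predicted_genes)   # real but missed
    fp = len(predicted_genes - true_genes)   # predicted but not real
    return sensitivity(tp, fn), specificity(tp, fp)


if __name__ == "__main__":
    true_genes = {"g1", "g2", "g3", "g4"}
    predicted_genes = {"g1", "g2", "g5"}
    sn, sp = evaluate(true_genes, predicted_genes)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Here two of four real genes are found (Sn = 0.50) and two of three predictions are correct (Sp is about 0.67), which shows that a program can score quite differently on the two measures.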

30
Gene prediction accuracies
Nucleotide level: 95% Sn, 90% Sp (lows less than 50%)
Exon level: 75% Sn, 68% Sp (lows less than 30%)
Gene level: 40% Sn, 30% Sp (lows less than 10%)
Programs that combine statistical evaluations with
similarity searches are the most powerful.

31
Common difficulties
First and last exons are difficult to annotate
because they contain UTRs.
Smaller genes do not reach statistical significance, so
they are thrown out.
Algorithms are trained with sequences from
known genes which biases them against genes
about which nothing is known.
Masking repeats frequently removes potentially
indicative chunks from the untranslated regions
of genes that contain repetitive elements.

32
The annotation pipeline
Mask repeats using RepeatMasker.
Run sequence through several programs.
Take predicted genes and do similarity
search against ESTs and genes from
other organisms.
Do similarity search for non-coding
sequences to find ncRNA.

33
Annotation nomenclature
Known Gene – Predicted gene matches the
entire length of a known gene.
Putative Gene – Predicted gene contains region
conserved with known gene. Also referred to as
“like” or “similar to”.
Unknown Gene – Predicted gene matches a
gene or EST of which the function is not known.
Hypothetical Gene – Predicted gene that does
not contain significant similarity to any known
gene or EST.