Genes, Genomes, and
Genomics
Bioinformatics in the Classroom
plagiarized from:
http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt
June, 2003
2
Two. Again …
Francis Collins, HGP
Craig Venter, Celera Inc.
3
What’s in a chromosome?
4
Hierarchical vs. Whole Genome
5
The value of genome sequences lies in
their annotation
Annotation – Characterizing genomic
features using computational and
experimental methods
Genes: Four levels of annotation
Gene Prediction – Where are genes?
What do they look like?
Domains – What do the proteins do?
Role – What pathway(s) are they involved in?
6
How many genes?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on
GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding
sequences?
UniGene: > 89,000 clusters of unique
ESTs?
7
Current consensus (in flux …)
15,000 known genes (similarity to
previously isolated genes and expressed
sequences from a large variety of different
organisms)
17,000 predicted (GenScan, GeneFinder,
GRAIL)
Based on and limited to previous
knowledge
8
How do we get from here …
9
to here,
10
What are genes? -1
Complete DNA segments responsible for
making functional products
Products
Proteins
Functional RNA molecules
RNAi (interfering RNA)
rRNA (ribosomal RNA)
snRNA (small nuclear)
snoRNA (small nucleolar)
tRNA (transfer RNA)
11
What are genes? -2
Definition vs. dynamic concept
Consider
Prokaryotic vs. eukaryotic gene models
Introns/exons
Posttranscriptional modifications
Alternative splicing
Differential expression
Genes-in-genes
Genes-ad-genes
Posttranslational modifications
Multi-subunit proteins
12
Prokaryotic gene model: ORF-genes
“Small” genomes, high gene density
Haemophilus influenzae genome 85% genic
Operons
One transcript, many genes
No introns.
One gene, one protein
Open reading frames
One ORF per gene
ORFs begin with start,
end with stop codon (def.)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
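The ORF definition above (begins with a start codon, ends with a stop codon) can be sketched in a few lines of Python. This is only an illustration, not the method of any particular ORF detector: the name find_orfs, the min_codons cutoff, the use of ATG as the only start codon, and the restriction to the three forward reading frames are all assumptions made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: ATG is the only start codon considered


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts every codon from start to stop, inclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open an ORF at the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

A fuller detector would also scan the reverse complement and apply a much larger length cutoff, since short ORFs occur by chance.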
20
Gene prediction through comparative genomics
Highly similar (conserved) regions
between two genomes are likely functional;
otherwise they would have diverged
If genomes are too closely related all
regions are similar, not just genes
If genomes are too far apart, analogous
regions may be too dissimilar to be found
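A rough way to see the trade-off above is to score percent identity in sliding windows along two sequences that have already been aligned. The sketch below assumes equal-length, pre-aligned input with "-" for gaps; the function name window_identity and the window and step sizes are hypothetical choices, not part of any comparative genomics tool.

```python
def window_identity(a, b, window=10, step=5):
    """Return (start, fraction_identical) for each window of an alignment.

    a and b must be pre-aligned strings of equal length; gap columns ("-")
    never count as matches.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(a) - window + 1, step):
        pairs = zip(a[start:start + window], b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        scores.append((start, matches / window))
    return scores


# Toy alignment: first half identical, second half fully diverged.
print(window_identity("AAAAAAAAAA", "AAAAATTTTT", window=5, step=5))
```

With very closely related genomes nearly every window scores high, and with very distant ones nearly every window scores low, so conserved regions stand out only at intermediate distances.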
22
Gene discovery using ESTs
Expressed Sequence Tags (ESTs)
represent sequences from expressed
genes.
If a region matches an EST with high
stringency, the region is probably a
gene or pseudogene.
EST overlapping exon boundary gives
an accurate prediction of exon boundary.
23
Ab initio gene prediction
Prokaryotes
ORF-Detectors
Eukaryotes
Position, extent & direction: through promoter
and polyA-signal predictors
Structure: through splice site predictors
Exact location of coding sequences: through
determination of relationships between
potential start codons, splice sites, ORFs,
and stop codons
25
How it works I – Motif identification
Exon-Intron Borders = Splice Sites
    Exon                Intron                  Exon
~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~
~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~
~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~
~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~
       Splice site                     Splice site
    Exon                Intron                  Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
       Splice site                     Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
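The idea behind motif identification can be illustrated with the four intron boundaries shown above: stack the sequences and look for columns where every sequence agrees. This is a minimal sketch, not how the motif extraction programs linked above work; the helper names column_counts and invariant_positions are made up for this example.

```python
from collections import Counter

# First 10 intron bases (donor side) and last 10 intron bases (acceptor side)
# of the four example sequences on this slide.
donor_sites = ["gtttgtagac", "gtgagccgtg", "gtaagaggtt", "gtgagtgcca"]
acceptor_sites = ["tgtgtttcag", "tctattctag", "atatctccag", "ttatttccag"]


def column_counts(seqs):
    """Count the bases seen at each position across equal-length sequences."""
    return [Counter(column) for column in zip(*seqs)]


def invariant_positions(seqs):
    """Map position -> base for columns where all sequences share one base."""
    return {
        i: column[0]
        for i, column in enumerate(zip(*seqs))
        if len(set(column)) == 1
    }


print(invariant_positions(donor_sites))
print(invariant_positions(acceptor_sites))
```

The intron-initial "gt" and intron-final "ag" come out as invariant columns, which is the GT...AG pattern highlighted in the second alignment. With only four sequences, other columns can look invariant by chance, so real motif finders work from far larger training sets and use the per-column counts rather than strict invariance.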
26
How it works II – Movies
Pribnow-Box Finder 0/1
Pribnow-Box Finder all
27
How it works III – The (ugly) truth
28
Gene prediction programs
Rule-based programs
Use explicit set of rules to make decisions.
Example: GeneFinder
Neural Network-based programs
Use data set to build rules.
Examples: Grail, GrailEXP
Hidden Markov Model-based programs
Use probabilities of states and transitions
between these states to predict features.
Examples: Genscan, GenomeScan
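To make "probabilities of states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is a sketch only: the two states, every probability value, and the function name viterbi are invented for illustration and bear no relation to the actual models inside Genscan or GenomeScan, which use many more states and trained parameters.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters, chosen only so that GC-rich stretches
# favour the "coding" state and AT-rich stretches favour "noncoding".
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq (computed in log space)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("ATATATATATGCGCGCGCGC"))
```

The low transition probability between states is what keeps the path from flipping at every base: a state change is accepted only when enough downstream bases support it.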
29
Evaluating prediction programs
Sensitivity vs. Specificity
Sensitivity
How many genes were found out of all
present?
Sn = TP/(TP+FN)
Specificity
How many predicted genes are indeed genes?
Sp = TP/(TP+FP)
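The two formulas above can be computed directly from sets of predicted and true features. The sketch below assumes features are represented as hashable items (for example nucleotide positions, exon coordinates, or gene identifiers); the function name sensitivity_specificity is hypothetical. Note that Sp as defined on this slide, TP/(TP+FP), is the quantity called precision or positive predictive value in other fields.

```python
def sensitivity_specificity(predicted, actual):
    """Return (Sn, Sp) using Sn = TP/(TP+FN) and Sp = TP/(TP+FP).

    predicted and actual are sets of features at one level of comparison
    (nucleotide, exon, or gene).
    """
    tp = len(predicted & actual)   # predicted and real
    fp = len(predicted - actual)   # predicted but not real
    fn = len(actual - predicted)   # real but missed
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tp / (tp + fp) if (tp + fp) else float("nan")
    return sn, sp


# Toy nucleotide-level example: positions called coding vs. truly coding.
sn, sp = sensitivity_specificity({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```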
30
Gene prediction accuracies
Nucleotide level: 95% Sn, 90% Sp (lows less than
50%)
Exon level: 75% Sn, 68% Sp (lows less than 30%)
Gene level: 40% Sn, 30% Sp (lows less than 10%)
Programs that combine statistical evaluations with
similarity searches are most powerful.
31
Common difficulties
First and last exons difficult to annotate
because they contain UTRs.
Small genes often fail to reach statistical
significance, so they are discarded.
Algorithms are trained with sequences from
known genes which biases them against genes
about which nothing is known.
Masking repeats frequently removes potentially
indicative chunks from the untranslated regions
of genes that contain repetitive elements.
32
The annotation pipeline
Mask repeats using RepeatMasker.
Run sequence through several programs.
Take predicted genes and do similarity
search against ESTs and genes from
other organisms.
Do similarity search for non-coding
sequences to find ncRNA.
33
Annotation nomenclature
Known Gene – Predicted gene matches the
entire length of a known gene.
Putative Gene – Predicted gene contains region
conserved with known gene. Also referred to as
“like” or “similar to”.
Unknown Gene – Predicted gene matches a
gene or EST of which the function is not known.
Hypothetical Gene – Predicted gene that does
not contain significant similarity to any known
gene or EST.
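The four categories above amount to a small decision rule over similarity-search evidence. The sketch below is one possible reading: the function name annotation_label, the three boolean inputs, and the order in which the checks are applied are assumptions, since the slide does not say how overlapping evidence is resolved.

```python
def annotation_label(full_length_match, conserved_region, uncharacterized_match):
    """Assign a nomenclature label to a predicted gene.

    full_length_match:     matches the entire length of a known gene
    conserved_region:      contains a region conserved with a known gene
    uncharacterized_match: matches a gene or EST of unknown function
    Checks run from strongest to weakest evidence (an assumed ordering).
    """
    if full_length_match:
        return "Known Gene"
    if conserved_region:
        return "Putative Gene"
    if uncharacterized_match:
        return "Unknown Gene"
    return "Hypothetical Gene"


print(annotation_label(False, True, False))
```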