Erin Pleasance and Steven Jones February 23, 2004
(c) 2004 CGDN 1
The Ensembl Database
http://www.ensembl.org
Lecture 7.1 2
Ensembl is a genome browser for vertebrate genomes that
supports research in comparative genomics, evolution,
sequence variation and transcriptional regulation.
Ensembl annotates genes, computes multiple alignments,
predicts regulatory function and collects disease data.
Ensembl
What is Ensembl?
•Public annotation of mammalian and other
genomes
•Open source software
The Ensembl Project
“Ensembl is a joint project between EMBL
European Bioinformatics Institute and the
Sanger Institute to develop a software
system which produces and maintains
automatic annotation on eukaryotic
genomes. Ensembl is primarily funded by
the Wellcome Trust”
The Ensembl Project
“The main aim of this campaign is to
encourage scientists across the world - in
academia, pharmaceutical companies, and
the biotechnology and computer industries -
to use this free information.”
-Dr. Mike Dexter, Director of the Wellcome Trust
Goal: An Accessible, Annotated Genome
[Figure: ContigView display, shown as “what we want in the end”]
Ensembl Genome Annotation
•Utilizes raw DNA sequence data from public sources
•Creates a tracking database (The “Ensembl database”)
•Joins the sequences, based on a sequence scaffold
•Automatically finds genes and other features of the sequence
•Associates sequence and features with data from other sources
•Provides a publicly accessible web-based interface to the database
The Genome Problem
•The problem with the genome (particularly
human) is that it is “large, complicated, and
opaque to analysis” (Ewan Birney, Ensembl)
•Genome features to identify include:
–Genes: protein coding, RNA, pseudogenes
–Regulatory elements
–SNPs, repeats, etc.
DNA sequence in Ensembl
•Sequences are determined in fragments (contigs)
•Features cross boundaries between fragments
•Entire sequence too large and changes too much
(constantly updated and reassembled) to be stored
as one long database entry
DNA sequence in Ensembl
•Core design feature is the “virtual contig”
object
•Allows genome sequence to be accessed as
a single large contiguous sequence even
though it is stored as a collection of fragments
•VC object handles reading and writing
features to the DNA sequence
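The virtual contig idea can be sketched in a few lines of Python. This is an illustrative sketch only, not Ensembl's own implementation: the names `Fragment`, `VirtualContig` and `subseq` are hypothetical, and it assumes non-overlapping fragments with known start positions on the assembly, with gaps reported as N.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str       # hypothetical contig identifier
    start: int      # 1-based start of the fragment on the assembled sequence
    sequence: str   # DNA stored for this fragment


class VirtualContig:
    """Presents ordered, non-overlapping fragments as one continuous sequence."""

    def __init__(self, fragments):
        self.fragments = sorted(fragments, key=lambda f: f.start)
        self.starts = [f.start for f in self.fragments]

    def subseq(self, start, end):
        """Return assembled bases start..end (1-based, inclusive); gaps become 'N'."""
        out = []
        pos = start
        # Begin at the last fragment that starts at or before `start`.
        i = max(bisect_right(self.starts, start) - 1, 0)
        while pos <= end and i < len(self.fragments):
            frag = self.fragments[i]
            frag_end = frag.start + len(frag.sequence) - 1
            if frag_end < pos:
                i += 1  # this fragment lies entirely before the region
                continue
            if frag.start > pos:
                # Fill the gap in front of this fragment with N.
                gap_end = min(frag.start - 1, end)
                out.append("N" * (gap_end - pos + 1))
                pos = gap_end + 1
                continue
            take_end = min(frag_end, end)
            out.append(frag.sequence[pos - frag.start:take_end - frag.start + 1])
            pos = take_end + 1
            i += 1
        if pos <= end:
            out.append("N" * (end - pos + 1))  # region runs past the last fragment
        return "".join(out)


vc = VirtualContig([
    Fragment("contig_a", 1, "ACGT"),
    Fragment("contig_b", 5, "TTGA"),
    Fragment("contig_c", 11, "CC"),
])
region = vc.subseq(3, 6)  # spans the boundary between contig_a and contig_b
```

The caller asks for a region in assembly coordinates and never needs to know which fragments hold the bases.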
Ensembl Gene Build System
•Three-part gene build system
–“Best in genome” matches for known genes
–Alignment of homologous genes
–Ab initio gene finding
•Genes predicted on repeat-masked DNA
•All genes predicted based on experimental
(available sequence) evidence
“Best in genome” predictions
•Place known proteins from SwissProt/TrEMBL
on the genome
•Incorporate cDNAs using exonerate and
EST_genome
–Align with gaps placed preferentially at splice
consensus sites
–Allows prediction of 5’ and 3’ UTRs
•Refine predictions using genewise
“Best in genome” predictions
•Alignments shown in ContigView
[Figure: ContigView of a best-in-genome gene with associated evidence. Labels: known gene (p53), proteins aligned, cDNAs aligned, UTRs predicted, UniGene clusters aligned]
Homology predictions
•Align homologous proteins using BLAST,
genewise
–Paralogs (from same organism)
–Orthologs (from closely related organisms)
•Assemble novel genes
Ab initio gene predictions
•Use Genscan to identify novel exons
•Confirm exons by BLAST to known proteins, mRNAs,
UniGene clusters
•Genes are based on ab initio predictions but require homology
evidence
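The "predict, then require evidence" rule can be illustrated as a simple interval filter. This is a sketch under stated assumptions, not the Ensembl gene build code: exons and evidence hits are reduced to hypothetical (start, end) coordinates, and any overlap counts as confirmation.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirm_exons(predicted_exons, evidence_hits):
    """Keep only predicted exons that overlap at least one evidence alignment."""
    return [exon for exon in predicted_exons
            if any(overlaps(exon, hit) for hit in evidence_hits)]


# Made-up coordinates: three predicted exons, two evidence alignments.
predicted = [(100, 200), (500, 650), (900, 1000)]
evidence = [(150, 210), (990, 1100)]
confirmed = confirm_exons(predicted, evidence)  # the middle exon has no support
```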
[Figure: ContigView of homology gene with associated evidence. Labels: novel gene, Genscan predictions, proteins aligned, UniGene clusters aligned]
Pseudogenes
•Many pseudogenes also predicted
Manual gene annotation: Otter
•Manual annotation is done with applications, e.g. Apollo
•Otter database/server allows manual annotations to be
integrated with automated annotations
Manually curated genes: VEGA
•Chromosomes 6, 7, 13, 14, 20 and 22 contain manually
curated genes from the VEGA database
Gene information in Ensembl:
GeneView
Transcript information in Ensembl:
TransView
Protein information in Ensembl:
ProteinView
Comparative genomics in Ensembl
Gene orthologue pairs:
•Human <-> Mouse <-> Rat
<-> Fugu <-> Zebrafish
•C. elegans <-> C. briggsae
•Fly <-> Mosquito
DNA homology:
•Human <-> Mouse <-> Rat
Comparative genomics in
Ensembl: Gene orthologs
•Gene ortholog pairs shown in GeneView
•Calculated by BLAST (reciprocal best BLAST hits, or
BLAST + synteny)
•dN/dS = ratio of nonsynonymous to synonymous substitution
rates (a measure of selection)
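The reciprocal best hit rule named above can be sketched as follows. This is a minimal illustration, not Ensembl's pipeline: it assumes BLAST results have already been reduced to score tables, and the gene names and scores are made up.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject; scores is {query: {subject: score}}."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical score tables for two species.
human_vs_mouse = {
    "hTP53": {"mTrp53": 700, "mTrp63": 300},
    "hTP63": {"mTrp63": 650, "mTrp53": 280},
    "hX": {"mTrp53": 100},
}
mouse_vs_human = {
    "mTrp53": {"hTP53": 690, "hX": 90},
    "mTrp63": {"hTP63": 640},
}
pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

Here hX is dropped: its best mouse hit is mTrp53, but the best human hit of mTrp53 is hTP53, so the match is not reciprocal.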
Comparative genomics in
Ensembl: DNA homology
•DNA homology shown in ContigView
Mouse and rat homology
Comparative genomics in
Ensembl: Synteny
•Large-scale homology shown in SyntenyView
–Synteny = homologous sequence blocks, in the same order
and orientation
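One way to picture "same order and orientation" is to chain ortholog anchors into runs. The sketch below is an assumption-laden toy, not how SyntenyView is computed: anchors are hypothetical tuples (chromosome in genome B, gene order index in genome B, strand), listed in genome A order, and a block is extended only by the directly adjacent gene, which is stricter than a real method would be.

```python
def _extends(prev, nxt):
    """True if `nxt` continues the block ending at `prev`."""
    same_chrom = prev[0] == nxt[0]
    same_strand = prev[2] == nxt[2]
    step = nxt[1] - prev[1]
    # Forward blocks advance in genome B; reverse-strand blocks run backwards.
    expected = 1 if nxt[2] == "+" else -1
    return same_chrom and same_strand and step == expected


def synteny_blocks(anchors, min_size=2):
    """Group consecutive anchors (sorted by genome A position) into collinear runs."""
    blocks, current = [], []
    for anchor in anchors:
        if current and _extends(current[-1], anchor):
            current.append(anchor)
        else:
            if len(current) >= min_size:
                blocks.append(current)
            current = [anchor]
    if len(current) >= min_size:
        blocks.append(current)
    return blocks


# Made-up anchors: a forward run on chromosome 11, an inverted run on 4, a singleton on X.
anchors = [("11", 5, "+"), ("11", 6, "+"), ("11", 7, "+"),
           ("4", 20, "-"), ("4", 19, "-"), ("X", 3, "+")]
blocks = synteny_blocks(anchors)
```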
Other features in Ensembl
•Menus provide other feature options
•Features, e.g. SNPs and markers, have special views
Other data sources in Ensembl
•Ensembl incorporates gene and feature info
from many other data sources
OMIM
SwissProt
Other data sources in Ensembl:
Link out
The Distributed Annotation
System
•Allows viewing third-party annotation of the
genomic scaffold
•Users can choose the annotation they are
interested in
•Features are viewed in consistent user
interface/display
•Allows specialized feature annotation and the
comparison of different methodologies
Sequence similarity searching
•Two search methods
–SSAHA: very fast, good for identifying near-exact
DNA-DNA matches
–BLAST: slower but more accurate, can do DNA or
protein searches
•Can search against any species
•Can search against genomic sequence,
cDNAs (Ensembl or Genscan), or protein
sequences
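Why a hash-based search is fast for near-exact matches can be shown with a toy k-mer index. This is a simplified illustration of the general idea only, not SSAHA's actual algorithm or parameters: the function names are hypothetical, every overlapping k-mer is indexed, and only exact matches are verified.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_exact_matches(query, reference, index, k):
    """Look up the query's first k-mer as a seed, then verify the full match."""
    seed = query[:k]
    return [pos for pos in index.get(seed, [])
            if reference[pos:pos + len(query)] == query]


reference = "ACGTACGTTAGC"  # made-up sequence
k = 4
index = build_index(reference, k)  # built once, reused for every query
hits = find_exact_matches("ACGTT", reference, index, k)  # 0-based start positions
```

Each query costs one dictionary lookup plus a check of the few candidate positions, rather than a scan of the whole reference. That is the trade-off noted above: speed for near-exact matches, less sensitivity than BLAST for diverged sequences.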
[Figure: search results page, with hits shown relative to the genome]
•Each hit links to its alignment [A], sequence [S], or ContigView [C]
Ensembl updates
•Monthly
•Include:
–Changes in genome builds (with new annotations)
–Changes in code or database schema
–Additional views and tools on website
Pre-Ensembl
•Full annotation can take weeks
•Pre-Ensembl site provides in-progress annotation
–Placement of known proteins
–Ab initio gene predictions
–Repeat masking
–BLAST and SSAHA searching
Ensembl Software System
•Software can be accessed by FTP
•Can also be accessed through CVS
(Concurrent Versions System)
•Possible to set up a mirror of the entire
Ensembl system.