Computational Prediction of Orthologs
Melvin Zhang
School of Computing,
National University of Singapore
May 4, 2011
A gene is a unit of heredity in a living organism
One gene may encode multiple proteins
Two genes are homologous if they descended from
a common ancestral gene
In practice, homology is determined using.
Figure:
Have you seen phrases like,
homology", or?
with respect to a specific speciation event
Orthologs are due to speciation, paralogs are due
to duplication
Figure: gene tree descending from the MRCA of G and H, with speciation and duplication events; genes g, h, and h' are labeled as main orthologs, orthologs, and paralogs.
Orthologs maintain their function
Annotate genes with unknown
functions.
Infer protein-protein
interactions.
Orthologs are not one-to-one due to
lineage-specific gene duplications
Main orthologs
position.
Burgetz et al., Evolutionary Bioinformatics 2006
Problem of identifying main orthologs
Input
Output
direct descendant in G and H
Complications
- gene duplication
- gene loss
- horizontal gene transfer
- gene fusion, fission
Three main approaches for finding orthologs
Graph based, Tree based, Rearrangement based
Bidirectional Best Hit and variants
Most popular approach. High
level of functional relatedness.
Reciprocal smallest distance
use evolutionary distance
estimate instead of BLAST
scores
OMA stable pairs
introduce a tolerance interval
and stable matching
Altenhoff et al., PLoS CB 2009
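The bidirectional best hit idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular tool: the function names and the toy scores are hypothetical, and ties are broken by whichever pair is seen first.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores):
    """Return pairs (g, h) where g and h are each other's best hit.

    `scores` maps (gene in G, gene in H) to a similarity score.
    """
    forward = best_hits(scores)
    backward = best_hits({(h, g): s for (g, h), s in scores.items()})
    return sorted((g, h) for g, h in forward.items() if backward.get(h) == g)


# Toy scores, made up for illustration only.
toy_scores = {
    ("g1", "h1"): 900, ("g1", "h2"): 400,
    ("g2", "h1"): 850, ("g2", "h2"): 300,
}
bbh_pairs = bidirectional_best_hits(toy_scores)
```

In the toy data, g2's best hit is h1, but h1's best hit is g1, so only the pair (g1, h1) is reported.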
EnsemblCompara GeneTrees
Based on reconciliation of gene trees with species tree.
Vilella et al., Genome Res 2009
Can conserved gene neighborhood improve
ortholog predictions?
Human-mouse synteny blocks
Conserved synteny blocks between human and mouse genome
generated by the Cinteny web server
Sinha and Meller, BMC Bioinformatics 2007
Local synteny criteria
Figure:
genes. Homology defined as BLASTP E-value < 1e-5
94% of sampled inter-species pairs are identified as orthologs
by Inparanoid (based on BBH) and local synteny criteria.
Jin Jun et al., BMC Genomics 2009
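The homology threshold above (BLASTP E-value < 1e-5) could be applied to BLAST results roughly as follows. This sketch assumes the standard 12-column tabular BLAST output, where the E-value is the 11th field; the function name and the toy lines are hypothetical.

```python
def homologous_pairs(blast_lines, max_evalue=1e-5):
    """Collect (query, subject) pairs whose E-value is below the cutoff.

    Assumes tab-separated BLAST tabular output with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    pairs = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue < max_evalue:
            pairs.add((query, subject))
    return pairs


# Toy hits, made up for illustration only.
toy_lines = [
    "g1\th1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t550",
    "g1\th2\t35.0\t120\t70\t4\t10\t129\t15\t134\t0.002\t40.1",
]
toy_pairs = homologous_pairs(toy_lines)
```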
Local synteny score (LC)
The local synteny score of g and h is 4 since there are 4 edges
in the maximum matching.
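One way to compute such a score is as a maximum bipartite matching between the genes neighboring g and the genes neighboring h, with an edge wherever two neighbors are homologous. The sketch below uses a simple augmenting-path search (Kuhn's algorithm); how the neighborhoods are chosen is not stated on the slide, so the neighbor lists and the toy data here are assumptions.

```python
def local_synteny_score(neighbors_g, neighbors_h, homologous):
    """Size of a maximum matching between two gene neighborhoods.

    neighbors_g, neighbors_h: lists of distinct gene ids around g and h.
    homologous: set of (gene in G, gene in H) pairs considered homologous.
    """
    match_of_h = {}  # neighbor of h -> neighbor of g it is matched to

    def try_augment(a, visited):
        # Try to match `a`, re-routing earlier matches if necessary.
        for b in neighbors_h:
            if (a, b) in homologous and b not in visited:
                visited.add(b)
                if b not in match_of_h or try_augment(match_of_h[b], visited):
                    match_of_h[b] = a
                    return True
        return False

    return sum(try_augment(a, set()) for a in neighbors_g)


# Toy neighborhoods, made up for illustration only.
toy_neighbors_g = ["a1", "a2", "a3", "a4", "a5"]
toy_neighbors_h = ["b1", "b2", "b3", "b4", "b5"]
toy_homologous = {
    ("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a3", "b3"), ("a4", "b4"),
}
toy_lc = local_synteny_score(toy_neighbors_g, toy_neighbors_h, toy_homologous)
```

In the toy data a2 can only pair with b1, so the search re-routes a1 to b2; a5 has no homolog, giving a score of 4.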
Smith-Waterman alignment score (SW)
BBH-LS: bidirectional best hits based on linear
combination of SW and LC
$\mathrm{sim}(g, h) = (1 - f)\,\mathrm{SW}(g, h) + f\,\mathrm{LC}(g, h)$
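The linear combination of SW and LC can be illustrated with a small Python sketch. The slide does not say how the two scores are scaled, so dividing each by its maximum is an assumption made here only to put them on a comparable range; the function names and toy values are hypothetical.

```python
def combined_similarity(sw, lc, f):
    """Weighted mix: (1 - f) * SW + f * LC, with f in [0, 1]."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    return (1.0 - f) * sw + f * lc


def normalize(scores):
    """Scale scores to [0, 1] by the largest value (an assumed scheme)."""
    top = max(scores.values())
    return {pair: value / top for pair, value in scores.items()}


# Toy scores, made up for illustration only.
toy_sw = {("g", "h1"): 5000.0, ("g", "h2"): 2500.0}
toy_lc = {("g", "h1"): 1.0, ("g", "h2"): 5.0}
sw_n, lc_n = normalize(toy_sw), normalize(toy_lc)

sim = {pair: combined_similarity(sw_n[pair], lc_n[pair], f=0.5) for pair in sw_n}
best_for_g = max(sim, key=sim.get)
```

With f = 0.5 the candidate h2, which has the lower alignment score but the stronger neighborhood support, ranks above h1; with f = 0 the ranking falls back to sequence similarity alone.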
Human-Mouse-Rat dataset
Input
Human, mouse, and rat genes downloaded from Ensembl.
Benchmark
No "golden" benchmark for true orthology.
Assume that orthologs are assigned the same gene symbol.
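Under this gene-symbol assumption, predicted pairs could be scored along the following lines. This is only a sketch: the function name and toy identifiers are hypothetical, the comparison is made case-insensitive on the assumption that symbol capitalization differs between species, and pairs with a missing symbol are skipped rather than counted.

```python
def score_predictions(pairs, symbol_of):
    """Count predicted pairs whose gene symbols agree or disagree.

    pairs: iterable of (gene id in species A, gene id in species B).
    symbol_of: dict mapping gene id to gene symbol (may lack some ids).
    Returns (true_positives, false_positives).
    """
    tp = fp = 0
    for g, h in pairs:
        sym_g, sym_h = symbol_of.get(g), symbol_of.get(h)
        if sym_g is None or sym_h is None:
            continue  # unnamed genes cannot be judged by this benchmark
        if sym_g.upper() == sym_h.upper():
            tp += 1
        else:
            fp += 1
    return tp, fp


# Toy identifiers, made up for illustration only.
toy_symbols = {
    "hs1": "RASGRF1", "mm1": "Rasgrf1",
    "hs2": "RASGRF2", "mm2": "Rasgrf2",
}
toy_predictions = [("hs1", "mm1"), ("hs2", "mm1"), ("hs3", "mm2")]
toy_tp, toy_fp = score_predictions(toy_predictions, toy_symbols)
```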
Tuning the BBH-LS method
$\mathrm{sim}(g, h) = (1 - f)\,\mathrm{SW}(g, h) + f\,\mathrm{LC}(g, h)$
Figure:
similarity to sequence similarity on the human-mouse dataset.
Results for various methods on Human-Mouse
Figure:
More true positives and fewer false positives than MSOAR2.
Results for various methods on Human-Rat
Figure:
Results for various methods on Mouse-Rat
Figure:
How local synteny helps
Figure: human chr 15 and chr 5 versus mouse chr 9 and chr 13, showing genes CTSH, MSH3, CKMT2, RASGRF1, RASGRF2, and ANKRD34C; candidate pairings are labeled sw = 5265, ls = 1; sw = 2003, ls = 5; and sw = 2466, ls = 5.
Bold edges are the pairing from BBH-LS, thin edges are the
pairing from BBH.
BBH paired RASGRF2 (human) to RASGRF1 (mouse) due to
high SW, corrected by BBH-LS with LC.
Summary: Identifying main orthologs
For each gene in their common ancestor, find its direct
descendant in G and H
Summary: Three approaches
Graph based, Tree based, Rearrangement based
BBH-LS: bidirectional best hits based on linear
combination of SW and LC