Dotplots for Bioinformatics

65,107 views 14 slides Feb 13, 2013

Dot plots
Dr Avril Coghlan
Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint

Dot plots
•How can we compare the human & Drosophila
melanogaster Eyeless protein sequences?
One method is a dotplot
•A dotplot is a graphical method for assessing
similarity
Make a matrix (table) with one row for each letter in sequence 1, & one column for each letter in sequence 2
Colour in each cell where the row letter and the column letter are identical
Regions of local similarity between the 2 sequences appear as diagonal lines of coloured cells (‘dots’)
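
The recipe above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to make the slides; the function names `dot_matrix` and `render` are made up for this example.

```python
def dot_matrix(seq1, seq2):
    """Return a list of rows; cell [i][j] is True when seq1[i] == seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(matrix, seq1, seq2, dot="X", blank="."):
    """Draw the matrix as text: seq2 along the top, seq1 down the side."""
    lines = ["  " + " ".join(seq2)]
    for letter, row in zip(seq1, matrix):
        cells = " ".join(dot if cell else blank for cell in row)
        lines.append(letter + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # The two short peptide sequences used as the example in this talk.
    s1, s2 = "RQQEPVRSTC", "QQESGPVRST"
    print(render(dot_matrix(s1, s2), s1, s2))
```

Printing the grid for two short sequences makes the diagonal runs of X easy to spot by eye.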

e.g. for sequences ‘RQQEPVRSTC’ and ‘QQESGPVRST’:
Regions of local similarity between the 2 sequences appear as diagonal lines
Some off-diagonal dots may be due to chance similarities
[Figure (animated): dot plot grid with Sequence 1 (RQQEPVRSTC) down the side and Sequence 2 (QQESGPVRST) along the top]

Problem
•Make a dot-plot for DNA sequences “GCATCGGC” &
“CCATCGCCATCG”. Are there regions of similarity?

Answer
•Make a dot-plot for DNA sequences “GCATCGGC” &
“CCATCGCCATCG”. Are there regions of similarity?
CATCG in sequence 1 appears twice in sequence 2
[Figure: dot plot grid with GCATCGGC down the side and CCATCGCCATCG along the top]
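
The answer can also be checked programmatically by looking for runs of matching letters along the diagonals. The sketch below is one simple way to do this; the function name `diagonal_runs` and the `min_len` cut-off are assumptions made for this example, and positions are 0-based.

```python
def diagonal_runs(seq1, seq2, min_len=2):
    """Find maximal runs of identical letters along diagonals.

    Returns (start in seq1, start in seq2, matching text) tuples,
    using 0-based positions.
    """
    runs = []
    for i in range(len(seq1)):
        for j in range(len(seq2)):
            if seq1[i] != seq2[j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            length = 0
            while (i + length < len(seq1) and j + length < len(seq2)
                   and seq1[i + length] == seq2[j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, seq1[i:i + length]))
    return runs


if __name__ == "__main__":
    for start1, start2, text in diagonal_runs("GCATCGGC", "CCATCGCCATCG", min_len=4):
        print(start1, start2, text)
```

With `min_len=4`, the only runs reported for these two sequences are the two copies of CATCG, which correspond to the two diagonal lines in the dot plot.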

Dot plots with thresholds
•If you colour in all cells with an identical letter, some dots may be due to chance similarities
•Therefore, it is common to use a threshold to decide whether to plot a ‘dot’ in a cell
A window of a certain size (e.g. window size = 3) is moved along every possible diagonal, one position at a time
A score is calculated for each position of the window on a diagonal: the number of identical letters in the window
If the score is equal to or above the threshold (e.g. threshold = score of 2), all the cells in the window are coloured in
The window size and threshold for the dot plot are chosen by trial and error
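
The windowing procedure described above might be written as follows. This is a minimal sketch that follows the description on this slide (colour every cell of a window that passes); the function name `windowed_dot_matrix` is made up, and real tools may mark cells differently.

```python
def windowed_dot_matrix(seq1, seq2, window=3, threshold=2):
    """Dot plot with a sliding window along each diagonal.

    A window starting at cell (i, j) covers (i, j), (i+1, j+1), ...
    Its score is the number of identical letters it contains. If the
    score is >= threshold, every cell in the window is coloured in.
    """
    rows, cols = len(seq1), len(seq2)
    matrix = [[False] * cols for _ in range(rows)]
    for i in range(rows - window + 1):
        for j in range(cols - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                for k in range(window):
                    matrix[i + k][j + k] = True
    return matrix


if __name__ == "__main__":
    plot = windowed_dot_matrix("GCATCGGC", "CCATCGCCATCG", window=3, threshold=2)
    for letter, row in zip("GCATCGGC", plot):
        print(letter, "".join("X" if cell else "." for cell in row))
```

Note that a cell can be coloured even when its own two letters differ, as long as it lies in a window whose score reaches the threshold.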

e.g. for sequences “GCATCGGC” and “CCATCGCCATCG”, using a window size of 3 and a threshold of ≥2:
[Figure (animated): the sliding window is moved along each diagonal of the dot plot grid (GCATCGGC down the side, CCATCGCCATCG along the top). The score is shown at each window position: a score of 0 or 1 is < threshold, so nothing is coloured; a score of 2 or 3 is ≥ threshold, so the cells in the window are coloured in; and so on.]

Real data: fruitfly & human Eyeless
•A dot plot of fruitfly & human Eyeless proteins:
[Figure: dot plot with axes labelled Human Eyeless (horizontal) and Fruitfly Eyeless (vertical); window size = 10, threshold = 3]
Do you think we chose a good value for the window size and threshold?

Real data: fruitfly & human Eyeless
•Here is a dot plot of fruitfly and human Eyeless proteins, made using window size = 10, threshold = 5:
[Figure: dot plot with axes labelled Human Eyeless (horizontal) and Fruitfly Eyeless (vertical); window size = 10, threshold = 5]
Are there any regions of similarity?
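
The two plots above differ only in their threshold. One rough way to explore this kind of choice is to count how many window positions pass at each threshold. The sketch below does this for the short DNA sequences used earlier in the talk rather than for the Eyeless proteins, whose sequences are not given in these slides; the function names are made up for this example.

```python
def count_passing_windows(seq1, seq2, window, threshold):
    """Count diagonal window positions whose identity score is >= threshold."""
    count = 0
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                count += 1
    return count


def threshold_sweep(seq1, seq2, window):
    """Map each threshold from 1 to the window size to its passing-window count."""
    return {t: count_passing_windows(seq1, seq2, window, t)
            for t in range(1, window + 1)}


if __name__ == "__main__":
    for t, n in threshold_sweep("GCATCGGC", "CCATCGCCATCG", window=3).items():
        print("threshold", t, "->", n, "passing windows")
```

The count can only stay the same or fall as the threshold rises, so a higher threshold gives a cleaner plot at the risk of hiding weaker similarity.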

Pros and cons of dot plots
•Advantages
A dot plot can be used to identify long regions of strong similarity between two sequences
It produces a plot, which is easy to make and to interpret
It can be used to compare very short or long sequences (even whole chromosomes, millions of bases)
•Disadvantages
It is necessary to find the best window size and threshold by trial and error
A dot plot can only be used to compare 2 sequences, not >2 sequences
It doesn’t tell you what mutations occurred in the region of similarity (if there is one) since the two sequences shared a common ancestor

Software for making dotplots
•dotPlot() function in the SeqinR R library
Allows you to specify a window size and threshold
If the score in a window is ≥ the threshold, it colours in the 1st cell in the window (not all cells)
•EMBOSS dottup
Allows you to specify a window size but not a threshold
If all cells in a window are identities, it colours in all cells in the window
•EMBOSS dotmatcher
Allows you to specify a window size and threshold
Instead of using the number of identities in a window as the window score, it calculates a more complex score based on the similarities of the bases/amino acids
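
The difference between colouring only the first cell of a passing window and colouring every cell in it can be illustrated with a small sketch. It is based only on the descriptions on this slide, not on the source code of SeqinR or EMBOSS; the function `mark_windows` and its `mark` parameter are hypothetical.

```python
def mark_windows(seq1, seq2, window, threshold, mark="all"):
    """Return the set of (row, column) cells coloured in, 0-based.

    mark="all"   colours every cell of a passing window.
    mark="first" colours only the first cell of a passing window.
    """
    if mark not in ("all", "first"):
        raise ValueError("mark must be 'all' or 'first'")
    cells = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                span = window if mark == "all" else 1
                cells.update((i + k, j + k) for k in range(span))
    return cells


if __name__ == "__main__":
    s1, s2 = "GCATCGGC", "CCATCGCCATCG"
    print(sorted(mark_windows(s1, s2, window=3, threshold=3, mark="first")))
    print(sorted(mark_windows(s1, s2, window=3, threshold=3, mark="all")))
```

With the first-cell convention, a diagonal line in the plot is shorter than the matching region by up to window size minus 1 cells, which is worth keeping in mind when comparing plots made by different tools.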

Problem
•Make a dot-plot for amino acid sequences
“RQQEPVRSTC” and “QQESGPVRST”, using a
window size of 3, and a threshold of ≥3

Answer
•Make a dot-plot for sequences “RQQEPVRSTC” and “QQESGPVRST”,
using window size: 3, threshold: ≥3
[Figure: dot plot grid with RQQEPVRSTC down the side and QQESGPVRST along the top]
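
When the threshold equals the window size, a window passes only if every letter in it matches, so the problem reduces to finding shared words of that length. The sketch below uses that observation and colours all cells of each passing window, as described earlier in this talk; the function name `exact_window_cells` is made up, and positions are 0-based (row, column) pairs.

```python
from collections import defaultdict


def exact_window_cells(seq1, seq2, window):
    """Cells coloured when threshold == window size (all letters must match)."""
    # Index every word of length `window` in seq2 by its start positions.
    index = defaultdict(list)
    for j in range(len(seq2) - window + 1):
        index[seq2[j:j + window]].append(j)
    cells = set()
    for i in range(len(seq1) - window + 1):
        # Look up the word starting at i in seq1; colour the whole window.
        for j in index.get(seq1[i:i + window], []):
            cells.update((i + k, j + k) for k in range(window))
    return cells


if __name__ == "__main__":
    print(sorted(exact_window_cells("RQQEPVRSTC", "QQESGPVRST", window=3)))
```

For these two sequences the coloured cells form two diagonal lines, one spelling QQE and one spelling PVRST, and the isolated off-diagonal dots of the unfiltered plot are gone.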

Further reading
•Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn
•Practical on dotplots in R in the Little Book of R for Bioinformatics:
https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html