Dot plots
Dr Avril Coghlan
Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
Dot plots
•How can we compare the human & Drosophila
melanogaster Eyeless protein sequences?
One method is a dot plot
•A dot plot is a graphical method for assessing
similarity between two sequences
Make a matrix (table) with one row for each letter in sequence 1, & one
column for each letter in sequence 2
Colour in each cell where the row letter and the column letter are identical
Regions of local similarity between the 2 sequences appear as diagonal
lines of coloured cells (‘dots’)
Some off-diagonal dots may be due to chance similarities
e.g. for sequences ‘RQQEPVRSTC’ (sequence 1) and ‘QQESGPVRST’ (sequence 2):
[Figure: animated dot plot with Sequence 1 (‘RQQEPVRSTC’) as rows and Sequence 2 (‘QQESGPVRST’) as columns]
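The matrix-colouring steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original talk: the function name `dot_plot` and the use of `*` for a coloured cell and `.` for an empty cell are choices made here.

```python
def dot_plot(seq1, seq2):
    """Return one string per letter of seq1 (rows).

    In each row, '*' marks a column whose seq2 letter is identical
    to the row's seq1 letter, and '.' marks a non-identical letter.
    """
    return ["".join("*" if a == b else "." for b in seq2) for a in seq1]


seq1 = "RQQEPVRSTC"  # sequence 1: one row per letter
seq2 = "QQESGPVRST"  # sequence 2: one column per letter
rows = dot_plot(seq1, seq2)

# Print the matrix with sequence 2 across the top and sequence 1 down the side.
print("  " + seq2)
for letter, row in zip(seq1, rows):
    print(letter + " " + row)
```

In the printed matrix, the shared stretches QQE and PVRST show up as two diagonal lines of `*`, and the few isolated `*` cells away from those diagonals are the chance similarities.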
Problem
•Make a dot-plot for DNA sequences “GCATCGGC” &
“CCATCGCCATCG”. Are there regions of similarity?
Answer
•Yes: CATCG (letters 2 to 6 of sequence 1) appears twice in sequence 2
(letters 2 to 6 and letters 8 to 12), so the dot plot has two diagonal lines
[Figure: dot plot with ‘GCATCGGC’ as rows and ‘CCATCGCCATCG’ as columns]
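One way to check this answer is to list the diagonal runs of identical letters directly, rather than reading them off the plot. The Python sketch below is an illustration only. The function name `diagonal_runs` and the `min_length` cut-off are assumptions made here, not something from the talk.

```python
def diagonal_runs(seq1, seq2, min_length):
    """Find maximal runs of identical letters along the diagonals.

    Returns a list of (start in seq1, start in seq2, length) tuples,
    with positions counted from 1.
    """
    runs = []
    n, m = len(seq1), len(seq2)
    for i in range(n):
        for j in range(m):
            if seq1[i] != seq2[j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            # Extend the run down the diagonal for as long as the letters match.
            length = 0
            while (i + length < n and j + length < m
                   and seq1[i + length] == seq2[j + length]):
                length += 1
            if length >= min_length:
                runs.append((i + 1, j + 1, length))
    return runs


runs = diagonal_runs("GCATCGGC", "CCATCGCCATCG", min_length=4)
for start1, start2, length in runs:
    print(start1, start2, length)
```

With `min_length=4` this reports two runs of length 5, both starting at letter 2 of sequence 1, and starting at letters 2 and 8 of sequence 2: the two copies of CATCG.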
Dot plots with thresholds
•If you colour in all cells with an identical letter, some
dots may be due to chance similarities
•Therefore, it is common to use a threshold to decide
whether to plot a ‘dot’ in a cell
A window of a fixed size (e.g. window size = 3) is slid along every
diagonal, one position at a time
A score is calculated for each position of the window on a diagonal:
the number of identical letters in the window
If the score is equal to or above the threshold (e.g. threshold = score of
2), all the cells in the window are coloured in
The window size and threshold are usually chosen by trial and error
e.g. for sequences “GCATCGGC” and “CCATCGCCATCG”, using a window
size of 3 and a threshold of ≥2:
[Figure: animation of the sliding window (size 3) moving along each diagonal of the dot plot, with ‘GCATCGGC’ as rows and ‘CCATCGCCATCG’ as columns. Windows with a score of 0 or 1 are below the threshold and left blank; windows with a score of 2 or 3 are coloured in.]
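The sliding-window procedure can be sketched as follows. This is a minimal Python illustration of the rule described on this slide (colour every cell of a window whose score reaches the threshold). It is not the code of any particular dot plot program, and the names `windowed_dot_plot` and `render` are made up for this example.

```python
def windowed_dot_plot(seq1, seq2, window, threshold):
    """Return the set of (row, column) cells to colour, counted from 0.

    A window of `window` cells is placed at every position on every
    diagonal. Its score is the number of identical letters inside it.
    If the score is >= threshold, all cells in the window are coloured.
    """
    coloured = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                coloured.update((i + k, j + k) for k in range(window))
    return coloured


def render(seq1, seq2, coloured):
    """Draw the plot as text: '*' for a coloured cell, '.' otherwise."""
    lines = ["  " + seq2]
    for i, letter in enumerate(seq1):
        cells = "".join("*" if (i, j) in coloured else "."
                        for j in range(len(seq2)))
        lines.append(letter + " " + cells)
    return "\n".join(lines)


seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
cells = windowed_dot_plot(seq1, seq2, window=3, threshold=2)
print(render(seq1, seq2, cells))
```

Because the whole window is coloured, a cell can be coloured even when its own two letters differ. For example, the top-left cell (G against C) is coloured here because the window starting there contains two identities (C and A).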
Real data: fruitfly & human Eyeless
•A dot plot of fruitfly & human Eyeless proteins:
[Figure: dot plot of fruitfly Eyeless against human Eyeless, made using window size = 10, threshold = 3]
Do you think we chose good values for the
window size and threshold?
Real data: fruitfly & human Eyeless
•Here is a dot plot of fruitfly and human Eyeless
proteins, made using windowsize=10, threshold=5:
Are there any regions of similarity?
[Figure: dot plot of fruitfly Eyeless against human Eyeless, made using window size = 10, threshold = 5]
Pros and cons of dot plots
•Advantages
A dot plot can be used to identify long regions of strong similarity
between two sequences
It produces a plot, which is easy to make and to interpret
It can be used to compare very short or very long sequences (even whole
chromosomes, millions of bases long)
•Disadvantages
The best window size and threshold have to be found by trial and error
A dot plot can only compare 2 sequences at a time, not >2 sequences
It doesn’t tell you what mutations occurred in the region of
similarity (if there is one) since the two sequences shared a
common ancestor
Software for making dot plots
•dotPlot() function in the SeqinR R library
Allows you to specify a window size and threshold
If the score in a window is ≥ the threshold, it colours in the 1st cell in
the window (not all cells)
•EMBOSS dottup
Allows you to specify a window size but not a threshold
If all cells in a window are identities, it colours in all cells in the window
•EMBOSS dotmatcher
Allows you to specify a window size and threshold
Instead of using the number of identities in a window as the window
score, it calculates a more complex score based on the
similarities of the bases/amino acids
Problem
•Make a dot-plot for amino acid sequences
“RQQEPVRSTC” and “QQESGPVRST”, using a
window size of 3, and a threshold of ≥3
Answer
•Make a dot-plot for sequences “RQQEPVRSTC” and “QQESGPVRST”,
using window size: 3, threshold: ≥3
[Figure: answer dot plot with ‘RQQEPVRSTC’ as rows and ‘QQESGPVRST’ as columns]
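When the threshold equals the window size, a window is coloured only if every letter in it is identical, so the problem reduces to finding shared 3-letter words. The Python sketch below uses that shortcut. It is an illustration under that assumption, and the name `exact_word_dot_plot` is invented for this example.

```python
from collections import defaultdict


def exact_word_dot_plot(seq1, seq2, word_size):
    """Colour every cell of each diagonal window whose letters all match.

    Returns the set of coloured (row, column) cells, counted from 0.
    """
    # Record where each word of length `word_size` starts in seq2.
    starts_in_seq2 = defaultdict(list)
    for j in range(len(seq2) - word_size + 1):
        starts_in_seq2[seq2[j:j + word_size]].append(j)

    coloured = set()
    for i in range(len(seq1) - word_size + 1):
        word = seq1[i:i + word_size]
        # Every place the same word starts in seq2 is a fully matching window.
        for j in starts_in_seq2.get(word, []):
            coloured.update((i + k, j + k) for k in range(word_size))
    return coloured


seq1, seq2 = "RQQEPVRSTC", "QQESGPVRST"
cells = exact_word_dot_plot(seq1, seq2, word_size=3)

print("  " + seq2)
for i, letter in enumerate(seq1):
    print(letter + " " + "".join("*" if (i, j) in cells else "."
                                 for j in range(len(seq2))))
```

Only the two diagonals remain: QQE (3 cells) and PVRST (5 cells). The isolated chance matches seen in the plot without a threshold are no longer coloured.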
Further reading
•Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn
•Practical on dot plots in R in the Little Book of R for Bioinformatics:
https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html