Dot plots

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading it and using ‘View Slide Show’ in PowerPoint
Dot plots
• How can we compare the human & Drosophila
  melanogaster Eyeless protein sequences?
  One method is a dotplot
• A dotplot is a graphical method for assessing
  similarity
  Make a matrix (table) with one row for each letter in sequence 1, & one
       column for each letter in sequence 2
  Colour in each cell with an identical letter in the 2 sequences
  Regions of local similarity between the 2 sequences appear as diagonal
       lines of coloured cells (‘dots’)
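e.g. for sequences ‘RQQEPVRSTC’ (sequence 1, rows) and ‘QQESGPVRST’ (sequence 2, columns):

                   Q   Q   E   S   G   P   V   R   S   T
               R   .   .   .   .   .   .   .   ●   .   .
               Q   ●   ●   .   .   .   .   .   .   .   .
               Q   ●   ●   .   .   .   .   .   .   .   .
               E   .   .   ●   .   .   .   .   .   .   .
               P   .   .   .   .   .   ●   .   .   .   .
               V   .   .   .   .   .   .   ●   .   .   .
               R   .   .   .   .   .   .   .   ●   .   .
               S   .   .   .   ●   .   .   .   .   ●   .
               T   .   .   .   .   .   .   .   .   .   ●
               C   .   .   .   .   .   .   .   .   .   .

     (● = identical letters in the 2 sequences, . = no match)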

     Regions of local similarity between the 2 sequences appear as
     diagonal lines
     Some off-diagonal dots may be due to chance similarities
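
The matrix described above can be sketched in a few lines of base R. This is only an illustration, not the code used to draw the slides; the variable names `dots` and `grid` are made up here.

```r
# Build an identity dot matrix: rows = sequence 1, columns = sequence 2.
seq1 <- strsplit("RQQEPVRSTC", "")[[1]]
seq2 <- strsplit("QQESGPVRST", "")[[1]]

# dots[i, j] is TRUE when letter i of sequence 1 equals letter j of sequence 2
dots <- outer(seq1, seq2, "==")
dimnames(dots) <- list(seq1, seq2)

# Print as text: "*" for a dot, "." for an empty cell
grid <- ifelse(dots, "*", ".")
print(noquote(grid))
```

The two diagonal runs (QQE and PVRST) show up as consecutive `*` cells running from top left to bottom right.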
Problem
• Make a dot-plot for DNA sequences “GCATCGGC” &
  “CCATCGCCATCG”. Are there regions of similarity?
Answer
• Make a dot-plot for DNA sequences “GCATCGGC” &
  “CCATCGCCATCG”. Are there regions of similarity?
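       C   C   A   T   C   G   C   C   A   T   C   G     (sequence 2)
   G   .   .   .   .   .   ●   .   .   .   .   .   ●
   C   ●   ●   .   .   ●   .   ●   ●   .   .   ●   .
   A   .   .   ●   .   .   .   .   .   ●   .   .   .
   T   .   .   .   ●   .   .   .   .   .   ●   .   .
   C   ●   ●   .   .   ●   .   ●   ●   .   .   ●   .
   G   .   .   .   .   .   ●   .   .   .   .   .   ●
   G   .   .   .   .   .   ●   .   .   .   .   .   ●
   C   ●   ●   .   .   ●   .   ●   ●   .   .   ●   .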

  CATCG in sequence 1 appears twice in sequence 2
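
As a quick check of this answer, the shared word can be located with base R's `gregexpr()`. This is a sketch for this one example; the variable names are illustrative.

```r
seq1 <- "GCATCGGC"
seq2 <- "CCATCGCCATCG"
motif <- "CATCG"

# Start positions of the motif in each sequence (fixed = TRUE: plain text, no regex)
pos_in_seq1 <- as.integer(gregexpr(motif, seq1, fixed = TRUE)[[1]])
pos_in_seq2 <- as.integer(gregexpr(motif, seq2, fixed = TRUE)[[1]])

print(pos_in_seq1)  # where CATCG starts in sequence 1
print(pos_in_seq2)  # where CATCG starts in sequence 2
```

Each start position in sequence 2 corresponds to one diagonal line of five dots in the plot.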
Dot plots with thresholds
• If you colour in all cells with an identical letter, some
  dots may be due to chance similarities
• Therefore, it is common to use a threshold to decide
  whether to plot a ‘dot’ in a cell
  A window of a certain size (e.g. window size = 3) is moved along all possible
        diagonals, one by one
  A score is calculated for each position of the window on a diagonal:
        the number of identical letters in the window
  If the score is equal to or above the threshold (e.g. threshold = score of
        2), all the cells in the window are coloured in
  The window size and threshold for the dot plot are usually chosen by
        trial and error
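
The window-and-threshold rule above can be sketched in base R as follows. The function name `windowed_dotplot` and its arguments are hypothetical, and the sketch follows the rule as stated on this slide (colour every cell of a qualifying window); real tools may colour cells differently.

```r
windowed_dotplot <- function(s1, s2, window = 3, threshold = 2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  same <- outer(a, b, "==")   # identity matrix of the two sequences
  coloured <- matrix(FALSE, length(a), length(b), dimnames = list(a, b))
  if (length(a) < window || length(b) < window) return(coloured)
  offsets <- 0:(window - 1)
  for (i in 1:(length(a) - window + 1)) {
    for (j in 1:(length(b) - window + 1)) {
      # the cells covered by a window starting at (i, j) on its diagonal
      cells <- cbind(i + offsets, j + offsets)
      # score = number of identical letters in the window
      if (sum(same[cells]) >= threshold) coloured[cells] <- TRUE
    }
  }
  coloured
}

plot3 <- windowed_dotplot("GCATCGGC", "CCATCGCCATCG", window = 3, threshold = 2)
print(noquote(ifelse(plot3, "*", ".")))
```

Note that with a threshold below the window size, a coloured cell need not itself be a match: it only has to sit in a window that scores high enough.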
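e.g. for sequences “GCATCGGC” and “CCATCGCCATCG”, using a window
      size of 3 and a threshold of ≥2:

      [Animated figure: a grid with “GCATCGGC” down the rows and
      “CCATCGCCATCG” across the columns; a sliding window of 3 cells
      moves along each diagonal, one position at a time]

          Score = 2 or 3: ≥ threshold → colour in the window
          Score = 0 or 1: < threshold → leave the window blank
          ... and so on for every position of the sliding window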
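
To make the scoring concrete, the sketch below computes the window scores along one diagonal of this example. The main diagonal is chosen here purely for illustration; the slide's animation may start on a different one.

```r
seq1 <- strsplit("GCATCGGC", "")[[1]]
seq2 <- strsplit("CCATCGCCATCG", "")[[1]]
window <- 3
threshold <- 2

# Matches along the main diagonal: cell (k, k) for k = 1..8
n <- min(length(seq1), length(seq2))
on_diag <- seq1[1:n] == seq2[1:n]

# Window score = number of identical letters in each run of `window` cells
starts <- 1:(n - window + 1)
scores <- sapply(starts, function(k) sum(on_diag[k:(k + window - 1)]))

print(data.frame(start = starts, score = scores, colour = scores >= threshold))
```

On this diagonal every window reaches the threshold, so all eight cells end up coloured, including the two mismatched cells.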
Real data: fruitfly & human Eyeless
• A dot plot of fruitfly & human Eyeless proteins:
  [Figure: dot plot with axes “Fruitfly Eyeless” and “Human Eyeless”;
  window size = 10, threshold = 3]
  Do you think we chose a good value for the
  window-size and threshold?
Real data: fruitfly & human Eyeless
• Here is a dot plot of fruitfly and human Eyeless
  proteins, made using windowsize=10, threshold=5:
  [Figure: dot plot with axes “Fruitfly Eyeless” and “Human Eyeless”;
  window size = 10, threshold = 5]
  Are there any regions of similarity?
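
One rough way to think about the two threshold choices is to ask how often a window would reach the threshold by chance. The sketch below assumes, purely for illustration, unrelated sequences in which all 20 amino acids are equally likely and positions are independent, so the window score follows a binomial distribution. Real proteins do not meet these assumptions exactly, and `chance_of_dot` is a made-up helper name.

```r
window <- 10
p_match <- 1 / 20   # assumption: 20 equally likely amino acids

# P(score >= threshold) = P(score > threshold - 1) under a binomial model
chance_of_dot <- function(threshold) {
  pbinom(threshold - 1, size = window, prob = p_match, lower.tail = FALSE)
}

p3 <- chance_of_dot(3)   # threshold = 3
p5 <- chance_of_dot(5)   # threshold = 5
print(c(threshold_3 = p3, threshold_5 = p5))
```

Under these assumptions roughly 1 window in 100 passes a threshold of 3 by chance, which is a lot of noise across a whole protein-by-protein grid, while fewer than 1 in 10,000 pass a threshold of 5.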
Pros and cons of dot plots
• Advantages
  A dot plot can be used to identify long regions of strong similarity
  between two sequences
  It produces a plot, which is easy to make and to interpret
  It can be used to compare very short or long sequences (even whole
        chromosomes – millions of bases)
• Disadvantages
  It is necessary to find the best window size and threshold by trial and
  error
  A dot plot can only compare 2 sequences at a time, not 3 or more
  It doesn’t tell you what mutations occurred in the region of
  similarity (if there is one) since the two sequences shared a
  common ancestor
Software for making dotplots
• dotPlot() function in the SeqinR R library
  Allows you to specify a windowsize and threshold
  If the score in a window is ≥ the threshold, it colours in the 1st cell in
        the window (not all cells)
• EMBOSS dottup
  Allows you to specify a windowsize but not a threshold
  If all cells in a window are identities, it colours in all cells in the window
• EMBOSS dotmatcher
  Allows you to specify a windowsize and threshold
  Instead of using the number of identities in a window as the window
        score, it calculates a more complex score based on the
  similarities of the bases/amino acids
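
A minimal sketch of calling SeqinR's `dotPlot()` on the earlier DNA example is shown below. It assumes the `seqinr` package is installed and does nothing otherwise; in `dotPlot()` the window size is the `wsize` argument and the threshold is `nmatch`. The axis labels and the `drawn` flag are illustrative choices.

```r
seq1 <- strsplit("GCATCGGC", "")[[1]]
seq2 <- strsplit("CCATCGCCATCG", "")[[1]]

drawn <- FALSE
if (requireNamespace("seqinr", quietly = TRUE)) {
  # wsize = window size, nmatch = number of matches needed within a window
  seqinr::dotPlot(seq1, seq2, wsize = 3, nmatch = 2,
                  xlab = "Sequence 1", ylab = "Sequence 2")
  drawn <- TRUE
}
```

`dotPlot()` expects each sequence as a vector of single characters, which is why the strings are split first.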
Problem
• Make a dot-plot for amino acid sequences
  “RQQEPVRSTC” and “QQESGPVRST”, using a
  window size of 3, and a threshold of ≥3
Answer
•   Make a dot-plot for sequences “RQQEPVRSTC” and “QQESGPVRST”,
    using window size: 3, threshold: ≥3

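                Q   Q   E   S   G   P   V   R   S   T
            R   .   .   .   .   .   .   .   .   .   .
            Q   ●   .   .   .   .   .   .   .   .   .
            Q   .   ●   .   .   .   .   .   .   .   .
            E   .   .   ●   .   .   .   .   .   .   .
            P   .   .   .   .   .   ●   .   .   .   .
            V   .   .   .   .   .   .   ●   .   .   .
            R   .   .   .   .   .   .   .   ●   .   .
            S   .   .   .   .   .   .   .   .   ●   .
            T   .   .   .   .   .   .   .   .   .   ●
            C   .   .   .   .   .   .   .   .   .   .

    Only the diagonals QQE and PVRST remain; the isolated chance dots
    are no longer plotted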
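
When the threshold equals the window size, a window is coloured only if all its letters match, so the task reduces to finding words of length 3 shared by the two sequences. The sketch below uses that shortcut in base R; `kmers` is a hypothetical helper name.

```r
# All overlapping words of length k in a string
kmers <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))
  substring(s, 1:(n - k + 1), k:n)
}

seq1 <- "RQQEPVRSTC"
seq2 <- "QQESGPVRST"
k1 <- kmers(seq1, 3)
k2 <- kmers(seq2, 3)

# Each hit is a window start: row = position in sequence 1, col = position in sequence 2
hits <- which(outer(k1, k2, "=="), arr.ind = TRUE)
hits <- hits[order(hits[, "row"]), , drop = FALSE]
print(data.frame(word = k1[hits[, "row"]], row = hits[, "row"], col = hits[, "col"]))
```

Each hit colours three cells along its diagonal, and overlapping hits merge into one longer line.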
Further reading
•   Chapter 3 of Introduction to Computational Genomics by Cristianini & Hahn
•   Practical on dotplots in R in the Little Book of R for Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html

Editor's Notes

  • #4 In R:
      setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
      library("seqinr")
      seq1 <- "RQQEPVRSTC"
      seq2 <- "QQESGPVRST"
      seq1b <- s2c(seq1)   # split the string into a vector of characters
      seq2b <- s2c(seq2)
      source("dotplot.R")  # local script, not included here
      makeDotPlot1(seq1b,seq2b,dotsize=1)
  • #6 In R:
      setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
      library("seqinr")
      seq1 <- "GCATCGGC"
      seq2 <- "CCATCGCCATCG"
      seq1b <- s2c(seq1)
      seq2b <- s2c(seq2)
      source("dotplot.R")
      makeDotPlot1(seq1b,seq2b,dotsize=1)
  • #8 In R:
      setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
      library("seqinr")
      seq1 <- "GCATCGGC"
      seq2 <- "CCATCGCCATCG"
      seq1b <- s2c(seq1)
      seq2b <- s2c(seq2)
      source("dotplot.R")
      makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=3,threshold=2)
  • #9
      setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
      library("seqinr")
      seq1 <- read.fasta("human.fa") # human Eyeless
      seq2 <- read.fasta("fly.fa") # fruitfly Eyeless
      seq1b <- seq1[[1]]
      seq2b <- seq2[[1]]
      source("dotplot.R")
      makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=10,threshold=3)
    Saved picture as dotplot2.png
  • #10
      setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
      library("seqinr")
      seq1 <- read.fasta("human.fa") # human Eyeless
      seq2 <- read.fasta("fly.fa") # fruitfly Eyeless
      seq1b <- seq1[[1]]
      seq2b <- seq2[[1]]
      source("dotplot.R")
      makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=10,threshold=5)
    Saved picture as dotplot1.png
  • #14 In R:
      setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
      library("seqinr")
      seq1 <- "RQQEPVRSTC"
      seq2 <- "QQESGPVRST"
      seq1b <- s2c(seq1)
      seq2b <- s2c(seq2)
      source("dotplot.R")
      makeDotPlot2(seq1b,seq2b,dotsize=1,windowsize=3,threshold=3)