Comparisons of Sequence Alignment Scoring Functions:
On the Use of Structural Information to Improve Performance

Feb 6, 2008

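What’s a scoring function?

[Figure: dynamic programming matrix with "a b b c d d d e f g" along the columns and "a b c d e f g" along the rows, shown with two example alignments]

    a b b c d d d e f g
    a b - c d - - e f g

    a b b c d d d e f g
    a b - c c d - e f g
          - d -

Optimal :: $\max \left( \sum S(\bullet) - \sum C(-) \right)$, with similarity S and cost C > 0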

Aims
Optimal alignment problem: the native alignment scores best
Suboptimal (SO) alignment sampling problem: the native alignment scores best, poor alignments are kept to a minimum, and “unproductive” alignments are avoided
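
The objective on this slide (sum of similarities over aligned pairs minus the sum of gap costs) can be illustrated with a small sketch. The function names, the match/mismatch similarity, and the flat per-position gap cost of 0.5 are hypothetical choices made only for this example; the slides do not specify them.

```python
def alignment_score(top, bottom, similarity, gap_cost):
    """Sum of S over aligned pairs minus sum of C over gap positions."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0.0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total -= gap_cost          # C > 0 is subtracted per gap position
        else:
            total += similarity(x, y)  # S for an aligned pair
    return total


def identity_similarity(x, y):
    # Hypothetical similarity: 1 for identical symbols, 0 otherwise.
    return 1.0 if x == y else 0.0


# The first example alignment from the slide, written without spaces.
score = alignment_score("abbcdddefg", "ab-cd--efg",
                        identity_similarity, gap_cost=0.5)
print(score)
```

Under these toy settings the optimal alignment is simply the gapped pairing that maximizes this value; real scoring functions replace both the similarity and the gap cost.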
Productive versus Unproductive Alignment Sampling

[Figure: schematic drawings of the sampled alignments listed below]

Non-redundant (good):
   XXXAAADEFAAAYYY     XXXAAADEFAAAYYY     XXXAAADEFAAAYYY
   XXX-AKLMA---YYY     XXX--AKLMA--YYY     XXX---AKLMA-YYY

Redundant (not good):
   XXXAAAYYY     XXXAAAYYY     XXXAAAYYY
   XXX--aYYY     XXX-a-YYY     XXXa--YYY
Classes of Methods for Sampling Suboptimal Alignments
• Top-down Enumeration
    – Classical Waterman (near-optimal alignments): Path > Opt-δ
• Iterative Elimination (IE)
    – Waterman & Eggert
    – Saqi, Bates & Sternberg
• Parametric Sampling (PS)
    – Chivian & Baker; 2006
• Combined IE + PS
    – Jaroszewski, Li & Godzik; 2002
• Stochastic Sampling
    – John & Sali; 2003
• Fragment Set Approach (S4)

Annotation beside the list: sample over lots of … P (sim1,gap1,ss1), P (sim2,gap1,ss1), …, P (simn,gapn,ssn)
Critical Questions

Am I ranking the most native alignment first?
Am I eliminating poor/impossible alignments?
Am I sampling efficiently/with little redundancy?

Annotations beside the questions: "Within the scope of the scoring function" (upper), "Within the scope of alignment sampling" (lower)

New GN2 v. HMAP – sp2 – sp3 – sp4
Organization

Talk about software library for doing sequence alignment

Talk about the HMAP and Sparks-family of scoring functions

New method: GN2

Benchmark design & results
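HMAP2 – STL in C++ (generic programming)

[Figure: library architecture. Templates (T1, T2, T3) and queries (Q1, Q2) enter an Algorithm consisting of an Evaluator, which yields the dynamic programming matrix (DPM), an Enumerator, which yields the alignment set (AS) as a [pair list], and a Format stage producing formatted output (Fasta, PIR).]

Evaluators: HMAP, gnoali, gn2, sparks?
Enumerators: optimal, S4, Waterman, RC, ?

T  HMAP  Q  =  DPM
DPM  ENUM  =  AS

Example alignment set:
    aabbccdef     aabbccdef
    aa---cdef     ---aacdef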
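[Figure: data flow from template and query to alignments and models]

Template profile (from structure): primary sequence and sequence-based profile; secondary structure; contact numbers, distances, HBs; solvent accessibility; residue depth and depth-dependent a.a. frequencies; hydrophilicity
Query profile: primary sequence and sequence-based profile; PSIPRED prediction; SABLE prediction
Sequence-based profiles: PSI-BLAST against the NR sequence database
Template profile + query profile → Algorithm → Alignments → Models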
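[Figure: three gap models, each drawn as a gap cost G against gap length l (0, 1, 2, …), with "ss" and "coil" labels on the affine plot]

Affine gaps: fast, good for DB search (HMAP)
Arbitrary gaps: nonlinear gaps, structure-derived gaps (AS Yang - 2002)
Double-sided gaps (zigzag alignment): most flexible, potentially most costly (A Sali - 2006)

Example of a double-sided gap, aligning a b c d e with a b x y e:
    abcd--e
    ab--xye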
HMAP

[Figure: template and query profiles. Each template position carries a sequence profile (e.g. .01 .02 … 0.45 ... 0.02), a secondary structure state (H E C) with a confidence value (conf.), and gap open/extension penalties (e.g. 3.7, 0.3 and 12.8, 0.9). Each query position carries a sequence profile (e.g. .02 .08 … 0.25 ... 0.02) and a PSIPRED secondary structure prediction (H E C) with a confidence value continuously valued from [0..1].]

$S_{Q,T} = \mathrm{dot}[aa_Q, aa_T] \cdot \exp[W \cdot ss_{QT}]$

$ss_{QT} = 1 \cdot conf_Q$ if $ss_Q = ss_T$
$ss_{QT} = -0.5 \cdot conf_Q$ if $ss_Q \neq ss_T$

W = 0.5 (new opt value = 0.55)

$Z_{Q,T} = (S_{Q,T} - \mu) / \sigma$

$(G_I, G_E) = (3.7, 0.3)$ if $ss_T$ = coil
$(G_I, G_E) = (12.8, 0.9)$ if $ss_T \neq$ coil
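
The HMAP position score on this slide combines a profile dot product with an exponential secondary structure term. The sketch below follows those formulas directly; the function names, the use of "C" as the coil label, and the toy two-column profiles are assumptions made for illustration, and the mean and standard deviation for the Z-score are passed in by the caller because the slide does not say over which score distribution they are computed.

```python
import math


def ss_term(ss_q, ss_t, conf_q):
    """ssQT: +1 * confQ when the states agree, -0.5 * confQ otherwise."""
    return conf_q if ss_q == ss_t else -0.5 * conf_q


def hmap_score(aa_q, aa_t, ss_q, ss_t, conf_q, w=0.5):
    """S_QT = dot[aaQ, aaT] * exp[W * ssQT]."""
    dot = sum(q * t for q, t in zip(aa_q, aa_t))
    return dot * math.exp(w * ss_term(ss_q, ss_t, conf_q))


def z_score(s, mu, sigma):
    """Z_QT = (S_QT - mu) / sigma; mu and sigma are supplied by the caller."""
    return (s - mu) / sigma


def gap_penalties(ss_t):
    """(gap open, gap extension) by template state; 'C' for coil is assumed."""
    return (3.7, 0.3) if ss_t == "C" else (12.8, 0.9)


# Toy two-column profiles, hypothetical values.
agree = hmap_score([0.5, 0.5], [1.0, 0.0], "H", "H", conf_q=1.0)
differ = hmap_score([0.5, 0.5], [1.0, 0.0], "H", "E", conf_q=1.0)
print(agree, differ, gap_penalties("H"))
```

With the same profile overlap, agreement in secondary structure raises the score and disagreement lowers it, and the effect scales with the prediction confidence.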
Sparks scoring functions
• Sparks 2
    – Sequence-based profile-to-profile
    – Secondary structure prediction using PSIPRED (Jones) [+1/-1]
• SP3
    – Sparks 2 plus…
    – Residue-depth dependent profile
• SP4
    – SP3 plus…
    – Solvent accessibility prediction using SABLE (Adamczak, Porollo, Meller)


• Trained (parameterized)
    – using ProSup (Sippl; 2000) alignments
• Tests performed
    –   Fold recognition (FR) + Model building: Lindahl FR set
    –   FR + Model building: LiveBench 8 (MaxSub)
    –   FR + Model building: CASP7 (GDT Z-score)
    –   Alignment: Sali’s test set (200 pairs, 65% overlap, 3.5 Å) (TM overlap)
HMAP
   Sequence-based profile
   Secondary structure
   Affine gap penalty

GN2
   Sequence-based profile (AA)
   Contact number (CN)
   Secondary structure (SS)
   Hydrophilicity index (HI)
   Structure-derived gap penalty
       Geometric distance (GP = exp (D – 8Å))
       Hydrogen bonding
       Insertions more likely with small CN
       Deletions beg./end in same SS = impossible (very high GP)
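
The geometric part of the GN2 gap penalty, GP = exp (D – 8Å), can be sketched as follows. The slide does not define D; here it is assumed to be the distance in Å between the CA atoms of the two template residues flanking a deletion, and all names and coordinates are hypothetical.

```python
import math

CUTOFF_ANGSTROM = 8.0  # the 8 Å constant from GP = exp(D - 8 Å)


def geometric_gap_penalty(distance):
    """GP = exp(D - 8 Å), with D in Å."""
    return math.exp(distance - CUTOFF_ANGSTROM)


def flank_distance(ca_left, ca_right):
    """Assumed meaning of D: CA-CA distance between the flanking residues."""
    return math.dist(ca_left, ca_right)


# Hypothetical coordinates: flanks 5 Å apart versus 12 Å apart.
close = geometric_gap_penalty(flank_distance((0.0, 0.0, 0.0), (5.0, 0.0, 0.0)))
far = geometric_gap_penalty(flank_distance((0.0, 0.0, 0.0), (12.0, 0.0, 0.0)))
print(close, far)
```

Under this reading the penalty grows exponentially once the flanking residues are more than 8 Å apart, so deletions that would require bridging a long distance become very expensive.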
Log-likelihood ratios from structural alignments

SKA → make training alignments → count frequencies → convert to log-likelihood ratios

$\mathrm{LLR} = \log \left( \frac{f_{\mathrm{structure}}}{f_{\mathrm{random}}} \right)$

$S(i,j) = \mathrm{LLR}_0 + w_{AA}\,\mathrm{LLR}_{AA}(i,j) + w_{SS}\,\mathrm{LLR}_{SS}(i,j) + w_{CN}\,\mathrm{LLR}_{CN}(i,j) + w_{HI}\,\mathrm{LLR}_{HI}(i,j)$
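
The log-likelihood ratio and the weighted sum S (i,j) from this slide can be written out as a short sketch. The natural logarithm, the function names, and every numeric value below are assumptions for illustration; the slide gives neither the log base nor the fitted weights.

```python
import math


def log_likelihood_ratio(f_structure, f_random):
    """LLR = log(f_structure / f_random); natural log is assumed here."""
    return math.log(f_structure / f_random)


def combined_score(llr_terms, weights, llr0=0.0):
    """S(i,j) = LLR0 + sum over k of w_k * LLR_k(i,j), k in AA, SS, CN, HI."""
    return llr0 + sum(weights[k] * llr_terms[k] for k in weights)


# Hypothetical frequencies: a pairing seen in 30% of structurally aligned
# positions but expected in 10% at random.
llr_aa = log_likelihood_ratio(0.30, 0.10)

# Hypothetical per-term LLR values and weights for one position pair (i, j).
terms = {"AA": llr_aa, "SS": 0.5, "CN": -0.25, "HI": 0.0}
weights = {"AA": 1.0, "SS": 1.0, "CN": 1.0, "HI": 1.0}
print(combined_score(terms, weights, llr0=0.1))
```

A positive term means the pairing is seen more often in structural alignments than expected by chance, and the weights set how much each structural feature contributes to the final score.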
Log-odds substitution matrix for aligning SS to predicted SS (PSIPRED), based on structural alignment (SKA)
Should we use dot [ aaQ , aaT ] ?
Construction of a log-odds score based on the cos-angle function between profiles
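$\mathrm{CN} = 0.72 \sum_{\mathrm{CA\ atoms}} \frac{1}{r^2}$
Construction of a log-odds score based on contact number counts of structural alignments

$N_{\mathrm{weighted\ CN}} = \frac{1}{20} \sum_{\mathrm{CA\ atoms}} \frac{1}{(r / 3.8\,\text{Å})^2} = 0.72 \sum_{\mathrm{CA\ atoms}} \frac{1}{r^2}$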
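
A minimal sketch of the weighted contact number defined above, assuming the sum runs over all other CA atoms in the structure and that r is the CA-CA distance in Å. The function name and the three collinear test coordinates are hypothetical. The prefactor follows from 3.8² / 20 = 0.722, which is the 0.72 on the slide.

```python
import math


def weighted_contact_number(ca_coords, i):
    """N = (1/20) * sum over other CA atoms of 1 / (r / 3.8 Å)^2."""
    xi = ca_coords[i]
    total = 0.0
    for j, xj in enumerate(ca_coords):
        if j == i:
            continue  # assumption: a residue does not count itself
        r = math.dist(xi, xj)
        total += 1.0 / (r / 3.8) ** 2
    return total / 20.0


# Three hypothetical CA atoms spaced 3.8 Å apart on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
print(weighted_contact_number(coords, 0))
```

Because each neighbour contributes 1/r², nearby residues dominate the sum, so buried positions get large contact numbers and exposed positions get small ones.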
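[Figure: hydrophilicity index scale, amino acids ordered K, RE, QD, N, P, H, ST, GY, W, AMFLVIC]

$HI = \sum_{i}^{\mathrm{profile}} HI_i$
Construction of a log-odds score based on observed levels of HI agreement between the query (Q) and template (T)

[Figure: observed and fitted panels, both over the amino acid ordering K, RE, QD, N, P, H, ST, GY, W, AMFLVIC]

Fitted: $\exp\left( \exp\left( -\left| H_Q - H_T \right| \right) \right) \cdot \left( 0.75 + 0.3 \left| H_T - 0.22 \right| \right) - 1.8$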
Training and Benchmarking Sets

Test set construction (flowchart):
SCOP 1.71 all vs. all ( ska psd < 0.6, rmsd < 3.5 ) → 1M pairs
sort pairs by % sid ( from 0%, “devilish set” ), or re-order with 7.5% sid on top ( “difficult set” )
filter ( ali len > 80, % sid < 40, ska psd < 0.6 ) → 326k pairs
Any more pairs?
    no: Done! → test set (difficult: 995 pairs; devilish: 913 pairs)
    yes: take next top pair ( lowest % sid in list )
Scop family already in benchmark?
    yes: back to "Any more pairs?"
    no: add pair to benchmark, then back to "Any more pairs?"
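
The selection loop in this flowchart amounts to a greedy pass over the filtered, sorted pair list. The sketch below covers only the ordering from lowest % sid; the dictionary keys, the function name, and the rule that a pair is skipped when either of its two SCOP families is already represented are assumptions, since the slide only asks "Scop family already in benchmark?".

```python
def select_benchmark(pairs, min_len=80, max_sid=40.0, max_psd=0.6):
    """Greedy benchmark selection by ascending sequence identity.

    Each pair is a dict with hypothetical keys: 'id', 'sid' (% sequence
    identity), 'ali_len', 'psd', and 'families' (SCOP family identifiers
    of the two proteins).
    """
    # filter ( ali len > 80, % sid < 40, ska psd < 0.6 )
    kept = [p for p in pairs
            if p["ali_len"] > min_len
            and p["sid"] < max_sid
            and p["psd"] < max_psd]
    kept.sort(key=lambda p: p["sid"])  # lowest % sid on top

    seen_families = set()
    benchmark = []
    for p in kept:
        # Assumed rule: skip if either family is already in the benchmark.
        if seen_families & set(p["families"]):
            continue
        benchmark.append(p)
        seen_families.update(p["families"])
    return benchmark


pairs = [
    {"id": "p1", "sid": 12.0, "ali_len": 120, "psd": 0.3,
     "families": ("a.1.1.1", "a.1.1.2")},
    {"id": "p2", "sid": 5.0, "ali_len": 100, "psd": 0.5,
     "families": ("a.1.1.2", "a.1.1.3")},
    {"id": "p3", "sid": 8.0, "ali_len": 60, "psd": 0.2,
     "families": ("b.1.1.1", "b.1.1.2")},
    {"id": "p4", "sid": 20.0, "ali_len": 90, "psd": 0.4,
     "families": ("c.1.1.1", "c.1.2.1")},
]
selected_ids = [p["id"] for p in select_benchmark(pairs)]
print(selected_ids)
```

In this toy input, p3 is dropped by the length filter and p1 is skipped because one of its families was already added with p2.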
Training set construction (flowchart):
SCOP 1.71 pairs from all vs. all comparison
Any more pairs?
    no: Done! → list of protein pairs w/o sequence similarity to test set → make training set… (difficult: 238 pairs)
    yes: Blast against test set sequences
No e-value < 1?
    no: remove sequence pair, then back to "Any more pairs?"
    yes: back to "Any more pairs?"
Scop 1.71 Training set results
Summary of counts:
Class:       5
Fold:        102
Superfamily: 120
                     +148 folds represented once
Sequence Identity:
0 - 5%       30
5 - 10%      110
10 - 15%     48
15 - 20%     18
20 - 25%     11
25 - 30%     5
30 - 35%     7
35 - 40%     9
40 - 45%
45 - 100%
all: 238

Classes:
c    49
d     44
b     32
a     23
e     2
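Shift performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

[Figure: four panels, gn2 versus each of the other scoring functions. Panel counts: top row 39/31 and 54/18, bottom row 65/19 and 55/18.]
3 gn2 alignments with shift > 50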
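Qmod performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

[Figure: four panels, gn2 versus each of the other scoring functions. Panel counts: top row 129/94 and 136/85, bottom row 119/105 and 110/114.]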
Overall performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

Scoring function    Total shift    Residues aligned correctly
gn2                 1124           21,500
nalign              1179*          21,197
sparks2             1522           20,669
sp3                 1607           21,020
sp4                 1672           21,299*
Scop 1.71 Test set results
Summary of counts:
Class:       7
Fold:        341
Superfamily: 460
                     +230 folds represented once
Sequence Identity:
0 - 5%       72
5 - 10%      423
10 - 15%     148
15 - 20%     103
20 - 25%     90
25 - 30%     67
30 - 35%     42
35 - 40%     37
40 - 45%
45 - 100%
all:         995

Classes:
d            182
c            141
b            137
a            115
e            18
f            18
g            3
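Shift performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

[Figure: four panels, gn2 versus each of the other scoring functions. Panel counts: top row 136/111 and 159/112, bottom row 174/102 and 161/111.]
18 gn2 alignments with shift > 50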
Qmod performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

[Figure: four panels, gn2 versus each of the other scoring functions, with panel counts]
nalign: 524/342
sparks2: 544/344
sp3: 514/379
sp4: 489/408
Overall performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4

Scoring function    Total shift    Residues aligned correctly
gn2                 4718           110,289*
sparks2             4935*          109,328
sp4                 5038           111,351
nalign              5071           109,349
sp3                 5377           110,172

Scoring function    Total shift relative to best    Correctly aligned relative to best
                    (Wilcoxon test)                 (Wilcoxon test)
gn2                 0                               -1,062 (p < 0.22)
sparks2             +217 (p < 5*10^-4)              -2,023 (p < 5*10^-4)
sp4                 +320 (p < 5*10^-4)              0
nalign              +353 (p < 5*10^-4)              -2,002 (p < 1.36*10^-2)
sp3                 +659 (p < 5*10^-4)              -1,179 (p < 5*10^-4)
Spo0 set results
(Q = 1F51, T = Spo0 family)
141 alignments; panel count 74/29

Scoring function    Total shift    Residues aligned correctly
gn2                 1035 (-22%)    6283 (+13%)
nalign              1323           5547
Remarks

Apparent success of the LLR method, but some mysteries

Sali test set
          (next slide)

Performance is underestimated in alignments with structural repeats
        (next slide +1)

Need for looking at alternative structural alignments

Room for improvement
       E.g. adding FUGUE-like (Blundell) sequence-structure LLR
       -or- SABLE/SA prediction
       -or- IBR potential (Zhu)
Summary of counts:
Class: 7
Fold: 74
Superfamily: 86
NA: 2
                     +38 represented once

Sequence Identity:
0 - 5%             13
5 - 10%            11
10 - 15%           11
15 - 20%           16
20 - 25%           79
25 - 30%           60
30 - 35%           24
35 - 40%           12
40 - 45%           4
45 - 100%          2
all:               239
(note: psid calculated by ska)

Classes:
b 99
c 95
d 63
a    34
e 4
g 2
f 1




                                 Madhusudhan MS, Marti-Renom MA, Sanchez R, Sali A.
                                 Variable gap penalty for protein sequence-structure alignment.
                                 Protein Eng Des Sel. 2006 Mar;19(3):129-33
Caveat (example #1) from Training Set
[Figure: two panels labelled "Sparks score" and "SP3 score"]
What’s a scoring function?

[Figure: dynamic programming matrix with "a b b c d d d e f g" along the columns and "a b c d e f g" along the rows, shown with two example alignments]

    a b b c d d d e f g
    a b - c d - - e f g

    a b b c d d d e f g
    a b - - c d - e f g

$\max \sum S(\bullet) + \min \sum C(-)$, with similarity S and cost C < 0

Aims
Optimal alignment problem: native alignment scores best
Sampling suboptimal alignments: native alignment scores best & poor alignments kept at minimum
Structural information in protein sequence alignment accuracy

Structural information in protein sequence alignment accuracy

  • 1. Comparisons of Sequence Alignment Scoring Functions: On the Use of Structural Information to Improve Performance Feb 6, 2008
  • 2. What’s a scoring function? a b b c d d d e f g a b c d a b b c d d d e f g e f a b - c d - - e f g g a b b c d d d e f g a b - c c d - e f g - d - Optimal :: MAX ( ∑ S (•) - ∑C ( – ) ) similarity S cost C > 0 Aims Optimal alignment problem: Native alignment scores best SO alignment sampling problem: Native alignment scores best & Poor alignments kept at minimum & Avoid “unproductive” alignments
  • 3. Productive versus Unproductive Alignment Sampling AK AML YYY XXX A A XXXAAAYYY XXXAAADEFAAAYYY XXX--aYYY a XXX-AKLMA---YYY A A X XXXAAAYYY XX XXXAAADEFAAAYYY MLK YYY XXX--AKLMA--YYY XXX XXX-a-YYY Y YY A XXXAAAYYY XXXAAADEFAAAYYY XXXa--YYY XXX---AKLMA-YYY LKA YYY XXX MA Non-redundant (good) Redundant (not good)
  • 4. Classes of Methods for Sampling Suboptimal Alignments • Top-down Enumeration – Classical Waterman (Near-optimal alignments) Path > Opt-δ • Iterative Elimination (IE) – Waterman & Eggert – Saqi, Bates & Sternberg • Parametric Sampling (PS) – Chivian & Baker; 2006 • Combined IE + PS – Jaroszewski, Li & Godzik; 2002 sample over lots of … • Stochastic Sampling P (sim1,gap1,ss1) – John & Sali; 2003 P (sim2,gap1,ss1) … P (simn,gapn,ssn) • Fragment Set Approach (S4)
  • 5. Critical Questions Am I ranking the most native alignment first? Within the scope of the scoring function Am I eliminating poor/impossible alignments? Within the scope of Am I sampling efficiently/with little redundancy? alignment sampling New GN2 v. HMAP – sp2 – sp3 – sp4
  • 6. Organization Talk about software library for doing sequence alignment Talk about the HMAP and Sparks-family of scoring functions New method: GN2 Benchmark design & results
  • 7. T1 T2 T3 Q1 Q2 HMAP2 – STL in C++ (generic programming) Algorithm Evaluator Enumerator dynamic alignment Format pgram’ing set matrix [pair list] sparks? optimal HMAP gnoali gn2 S4 Waterman RC ? T HMAP Q = DPM aabbccdef aa---cdef Fasta, PIR aabbccdef (formatted ENU M DPM = AS ---aacdef output)
  • 8. [Diagram: profile construction]
      • Template Profile: primary sequence → sequence-based prof. (PSI BLAST, sequence database NR); secondary structure; structure → contact numbers, distances, HBs; residue depth / solvent accessibility → depth-dep. a.a. freq; hydrophilicity
      • Query Profile: primary sequence → sequence-based prof.; PSIPRED prediction; SABLE prediction
      • Template Profile + Query Profile → Algorithm → Alignments → Models
  • 9. [Example sequences: a b c d e and a b x y e]
          abcd--e
          ab--xye   (zigzag alignment)
      • Affine gaps: fast, good for DB search (HMAP)
      • Arbitrary gaps: nonlinear gaps, structure-derived gaps (AS Yang - 2002)
      • Double-sided gaps: most flexible, potentially most costly (A Sali - 2006)
      [Plots: gap cost G versus gap length 0, 1, 2 … l; curves labeled ss and coil]
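A minimal sketch of the affine gap model named on this slide. The convention used here (the first gapped position pays the opening cost and each further position pays the extension cost) is an assumption, since the slides do not spell it out. The example values 3.7/0.3 and 12.8/0.9 are the open/extension pairs shown on the HMAP slide.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap spanning `length` positions under an affine model.

    Assumed convention: cost = gap_open + gap_extend * (length - 1).
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# A three-residue gap with the coil and non-coil parameters.
coil_cost = affine_gap_cost(3, gap_open=3.7, gap_extend=0.3)
non_coil_cost = affine_gap_cost(3, gap_open=12.8, gap_extend=0.9)
```

The cost grows linearly with gap length, which is what keeps the dynamic programming recursion cheap compared with arbitrary gap functions.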
  • 10. HMAP
      • Template profile (one row per position): sequence profile (e.g. .01 .02 … 0.45 ... 0.02); secondary structure H, E, C (0/1 indicators); conf.; gap open, extn (e.g. 3.7, 0.3 and 12.8, 0.9)
      • Query profile (PSIPRED): sequence profile (e.g. .02 .08 … 0.25 ... 0.02); secondary structure H, E, C; conf. = continuously valued from [0..1]
      • S(Q,T) = dot [ aaQ , aaT ] * exp [ W * ssQT ]
      • ssQT = 1 * confQ : if ssQ = ssT
               -0.5 * confQ : if ssQ ≠ ssT
      • W = 0.5 (new opt value = 0.55)
      • Z(Q,T) = ( S(Q,T) - µ ) / σ
      • GI, GE = 3.7, 0.3 : if ssT = coil
                 12.8, 0.9 : if ssT ≠ coil
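The formulas on this slide can be sketched as follows. This is an illustration under stated assumptions, not the HMAP implementation: the function names are hypothetical, secondary structure states are encoded as the letters H, E, C, the toy profiles have 3 columns instead of 20, and µ and σ are computed over whatever list of scores is passed in (the slide does not say which distribution they come from).

```python
import math

W = 0.5  # weight shown on the slide (a newer optimum of 0.55 is also mentioned)


def ss_term(ss_q, ss_t, conf_q):
    """ssQT: +1 * confQ when the states agree, -0.5 * confQ otherwise."""
    return conf_q if ss_q == ss_t else -0.5 * conf_q


def hmap_score(aa_q, aa_t, ss_q, ss_t, conf_q, w=W):
    """S(Q,T) = dot[aaQ, aaT] * exp[W * ssQT] for one query/template position pair."""
    dot = sum(q * t for q, t in zip(aa_q, aa_t))
    return dot * math.exp(w * ss_term(ss_q, ss_t, conf_q))


def z_scores(scores):
    """Z = (S - mu) / sigma, with mu and sigma taken over `scores` (an assumption)."""
    mu = sum(scores) / len(scores)
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))
    if sigma == 0:
        raise ValueError("all scores are identical; Z is undefined")
    return [(s - mu) / sigma for s in scores]


def gap_parameters(ss_t):
    """(open, extension) chosen by template secondary structure, values from the slide."""
    return (3.7, 0.3) if ss_t == "C" else (12.8, 0.9)


# Toy 3-column profiles; real profiles have one column per amino acid.
aa_q = [0.6, 0.3, 0.1]
aa_t = [0.5, 0.4, 0.1]
agree = hmap_score(aa_q, aa_t, "H", "H", conf_q=0.8)
disagree = hmap_score(aa_q, aa_t, "H", "E", conf_q=0.8)
```

With identical profiles, agreement in secondary structure scales the dot product up and disagreement scales it down, and the effect shrinks as the PSIPRED confidence approaches 0.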
  • 11. Sparks scoring functions
      • Sparks 2
          – Sequence-based profile-to-profile
          – Secondary structure prediction using PSIPRED (Jones) [+1/-1]
      • SP3
          – Sparks 2 plus…
          – Residue-depth dependent profile
      • SP4
          – SP3 plus…
          – Solvent accessibility prediction using SABLE (Adamczak, Porollo, Meller)
      • Trained (parameterized)
          – using ProSup (Sippl; 2000) alignments
      • Tests performed
          – Fold recognition (FR) + Model building: Lindahl FR set
          – FR + Model building: LiveBench 8 (MaxSub)
          – FR + Model building: CASP7 (GDT Z-score)
          – Alignment: Sali’s test set (200 pairs, 65% overlap, 3.5 Å) (TM overlap)
  • 12. HMAP versus GN2
      • HMAP
          – Sequence-based profile
          – Secondary structure
          – Affine gap penalty
      • GN2
          – Sequence-based profile (AA)
          – Contact number (CN)
          – Secondary structure (SS)
          – Hydrophilicity index (HI)
          – Structure-derived gap penalty: geometric distance (GP = exp (D – 8Å)); hydrogen bonding; insertions more likely with small CN; deletions beg./end in same SS = impossible (very high GP)
  • 13. Log-likelihood ratios from structural alignments
      • Make training alignments (SKA)
      • Count frequencies
      • Convert to log-likelihood ratios: LLR = log ( f_structure / f_random )
      • S (i,j) = LLR0 + wAA * LLRAA (i,j) + wSS * LLRSS (i,j) + wCN * LLRCN (i,j) + wHI * LLRHI (i,j)
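A small sketch of the two formulas on this slide: turning counted frequencies into a log-likelihood ratio, and combining the four per-feature LLRs into S(i,j). The natural logarithm, the function names, the toy counts, and the unit weights are assumptions for illustration only; the talk does not give the trained weights.

```python
import math


def log_likelihood_ratio(count_structure, total_structure, count_random, total_random):
    """LLR = log(f_structure / f_random), with frequencies estimated from counts."""
    f_structure = count_structure / total_structure
    f_random = count_random / total_random
    return math.log(f_structure / f_random)


def combined_score(llr, weights, llr0=0.0):
    """S(i,j) = LLR0 + wAA*LLR_AA + wSS*LLR_SS + wCN*LLR_CN + wHI*LLR_HI."""
    return llr0 + sum(weights[k] * llr[k] for k in ("AA", "SS", "CN", "HI"))


# A feature pairing seen in 20 of 100 structurally aligned positions
# but in only 10 of 100 random pairings is favoured by log(2).
example_llr = log_likelihood_ratio(20, 100, 10, 100)

# Hypothetical per-feature LLRs and weights for one position pair (i, j).
llr_ij = {"AA": 0.5, "SS": 1.0, "CN": -0.2, "HI": 0.1}
weights = {"AA": 1.0, "SS": 1.0, "CN": 1.0, "HI": 1.0}
s_ij = combined_score(llr_ij, weights)
```

Positive LLRs mark pairings that occur more often in structural alignments than by chance, so they raise the alignment score; negative ones lower it.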
  • 14. Log-odds substitution matrix for aligning SS to predicted SS (PSIPRED), based on structural alignment (SKA)
  • 15. Should we use dot [ aaQ , aaT ] ?
  • 16. Construction of a log-odds score based on the cos-angle function between profiles
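To make the contrast with the plain dot product concrete, here is a minimal sketch of the cosine of the angle between two profile columns. The function names and the 4-column toy profiles are assumptions; how the talk bins the cosine values into log-odds scores is not shown on the slide and is not reproduced here.

```python
import math


def dot(p, q):
    """Plain dot product between two profile columns."""
    return sum(a * b for a, b in zip(p, q))


def cos_angle(p, q):
    """Cosine of the angle between two profile columns."""
    norm = math.sqrt(dot(p, p)) * math.sqrt(dot(q, q))
    if norm == 0:
        raise ValueError("cosine is undefined for an all-zero profile")
    return dot(p, q) / norm


# Toy 4-column profiles: one flat, one sharply peaked.
flat = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

dot_flat, cos_flat = dot(flat, flat), cos_angle(flat, flat)
dot_peaked, cos_peaked = dot(peaked, peaked), cos_angle(peaked, peaked)
```

Comparing a column with itself gives a cosine of 1 in both cases, whereas the dot product is much smaller for the flat column than for the peaked one, so the cosine removes the dependence on how concentrated a profile is.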
  • 17. CN = 0.72 ∑ ( 1 / r² ), summed over CA atoms
  • 18. Construction of a log-odds score based on contact number counts of structural alignments
      weighted_CN = ( 1 / 20 ) ∑ 1 / ( r / 3.8 Å )² = 0.72 ∑ ( 1 / r² ), both sums over CA atoms
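The weighted contact number can be sketched directly from the formula. The assumptions here are that the sum runs over all other CA atoms in the chain with no distance cutoff, that coordinates are in Å, and that the function name and the three-atom toy chain are purely illustrative.

```python
def weighted_contact_number(ca_coords, index):
    """weighted_CN = (1/20) * sum over other CA atoms of 1 / (r / 3.8)^2."""
    xi = ca_coords[index]
    total = 0.0
    for j, xj in enumerate(ca_coords):
        if j == index:
            continue  # skip the residue itself (r = 0)
        r_squared = sum((a - b) ** 2 for a, b in zip(xi, xj))
        total += 1.0 / (r_squared / 3.8 ** 2)
    return total / 20.0


# Toy chain: three CA atoms on a line, spaced 3.8 Å apart.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
cn_middle = weighted_contact_number(chain, 1)  # two neighbours at 3.8 Å
cn_end = weighted_contact_number(chain, 0)     # neighbours at 3.8 Å and 7.6 Å
```

The prefactor 0.72 on the slide follows from 3.8² / 20 ≈ 0.722, so both forms of the formula agree to within rounding.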
  • 19. Hydrophilicity index [Figure: amino acids placed along the hydrophilicity index axis in the groups K, RE, QD, N, P, H, ST, GY, W, AMFLVIC] profile HI = ∑_i HI_i
  • 20. Construction of a log-odds score based on observed levels of HI agreement between the Q & T [Figure: observed and fitted scores; amino acid groups K, RE, QD, N, P, H, ST, GY, W, AMFLVIC]
      Fitted: exp ( exp ( − abs ( H_Q − H_T ) ) ) ⋅ ( .75 + .3 * abs ( H_T − .22 ) ) − 1.8
  • 22. Test set construction [Flowchart]
      • SCOP 1.71 all vs. all ( skan psd < 0.6, rmsd < 3.5 ) → 1M pairs
      • filter ( ali len > 80, % sid < 40, ska psd < 0.6 ) → 326k pairs
      • sort pairs by % sid ( from 0%, “devilish set” ), or re-order, 7.5% sid on top ( “difficult set” )
      • Any more pairs?
          – no: Done! → test set (difficult: 995 pairs; devilish: 913 pairs)
          – yes: take next top pair ( lowest % sid in list )
      • Scop family already in benchmark?
          – yes: go back to “Any more pairs?”
          – no: add pair to benchmark
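The flowchart amounts to a filter followed by a greedy pass over the sorted pair list. The sketch below assumes hypothetical record fields (`id`, `sid`, `ali_len`, `psd`, `families`) and assumes that a pair is skipped when either of its SCOP families is already represented; the slide does not say how pairs spanning two families are handled.

```python
def build_benchmark(pairs, min_ali_len=80, max_sid=40.0, max_psd=0.6):
    """Filter, sort by % sequence identity, then keep one pair per SCOP family."""
    candidates = [
        p for p in pairs
        if p["ali_len"] > min_ali_len and p["sid"] < max_sid and p["psd"] < max_psd
    ]
    candidates.sort(key=lambda p: p["sid"])  # lowest % sid first

    benchmark = []
    families_used = set()
    for p in candidates:
        if families_used & set(p["families"]):
            continue  # family already in benchmark: take the next pair
        benchmark.append(p["id"])
        families_used.update(p["families"])
    return benchmark


# Toy input; identifiers and values are made up for illustration.
toy_pairs = [
    {"id": "p3", "sid": 8.0, "ali_len": 100, "psd": 0.5, "families": ("a.1.1.1", "b.2.1.1")},
    {"id": "p1", "sid": 3.0, "ali_len": 120, "psd": 0.4, "families": ("a.1.1.1", "a.1.1.2")},
    {"id": "p5", "sid": 45.0, "ali_len": 150, "psd": 0.2, "families": ("d.4.1.1", "d.4.1.1")},
    {"id": "p2", "sid": 5.0, "ali_len": 60, "psd": 0.3, "families": ("e.5.1.1", "e.5.1.1")},
    {"id": "p4", "sid": 12.0, "ali_len": 90, "psd": 0.3, "families": ("c.3.1.1", "c.3.1.1")},
]
selected = build_benchmark(toy_pairs)
```

Here `p2` and `p5` fail the filter, `p3` is skipped because family a.1.1.1 was already taken by `p1`, and the remaining pairs are kept in order of increasing sequence identity.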
  • 23. Training set construction [Flowchart]
      • SCOP 1.71 pairs from all vs. all comparison
      • Any more pairs?
          – no: Done! → make training set… difficult: 238 pairs
          – yes: Blast against test set sequences
      • No e-value < 1?
          – yes: add to list of protein pairs w/o sequence similarity to test set
          – no: remove sequence pair
  • 24. SCOP 1.71 Training set results
  • 25. Summary of counts:
      Class: 5   Fold: 102   Superfamily: 120   (+148 folds represented once)
      Sequence Identity:
          0 - 5%       30
          5 - 10%     110
          10 - 15%     48
          15 - 20%     18
          20 - 25%     11
          25 - 30%      5
          30 - 35%      7
          35 - 40%      9
          40 - 45%
          45 - 100%
          all:        238
      Classes: c 49, d 44, b 32, a 23, e 2
  • 26. Shift performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4: 39/31, 54/18, 65/19, 55/18. 3 gn2 alignments with shift > 50
  • 27. Qmod performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4: 129/94, 136/85, 119/105, 110/114
  • 28. Overall performance / Training data (238 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4
      Scoring function   Total shift   Residues aligned correctly
      gn2                1124          21,500
      nalign             1179*         21,197
      sparks2            1522          20,669
      sp3                1607          21,020
      sp4                1672          21,299*
  • 29. SCOP 1.71 Test set results
  • 30. Summary of counts:
      Class: 7   Fold: 341   Superfamily: 460   (+230 folds represented once)
      Sequence Identity:
          0 - 5%       72
          5 - 10%     423
          10 - 15%    148
          15 - 20%    103
          20 - 25%     90
          25 - 30%     67
          30 - 35%     42
          35 - 40%     37
          40 - 45%
          45 - 100%
          all:        995
      Classes: d 182, c 141, b 137, a 115, e 18, f 18, g 3
  • 31. Shift performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4: 136/111, 159/112, 174/102, 161/111. 18 gn2 alignments with shift > 50
  • 32. Qmod performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4: nalign 524/342; sparks2 544/344; sp3 514/379; sp4 489/408
  • 33. Overall performance / Test data (995 pairs) / gn2 vs nalign – sparks2 – sp3 – sp4
      Scoring function   Total shift   Residues aligned correctly
      gn2                4718          110,289*
      sparks2            4935*         109,328
      sp4                5038          111,351
      nalign             5071          109,349
      sp3                5377          110,172

      Scoring function   Total shift relative to best (Wilcoxon test)   Correctly aligned relative to best (Wilcoxon test)
      gn2                0                                              -1,062 (p < 0.22)
      sparks2            +217 (p < 5*10^-4)                             -2,023 (p < 5*10^-4)
      sp4                +320 (p < 5*10^-4)                             0
      nalign             +353 (p < 5*10^-4)                             -2,002 (p < 1.36*10^-2)
      sp3                +659 (p < 5*10^-4)                             -1,179 (p < 5*10^-4)
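The "relative to best" columns can be reproduced from the totals on this slide with simple arithmetic, as sketched below. The variable names are illustrative; the assumption is that "best" means the lowest total shift and the highest number of correctly aligned residues. The Wilcoxon p-values need the per-pair results, which are not in the slides, so they are not recomputed here.

```python
# Totals copied from the test set table (995 pairs).
total_shift = {"gn2": 4718, "sparks2": 4935, "sp4": 5038, "nalign": 5071, "sp3": 5377}
aligned_correctly = {"gn2": 110289, "sparks2": 109328, "sp4": 111351,
                     "nalign": 109349, "sp3": 110172}

best_shift = min(total_shift.values())          # lower total shift is better
best_aligned = max(aligned_correctly.values())  # more correct residues is better

shift_vs_best = {name: value - best_shift for name, value in total_shift.items()}
aligned_vs_best = {name: value - best_aligned for name, value in aligned_correctly.items()}

for name in total_shift:
    print(f"{name:8s} {shift_vs_best[name]:+5d} {aligned_vs_best[name]:+7d}")
```

The differences match the second table: gn2 has the lowest total shift, and sp4 has the most correctly aligned residues.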
  • 34. Spo0 set results (Q = 1F51, T = Spo0 family)
  • 35. 141 ali’s; 74/29
      Scoring function   Total shift    Residues aligned correctly
      gn2                1035 (-22%)    6283 (+13%)
      nalign             1323           5547
  • 36. Remarks
      • Apparent success of the LLR method, but some mysteries
          – Sali test set (next slide)
          – Performance is underestimated in alignments with structural repeats (next slide +1)
      • Need for looking at alternative structural alignments
      • Room for improvement
          – E.g. adding FUGUE-like (Blundell) sequence-structure LLR -or- SABLE/SA prediction -or- IBR potential (Zhu)
  • 37. Summary of counts:
      Class: 7   Fold: 74   Superfamily: 86   NA: 2   (+38 represented once)
      Sequence Identity (note: psid calculated by ska):
          0 - 5%       13
          5 - 10%      11
          10 - 15%     11
          15 - 20%     16
          20 - 25%     79
          25 - 30%     60
          30 - 35%     24
          35 - 40%     12
          40 - 45%      4
          45 - 100%     2
          all:        239
      Classes: b 99, c 95, d 63, a 34, e 4, g 2, f 1
      Madhusudhan MS, Marti-Renom MA, Sanchez R, Sali A. Variable gap penalty for protein sequence-structure alignment. Protein Eng Des Sel. 2006 Mar;19(3):129-33
  • 38. Caveat (example #1) from Training Set
  • 41. What’s a scoring function?
      [Figure: dynamic programming matrix for the sequences "a b b c d d d e f g" and "a b c d e f g", with two example alignments]
          a b b c d d d e f g
          a b - c d - - e f g

          a b b c d d d e f g
          a b - - c d - e f g
      max ∑ S (•) + min ∑ C ( – ), with similarity S and cost C < 0
      Aims
      • Optimal alignment problem: native alignment scores best
      • Sampling suboptimal alignments: native alignment scores best & poor alignments kept at minimum