SlideShare a Scribd company logo
1 of 31
Quasi-Monte Carlo integration for
the fast and effective generation of
   molecular shape fingerprints
               John D. MacCuish
              Norah E. MacCuish,
     Michael Hawrylycz, and Mitch Chapman

            ACS San Francisco 2010
             COMP Drug Discovery


           john.maccuish@mesaac.com
Outline
• Why Shape? Why Shape Fingerprints?
• Prior Work




                                  1.00
                                  0.90
                                  0.80
                                  0.70
                                  0.60
• quasi-Monte Carlo Integration




                                  0.50
                                  0.40
                                  0.30
                                  0.20
                                  0.10
                                  0.00
                                         0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00




• Shape Fingerprint Generation
• Performance, Examples
• Future Work
Why Shape?
• Shape is a component in binding...
  •   3D QSAR
  •   similarity searching
  •   compound acquisition (shape diversity)
  •   virtual screening...

• Can be confounding -- multiple
  binding sites, surface binding...
• One more tool in the toolbox...
• Boost to 2D
  • Pluses and minuses: Molecular Shape and Medicinal Chemistry:
      A Perspective, JMedChem, Feb. 2010, Nicholls, et al
Why Shape Fingerprints?

• Compute shape comparisons
  efficiently on a large scale

• Efficient storage
• Simple.
Prior Work
• Shape as a mixture of spherical
  gaussians (atoms) (Grant and
  Pickup...’96)
• Fingerprints composed of key
  shapes, given the above method
  (Haigh, et al...’05).
• Ultrafast Shape Recognition:
  Ballester, Richards 2007
• ShaEP, Vainio, et al 2009
Monte Carlo Integration
• Approximate an integral (e.g., a 3D
  volume) by random sampling
• Approximate volume of odd shapes or
  manifolds that are analytically difficult
  or have no closed form solution.
• Better error convergence than grid
  sampling (Metropolis).
quasi-Monte Carlo
   Integration (QMC)
• In practice, quasi-randomly generated
  points have best error convergence in
  low dimensions.
• Beats uniform random sampling (e.g.,
  pseudo, dart throwing, etc.)
• QMC became popular in early-90s in
  computer graphics (Shirley, ’91) mid-90s
  in finance - options/futures pricing
  (Morokoff, Caflisch, ’95)
Approximate Volume
                                                                                                                                                                                 quasi-random
                                   grid                                                           pseudo-random                                                                (low discrepancy)                                                            P
1.00




                                                                                    1.00




                                                                                                                                                                        1.00
0.90




                                                                                    0.90




                                                                                                                                                                        0.90
0.80




                                                                                    0.80




                                                                                                                                                                        0.80
0.70




                                                                                    0.70




                                                                                                                                                                        0.70
0.60




                                                                                    0.60




                                                                                                                                                                        0.60
0.50




                                                                                    0.50




                                                                                                                                                                        0.50
0.40




                                                                                    0.40




                                                                                                                                                                        0.40
0.30




                                                                                    0.30




                                                                                                                                                                        0.30
0.20




                                                                                    0.20




                                                                                                                                                                        0.20
0.10




                                                                                    0.10




                                                                                                                                                                        0.10
0.00




                                                                                    0.00




                                                                                                                                                                        0.00
                                                                                                                                                                               0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00
       0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00          0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00



        Error related to the                                                                 Error related to just                                                             Error related to just
         number of points                                                                     number of points                                                                  number of points
        and the dimension
Approximation Error
• grid -- for error ε, need 1/εd grid points
  -- error convergence exponential in
  the dimension
• pseudo -- 1⁄√n convergence
• quasi -- 1/n convergence in practice
  in low dimensions
• quasi-random fewest points needed
  for equivalent approximation error.
Approximating the
Volume of a molecule
• E.g., CPK with van der Waals radii --
  Set of intersecting spheres
• Find a suitable bounding region for
  molecules
• Point sample the bounding region
  and tally up the points either inside
  or outside of the atom spheres.
Approximating the
Volume of a molecule
• Bounding Volume times (Points
  Inside)/(All Points) ~= molecular
  volume.

• Sphere or scalene ellipsoid as a
  bounding region reduces total overall
  points.
Volume Percent Error
   95% below    95% below
                             4961 low energy
   4.5% error   3.6% error
                             conformations
                             1357 Molecules
                             100-550 MW
                             flexible, inflexible

                            balloon
                            conformations
  95% below      95% below 1-15 conformations
   2% error      1.3% error
Fingerprint Generation
     Preliminaries
• Need a bounding region that covers the
  conformational space of the database of
  small molecule conformations.

• Cube? Sphere? Scalene Ellipsoid.
• Need orientation and alignment.
Fewer points
                          1.00
                          0.90
                          0.80
                          0.70
                          0.60




Don’t need these points                                                                                       Don’t need these points
                          0.50
                          0.40
                          0.30
                          0.20
                          0.10
                          0.00




                                 0.00   0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00
Fingerprint Generation
1. Generate a sample quasi-point set bounding
   region centered and fixed at (0,0,0); Sphere of
   ~11 Å radius. This is fixed, done once.
2. Mean center atom centers point set of a
   conformation to be fingerprinted.
3. Find sample points in the atom spheres (or
   function or your choice, e.g., solvent accessible
   surface ...)
4. Find the principal axes for a shape using the
   sampled points and the atom centers with SVD.
   Adjust for SVD sign ambiguity (Bro, et al, 2008).
Fingerprint Generation
5. Rotate atom centers point set to Principal
   Axes (PA)
6. Find points in PA configuration and
   create fingerprint with 4 orientations --
   points in molecule ‘1’, points out ‘0’
7. Subfingerprints, each the length of the
   number of points.
Molecule in a Volume
of quasi-Random points
Four configurations
Alignment Overlay
Alignment Overlay
Shape Fingerprints
• 4 fingerprints per shape. e.g., 10,240
  X 4 bits = 5.12 Kbytes.

• Number and density of points
  determines resolution, speed of
  comparison, and storage size.

• Number of bits on is small, typically
  3-10% with CPK model. Fingerprints
  compress significantly.
Fingerprint Similarity

• Choose 1 subFP of Shape FP A and compare it
  with all 4 subFPs of Shape FP B.

• The largest similarity comparison is chosen.
• Bit difference per subFP is small in a given FP
Fingerprint Similarity
• Inherent slight asymmetry of alignment of
  point sets of different sizes.

• Aligning B to A’s principal axes, vs aligning A
  to B’s principal axes.

• Try this on your favorite method...
• Use bit string similarity measure of your
  choice... Tanimoto, Ochai (Cosine), ...,
  Tversky,... Baroni-Urbani/Bush...
Performance
• With ~10K points and including IO:
• Fingerprint generation: ~500K
   conformations per hour

• Fingerprint Tanimoto comparisons:
   130KX130K matrix in 24 hours

• Large scale similarity searching trivial,
   e.g., 10X1M < 10 minutes
• Numbers estimated with 2009 iMac, with an Intel Core i7 2.8GHz processor
   and 4GB RAM, running Mac OS X 10.6.2; small to large compounds;
Space!
• 5.12 Kbytes per fingerprint -- ~90%
  compression

• 2 million conformations = 1+ GBs
• Space and CPU improving all of the
  time...
FP Experiments
         Typical all-pairs
       Tanimoto similarity
           distribution
      1357, 2D 768 MACCS
        Keys fingerprints

        Where the action is:
        Similarity searching,
          Clustering, etc.
             above 0.7
FP Experiments
All-pairs ShapeFingerprints
        4961 confs.


      Where the action is:
      Similarity searching,
                              Still values above 0.7
           Clustering,
        Alignment, etc.
         above ~0.65
Examples
       • Multi-conformation set of actives in a
           secondary screen in ROCK2 from
           PubChem

       • Xray ROCK2 ligand
       • Fingerprint shape cluster
       • Analyze shape cluster that contains the
           xray structure with ChemTattoo 3D
“Fragment database analysis using molecular shape fingerprints”,
        ACS San Francisco, Wednesday 8:40 , CINF 106
Shape Clustering
Example - ROCK2
Future Work
• Other shape functions
• Other low discrepancy point sets in
  different distributions -- density
  where it is needed.

• Speed (time) and compression
  (space)

• More practical applications
Acknowledgments
• JMol

• Balloon

• OpenBabel

• R
• PubChem

• PDB

            john.maccuish@mesaac.com

More Related Content

What's hot

Quacera Dow Component Early Warning Report May 10, 2012
Quacera Dow Component Early Warning Report May 10, 2012Quacera Dow Component Early Warning Report May 10, 2012
Quacera Dow Component Early Warning Report May 10, 2012Quacera L.L.C.
 
Wavelet Transform - Intro
Wavelet Transform - IntroWavelet Transform - Intro
Wavelet Transform - IntroImane Haf
 
TV Monthly Report Mar 2011 (Pakistan)
TV Monthly Report Mar 2011 (Pakistan)TV Monthly Report Mar 2011 (Pakistan)
TV Monthly Report Mar 2011 (Pakistan)Ejaz Asi
 
TV Monthly Report Feb 2011 (Pakistan)
TV Monthly Report Feb 2011 (Pakistan)TV Monthly Report Feb 2011 (Pakistan)
TV Monthly Report Feb 2011 (Pakistan)Ejaz Asi
 
G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)
G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)
G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)PaulSteendijk
 
Shape from Distortion - 3D Digitization
Shape from Distortion - 3D DigitizationShape from Distortion - 3D Digitization
Shape from Distortion - 3D DigitizationVanya Valindria
 
Quacera Dow Jones Early Warning Reports Sept 28, 2012
Quacera Dow Jones Early Warning Reports Sept 28, 2012Quacera Dow Jones Early Warning Reports Sept 28, 2012
Quacera Dow Jones Early Warning Reports Sept 28, 2012Quacera L.L.C.
 
Ispra 2007 luis martín2
Ispra 2007 luis martín2Ispra 2007 luis martín2
Ispra 2007 luis martín2IrSOLaV Pomares
 
Mileage Log 2009
Mileage Log 2009Mileage Log 2009
Mileage Log 2009guest78aa88
 
Visual data mining with HeatMiner
Visual data mining with HeatMinerVisual data mining with HeatMiner
Visual data mining with HeatMinerCloudNSci
 

What's hot (10)

Quacera Dow Component Early Warning Report May 10, 2012
Quacera Dow Component Early Warning Report May 10, 2012Quacera Dow Component Early Warning Report May 10, 2012
Quacera Dow Component Early Warning Report May 10, 2012
 
Wavelet Transform - Intro
Wavelet Transform - IntroWavelet Transform - Intro
Wavelet Transform - Intro
 
TV Monthly Report Mar 2011 (Pakistan)
TV Monthly Report Mar 2011 (Pakistan)TV Monthly Report Mar 2011 (Pakistan)
TV Monthly Report Mar 2011 (Pakistan)
 
TV Monthly Report Feb 2011 (Pakistan)
TV Monthly Report Feb 2011 (Pakistan)TV Monthly Report Feb 2011 (Pakistan)
TV Monthly Report Feb 2011 (Pakistan)
 
G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)
G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)
G1 hb 2011 2012 23 excitation-contraction coupling (steendijk)
 
Shape from Distortion - 3D Digitization
Shape from Distortion - 3D DigitizationShape from Distortion - 3D Digitization
Shape from Distortion - 3D Digitization
 
Quacera Dow Jones Early Warning Reports Sept 28, 2012
Quacera Dow Jones Early Warning Reports Sept 28, 2012Quacera Dow Jones Early Warning Reports Sept 28, 2012
Quacera Dow Jones Early Warning Reports Sept 28, 2012
 
Ispra 2007 luis martín2
Ispra 2007 luis martín2Ispra 2007 luis martín2
Ispra 2007 luis martín2
 
Mileage Log 2009
Mileage Log 2009Mileage Log 2009
Mileage Log 2009
 
Visual data mining with HeatMiner
Visual data mining with HeatMinerVisual data mining with HeatMiner
Visual data mining with HeatMiner
 

Similar to QMC-based Shape Fingerprints

Fctcp Chw Trends 2 12 2007 10 2
Fctcp Chw Trends 2 12 2007 10 2Fctcp Chw Trends 2 12 2007 10 2
Fctcp Chw Trends 2 12 2007 10 2guest6ac2d0
 
Bms 9to5 worksched1
Bms 9to5 worksched1Bms 9to5 worksched1
Bms 9to5 worksched1Confidential
 
Miltenburg M Sc Presentation Tu Delft
Miltenburg   M Sc Presentation Tu DelftMiltenburg   M Sc Presentation Tu Delft
Miltenburg M Sc Presentation Tu Delftimiltenburg
 
Combo exceltimesheet
Combo exceltimesheetCombo exceltimesheet
Combo exceltimesheetConfidential
 
Combo exceltimesheet 1
Combo exceltimesheet 1Combo exceltimesheet 1
Combo exceltimesheet 1Confidential
 
Copy of combo exceltimesheet
Copy of combo exceltimesheetCopy of combo exceltimesheet
Copy of combo exceltimesheetConfidential
 
Combo exceltimesheet
Combo exceltimesheetCombo exceltimesheet
Combo exceltimesheetConfidential
 
Travel expense report (online form)
Travel expense report (online form)Travel expense report (online form)
Travel expense report (online form)Confidential
 
Apos11 defining dt cut-offs using function [p p1-52 fri]
Apos11    defining dt cut-offs using function [p p1-52 fri]Apos11    defining dt cut-offs using function [p p1-52 fri]
Apos11 defining dt cut-offs using function [p p1-52 fri]AJMitchell_Posters
 
Fctcp Chw Trends 2 1 2007 10 7
Fctcp Chw Trends 2 1 2007 10 7Fctcp Chw Trends 2 1 2007 10 7
Fctcp Chw Trends 2 1 2007 10 7guest6ac2d0
 
Fctcp Chw Trends 2 13 2007 10 2
Fctcp Chw Trends 2 13 2007 10 2Fctcp Chw Trends 2 13 2007 10 2
Fctcp Chw Trends 2 13 2007 10 2guest6ac2d0
 
Everoil Presentation June 09
Everoil Presentation June 09Everoil Presentation June 09
Everoil Presentation June 09viewwithin
 
Site Speed Tuneup: Putting Your Code On A Diet
Site Speed Tuneup: Putting Your Code On A DietSite Speed Tuneup: Putting Your Code On A Diet
Site Speed Tuneup: Putting Your Code On A Dietmm_merchant
 
All india gsm figures july 2012 final (2)
All india gsm figures july 2012   final (2)All india gsm figures july 2012   final (2)
All india gsm figures july 2012 final (2)vineet agarwal
 
Fctcp Chw Trends 2 13 2007 18 3
Fctcp Chw Trends 2 13 2007 18 3Fctcp Chw Trends 2 13 2007 18 3
Fctcp Chw Trends 2 13 2007 18 3guest6ac2d0
 
Wait, It's Actually SOCIAL Media?
Wait, It's Actually SOCIAL Media?Wait, It's Actually SOCIAL Media?
Wait, It's Actually SOCIAL Media?Robert Murray
 
Presentación1
Presentación1Presentación1
Presentación1gmjendav
 
Flevy.com - Paycheck Stub Template
Flevy.com - Paycheck Stub TemplateFlevy.com - Paycheck Stub Template
Flevy.com - Paycheck Stub TemplateDavid Tracy
 

Similar to QMC-based Shape Fingerprints (20)

Fctcp Chw Trends 2 12 2007 10 2
Fctcp Chw Trends 2 12 2007 10 2Fctcp Chw Trends 2 12 2007 10 2
Fctcp Chw Trends 2 12 2007 10 2
 
Bms 9to5 worksched1
Bms 9to5 worksched1Bms 9to5 worksched1
Bms 9to5 worksched1
 
Miltenburg M Sc Presentation Tu Delft
Miltenburg   M Sc Presentation Tu DelftMiltenburg   M Sc Presentation Tu Delft
Miltenburg M Sc Presentation Tu Delft
 
Combo exceltimesheet
Combo exceltimesheetCombo exceltimesheet
Combo exceltimesheet
 
Combo exceltimesheet 1
Combo exceltimesheet 1Combo exceltimesheet 1
Combo exceltimesheet 1
 
Copy of combo exceltimesheet
Copy of combo exceltimesheetCopy of combo exceltimesheet
Copy of combo exceltimesheet
 
Combo exceltimesheet
Combo exceltimesheetCombo exceltimesheet
Combo exceltimesheet
 
Travel expense report (online form)
Travel expense report (online form)Travel expense report (online form)
Travel expense report (online form)
 
Apos11 defining dt cut-offs using function [p p1-52 fri]
Apos11    defining dt cut-offs using function [p p1-52 fri]Apos11    defining dt cut-offs using function [p p1-52 fri]
Apos11 defining dt cut-offs using function [p p1-52 fri]
 
Fctcp Chw Trends 2 1 2007 10 7
Fctcp Chw Trends 2 1 2007 10 7Fctcp Chw Trends 2 1 2007 10 7
Fctcp Chw Trends 2 1 2007 10 7
 
Fctcp Chw Trends 2 13 2007 10 2
Fctcp Chw Trends 2 13 2007 10 2Fctcp Chw Trends 2 13 2007 10 2
Fctcp Chw Trends 2 13 2007 10 2
 
Everoil Presentation June 09
Everoil Presentation June 09Everoil Presentation June 09
Everoil Presentation June 09
 
Site Speed Tuneup: Putting Your Code On A Diet
Site Speed Tuneup: Putting Your Code On A DietSite Speed Tuneup: Putting Your Code On A Diet
Site Speed Tuneup: Putting Your Code On A Diet
 
Satisfaction Scores 3
Satisfaction Scores 3Satisfaction Scores 3
Satisfaction Scores 3
 
Modeling & Simulation
Modeling & SimulationModeling & Simulation
Modeling & Simulation
 
All india gsm figures july 2012 final (2)
All india gsm figures july 2012   final (2)All india gsm figures july 2012   final (2)
All india gsm figures july 2012 final (2)
 
Fctcp Chw Trends 2 13 2007 18 3
Fctcp Chw Trends 2 13 2007 18 3Fctcp Chw Trends 2 13 2007 18 3
Fctcp Chw Trends 2 13 2007 18 3
 
Wait, It's Actually SOCIAL Media?
Wait, It's Actually SOCIAL Media?Wait, It's Actually SOCIAL Media?
Wait, It's Actually SOCIAL Media?
 
Presentación1
Presentación1Presentación1
Presentación1
 
Flevy.com - Paycheck Stub Template
Flevy.com - Paycheck Stub TemplateFlevy.com - Paycheck Stub Template
Flevy.com - Paycheck Stub Template
 

QMC-based Shape Fingerprints

  • 1. Quasi-Monte Carlo integration for the fast and effective generation of molecular shape fingerprints John D. MacCuish Norah E. MacCuish, Michael Hawrylycz, and Mitch Chapman ACS San Francisco 2010 COMP Drug Discovery john.maccuish@mesaac.com
  • 2. Outline • Why Shape? Why Shape Fingerprints? • Prior Work 1.00 0.90 0.80 0.70 0.60 • quasi-Monte Carlo Integration 0.50 0.40 0.30 0.20 0.10 0.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 • Shape Fingerprint Generation • Performance, Examples • Future Work
  • 3. Why Shape? • Shape is a component in binding... • 3D QSAR • similarity searching • compound acquisition (shape diversity) • virtual screening... • Can be confounding -- multiple binding sites, surface binding... • One more tool in the toolbox... • Boost to 2D • Pluses and minuses: Molecular Shape and Medicinal Chemistry: A Perspective, JMedChem, Feb. 2010, Nicholls, et al
  • 4. Why Shape Fingerprints? • Compute shape comparisons efficiently on a large scale • Efficient storage • Simple.
  • 5. Prior Work • Shape as a mixture of spherical gaussians (atoms) (Grant and Pickup...’96) • Fingerprints composed of key shapes, given the above method (Haigh, et al...’05). • Ultrafast Shape Recognition: Ballester, Richards 2007 • ShaEP, Vainio, et al 2009
  • 6. Monte Carlo Integration • Approximate an integral (e.g., a 3D volume) by random sampling • Approximate volume of odd shapes or manifolds that are analytically difficult or have no closed form solution. • Better error convergence than grid sampling (Metropolis).
  • 7. quasi-Monte Carlo Integration (QMC) • In practice, quasi-randomly generated points have best error convergence in low dimensions. • Beats uniform random sampling (e.g., pseudo, dart throwing, etc.) • QMC became popular in early-90s in computer graphics (Shirley, ’91) mid-90s in finance - options/futures pricing (Morokoff, Caflisch, ’95)
  • 8. Approximate Volume quasi-random grid pseudo-random (low discrepancy) P 1.00 1.00 1.00 0.90 0.90 0.90 0.80 0.80 0.80 0.70 0.70 0.70 0.60 0.60 0.60 0.50 0.50 0.50 0.40 0.40 0.40 0.30 0.30 0.30 0.20 0.20 0.20 0.10 0.10 0.10 0.00 0.00 0.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 Error related to the Error related to just Error related to just number of points number of points number of points and the dimension
  • 9. Approximation Error • grid -- for error ε, need 1/εd grid points -- error convergence exponential in the dimension • pseudo -- 1⁄√n convergence • quasi -- 1/n convergence in practice in low dimensions • quasi-random fewest points needed for equivalent approximation error.
  • 10. Approximating the Volume of a molecule • E.g., CPK with van der Waals radii -- Set of intersecting spheres • Find a suitable bounding region for molecules • Point sample the bounding region and tally up the points either inside or outside of the atom spheres.
  • 11. Approximating the Volume of a molecule • Bounding Volume times (Points Inside)/(All Points) ~= molecular volume. • Sphere or scalene ellipsoid as a bounding region reduces total overall points.
  • 12. Volume Percent Error 95% below 95% below 4961 low energy 4.5% error 3.6% error conformations 1357 Molecules 100-550 MW flexible, inflexible balloon conformations 95% below 95% below 1-15 conformations 2% error 1.3% error
  • 13. Fingerprint Generation Preliminaries • Need a bounding region that covers the conformational space of the database of small molecule conformations. • Cube? Sphere? Scalene Ellipsoid. • Need orientation and alignment.
  • 14. Fewer points 1.00 0.90 0.80 0.70 0.60 Don’t need these points Don’t need these points 0.50 0.40 0.30 0.20 0.10 0.00 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00
  • 15. Fingerprint Generation 1. Generate a sample quasi-point set bounding region centered and fixed at (0,0,0); Sphere of ~11 Å radius. This is fixed, done once. 2. Mean center atom centers point set of a conformation to be fingerprinted. 3. Find sample points in the atom spheres (or function or your choice, e.g., solvent accessible surface ...) 4. Find the principal axes for a shape using the sampled points and the atom centers with SVD. Adjust for SVD sign ambiguity (Bro, et al, 2008).
  • 16. Fingerprint Generation 5. Rotate atom centers point set to Principal Axes (PA) 6. Find points in PA configuration and create fingerprint with 4 orientations -- points in molecule ‘1’, points out ‘0’ 7. Subfingerprints, each the length of the number of points.
  • 17. Molecule in a Volume of quasi-Random points
  • 21. Shape Fingerprints • 4 fingerprints per shape. e.g., 10,240 X 4 bits = 5.12 Kbytes. • Number and density of points determines resolution, speed of comparison, and storage size. • Number of bits on is small, typically 3-10% with CPK model. Fingerprints compress significantly.
  • 22. Fingerprint Similarity • Choose 1 subFP of Shape FP A and compare it with all 4 subFPs of Shape FP B. • The largest similarity comparison is chosen. • Bit difference per subFP is small in a given FP
  • 23. Fingerprint Similarity • Inherent slight asymmetry of alignment of point sets of different sizes. • Aligning B to A’s principal axes, vs aligning A to B’s principal axes. • Try this on your favorite method... • Use bit string similarity measure of your choice... Tanimoto, Ochai (Cosine), ..., Tversky,... Baroni-Urbani/Bush...
  • 24. Performance • With ~10K points and including IO: • Fingerprint generation: ~500K conformations per hour • Fingerprint Tanimoto comparisons: 130KX130K matrix in 24 hours • Large scale similarity searching trivial, e.g., 10X1M < 10 minutes • Numbers estimated with 2009 iMac, with an Intel Core i7 2.8GHz processor and 4GB RAM, running Mac OS X 10.6.2; small to large compounds;
  • 25. Space! • 5.12 Kbytes per fingerprint -- ~90% compression • 2 million conformations = 1+ GBs • Space and CPU improving all of the time...
  • 26. FP Experiments Typical all-pairs Tanimoto similarity distribution 1357, 2D 768 MACCS Keys fingerprints Where the action is: Similarity searching, Clustering, etc. above 0.7
  • 27. FP Experiments All-pairs ShapeFingerprints 4961 confs. Where the action is: Similarity searching, Still values above 0.7 Clustering, Alignment, etc. above ~0.65
  • 28. Examples • Multi-conformation set of actives in a secondary screen in ROCK2 from PubChem • Xray ROCK2 ligand • Fingerprint shape cluster • Analyze shape cluster that contains the xray structure with ChemTattoo 3D “Fragment database analysis using molecular shape fingerprints”, ACS San Francisco, Wednesday 8:40 , CINF 106
  • 30. Future Work • Other shape functions • Other low discrepancy point sets in different distributions -- density where it is needed. • Speed (time) and compression (space) • More practical applications
  • 31. Acknowledgments • JMol • Balloon • OpenBabel • R • PubChem • PDB john.maccuish@mesaac.com

Editor's Notes