Chemical Database Preparation
for Compound Acquisition or
Virtual Screening
Lalit Samant
Research Officer
B J WADIA HOSPITAL FOR CHILDREN
Virtual Screening
• AIM:-
1. HTS
2. Biologically active
3. Rapid
4. Effective
Cont.
• The progression HTS hits → HTS actives → lead
series → drug candidate → launched drug has
shifted the focus from good-quality candidate
drugs to good-quality leads (10).
• A set of simple property filters known as the “rule
of five” (Ro5) (11) is implemented in the
pharmaceutical industry to restrict small-molecule
synthesis to the property space defined by ClogP
(octanol/water partition coefficient), molecular
weight, and hydrogen-bond donor and acceptor counts.
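The Ro5 check can be sketched as a count of violated thresholds. This is a minimal illustration, assuming the properties have already been computed by one of the tools listed under Materials; the class name `Props`, the function names, and the tolerance of one violation are assumptions made for the example, and the property values shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed properties for one compound (field names are illustrative)."""
    mol_weight: float  # molecular weight
    clogp: float       # calculated octanol/water partition coefficient
    hbd: int           # hydrogen-bond donors
    hba: int           # hydrogen-bond acceptors


def ro5_violations(p: Props) -> int:
    """Count how many of the four Ro5 thresholds are exceeded."""
    return sum([
        p.mol_weight > 500,
        p.clogp > 5,
        p.hbd > 5,
        p.hba > 10,
    ])


def passes_ro5(p: Props, max_violations: int = 1) -> bool:
    """Assumption: a compound is kept if it has at most `max_violations`."""
    return ro5_violations(p) <= max_violations


# Hypothetical compounds, not measured data.
small = Props(mol_weight=310.0, clogp=2.1, hbd=2, hba=5)
large = Props(mol_weight=720.0, clogp=6.3, hbd=6, hba=12)
print(ro5_violations(small), passes_ro5(small))
print(ro5_violations(large), passes_ro5(large))
```

Setting `max_violations=0` gives a stricter filter; the right tolerance depends on the project.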
Conditions to Consider for Library
Design
• Many library design programs based on
combinatorial chemistry or compound
acquisition are now Ro5 compliant.
• Smaller compounds are easier to optimize
toward drug candidate status, and
leadlikeness has become an established
concept in drug discovery.
Materials
1. Software to convert chemical structures based on standard file
formats (e.g., SDF, mol2) into canonical isomeric SMILES (15,16), or
equivalent representations of chemical structures.
2. Software to handle canonical isomeric SMILES (or equivalent)
and provide chemical fingerprints, e.g., Daylight (19), Unity (20), Mesa
Analytics and Computing (21), Barnard Chemical Information (22).
3. Software to compute chemical properties from structures, e.g., to
calculate the octanol/water partition coefficient, LogP, with CLogP,
KowWIN, or ALogPS.
4. Software to cluster chemical structures from fingerprints or from
computed properties.
Cont.
5. Software to convert SMILES (or equivalent)
into appropriate three-dimensional (3D)
coordinates using CONCORD.
6. Software to appropriately handle D-optimal
design based on multidimensional spaces.
Methods
1. Assembling the Collection(s)
Large pharmaceutical companies have acquired
compound collections, the Reals, that contain a
significant number of molecules, including
marketed drugs and other high-activity
compounds. The Reals are a valuable resource that
is routinely screened against novel targets.
Cont. Assembling
• Such collections of structures must also include existing
sets of commercially available chemicals, or Tangibles,
termed this way because one can conceivably acquire
them or synthesize them in-house using tractable
chemistry.
• Thus, any collection prepared for virtual screening or HTS
would sample both the in-house and the “external” chemical
spaces. In addition to the Reals and the Tangibles, one
can also define the Virtuals: an extremely large set of
molecules (10^60–10^200) that cannot all be made, at
least with current chemistry, but that can be used as a
“resource” for virtual screening.
Methods
2. Cleaning up the collection
There is no “perfect” chemical database, unless
it contains rather simple molecules (e.g., NaCl, H2O)
or only a small number of them. The user
needs to spend significant effort cleaning up
the collection, whether it includes Virtuals,
Reals, or Tangibles.
Cleaning up Cont.
2.1. Removing Garbage From the Collection
2.2. Verifying Integrity of Molecular Structure
2.3. Generation of Unique, Normalized SMILES
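The uniqueness step can be sketched as follows. The sketch assumes that each record already carries a canonical isomeric SMILES string produced by one of the conversion tools listed under Materials; it does not canonicalize anything itself. The function name `deduplicate` and the record layout (identifier, SMILES) are hypothetical.

```python
def deduplicate(records):
    """Keep the first record for each canonical SMILES string.

    `records` is an iterable of (compound_id, canonical_smiles) pairs.
    Records with a missing or empty SMILES are dropped, since their
    structure cannot be verified.
    """
    seen = set()
    unique = []
    for compound_id, smiles in records:
        key = (smiles or "").strip()
        if not key:
            continue  # no usable structure
        if key in seen:
            continue  # duplicate structure, keep only the first occurrence
        seen.add(key)
        unique.append((compound_id, key))
    return unique


records = [
    ("A", "CCO"),
    ("B", ""),          # empty structure
    ("C", "CCO"),       # duplicate of A
    ("D", "c1ccccc1"),
]
print(deduplicate(records))
```

Comparing strings is only meaningful if every SMILES comes from the same canonicalization procedure; two toolkits can write the same molecule differently.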
3. Filtering for Lead-Likeness
• After cleanup, the collection can be processed
to remove compounds that do not have
leadlike properties.
• It is advisable to cluster the remaining
“nonleadlike” set and to include a
representative set of these compounds (up to
30%), because they are likely to capture
additional chemotypes.
Suggestions for exclusions according to
leadlikeness are as follows:
1. More than four rings.
2. More than three fused aromatic rings (avoid polyaromatic rings, because they
are likely to be processed by cytochrome P450 enzymes and yield epoxides and
other carcinogens).
3. HDO (hydrogen-bond donors) more than 4; HDO ≤ 5 is one of the Ro5 criteria,
but 80% of drugs have HDO less than 3.
4. More than four halogens, except fluorine (avoid “pesticides”). A notable
exception is the crop-protectant business; in such situations, the collection must
be processed with entirely different criteria.
5. More than two CF3 groups (avoid highly halogenated molecules).
6. Presence of fragments responsible for
cytotoxicity.
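The exclusion criteria above can be sketched as a rule check over precomputed structural counts. The class and field names are hypothetical, the counts are assumed to come from a cheminformatics toolkit, and the halogen rule is read here as "more than four halogens other than fluorine", which is one interpretation of criterion 4.

```python
from dataclasses import dataclass


@dataclass
class StructureCounts:
    """Precomputed counts for one compound (field names are illustrative)."""
    rings: int
    fused_aromatic_rings: int
    hbd: int                  # hydrogen-bond donors (HDO)
    heavy_halogens: int       # Cl, Br, I; fluorine is not counted (assumption)
    cf3_groups: int
    cytotoxic_fragment: bool  # flagged by a substructure search


def exclusion_reasons(c: StructureCounts) -> list:
    """Return the leadlikeness criteria a compound fails (empty list = keep)."""
    reasons = []
    if c.rings > 4:
        reasons.append("more than four rings")
    if c.fused_aromatic_rings > 3:
        reasons.append("more than three fused aromatic rings")
    if c.hbd > 4:
        reasons.append("more than four H-bond donors")
    if c.heavy_halogens > 4:
        reasons.append("more than four non-fluorine halogens")
    if c.cf3_groups > 2:
        reasons.append("more than two CF3 groups")
    if c.cytotoxic_fragment:
        reasons.append("cytotoxic fragment")
    return reasons


# Hypothetical compound that fails three criteria.
compound = StructureCounts(rings=5, fused_aromatic_rings=2, hbd=5,
                           heavy_halogens=0, cf3_groups=3,
                           cytotoxic_fragment=False)
print(exclusion_reasons(compound))
```

Returning the reasons rather than a plain yes/no makes it easier to cluster the excluded “nonleadlike” set later and to adjust thresholds for a particular target.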
Important Note:-
• The collection may require different processing
criteria for different targets and discovery
goals.
• E.g., targets located in the lung require a
different pharmacokinetic profile (e.g., for
inhalation therapy) compared with
targets located in the urinary tract, which may
require good aqueous solubility at pH = 5.0.
Methods cont.
3.4. Searching for Similarity If Known Active
Molecules are Available
3.5. Exploring Alternative Structures
The user should seek alternative structures by
modifying the canonical isomeric SMILES, because
these may occur in solution or at the
ligand-receptor interface:
a. Tautomerism
b. Acid/base equilibria
c. Chiral centers
Exploring alternative structures is advisable prior to
processing any collection with computational
means, such as for diversity analysis.
3.6. Generating 3D Structures
• Exploring one or more conformers per
molecule is essential.
3.7. Selecting Chemical Structure Representatives
Screening compounds that are similar to known actives
increases the likelihood of finding new active compounds, but
it may not lead to different chemotypes, which are highly
desirable in the industrial context. This limitation is more
severe if the original actives are covered by
third-party patents or if the lead chemotype is toxic.
Clustering methods aim at grouping molecules into “families”
(clusters) of related structures that are perceived, at a given
resolution, to be different from other chemical families.
With clustering, the end user has the ability to select one or
more representatives from each family. Statistical molecular
design (SMD) methods aim at sampling various areas of chemical
space and selecting representatives from each area.
3.7.1 Chemical descriptors
• Chemical descriptors are used to encode
chemical structures and properties of
compounds: 2D/3D binary fingerprints or counts
of different substructural features, or
perhaps (computed) physicochemical properties
(e.g., molecular weight, CLogP, HDO, HAC), as
well as other types of steric, electronic,
electrostatic, topological, or hydrogen-bonding
descriptors.
3.7.2. Similarity (Dissimilarity)
Measure
• Chemical similarity is used to quantify the “distance”
between a pair of compounds (dissimilarity, or 1 −
similarity), or how related the two compounds are
(similarity).
• The basic tenet of chemical similarity is that molecules
exhibiting similar features are expected to have similar
biological activity (46).
• Similarity is, by definition, related to a particular
framework: that of a descriptor system (a metric by
which to judge similarity), as well as that of an object
or class of objects, because a reference point with which
objects can be compared is needed (47).
• Similarity depends on the choice of molecular
descriptors (48), the choice of the weighting scheme(s), and
the similarity coefficient.
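As an illustration of a similarity coefficient and the corresponding dissimilarity (1 − similarity), the sketch below uses the Tanimoto coefficient on binary fingerprints. The slides do not name a specific coefficient, so Tanimoto is an assumption chosen because it is widely used; fingerprints are represented as sets of "on" bit positions, and the bit values are made up.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def dissimilarity(fp_a: set, fp_b: set) -> float:
    """Distance between two compounds, defined as 1 - similarity."""
    return 1.0 - tanimoto(fp_a, fp_b)


# Hypothetical fingerprints: 2 shared bits out of 6 distinct bits.
fp1 = {1, 2, 3, 4}
fp2 = {3, 4, 5, 6}
print(tanimoto(fp1, fp2))       # 2 / 6
print(dissimilarity(fp1, fp2))  # 1 - 2 / 6
```

The same pair of molecules can score very differently under another descriptor set or coefficient, which is the dependence noted in the last bullet above.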
3.7.3. Clustering Algorithms
• Clustering algorithms can be classified using many criteria
and also implemented in different ways (29–32).
Hierarchical clustering methods have traditionally been
used to a greater extent, in part owing to computational
simplicity. More recently, nonhierarchical methods are being
examined for chemical structure classification. In practice, the
individual choice of different factors (descriptors,
similarity measure, clustering algorithm) depends also on
the hardware and software resources available, the size
and diversity of the collection that must be clustered, and,
not least, on the user's experience in producing a
useful classification that has the ability to predict property
values.
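A nonhierarchical method can be sketched with a simple leader algorithm: each compound joins the first cluster whose leader is similar enough, otherwise it starts a new cluster. This is one illustrative choice, not the method prescribed by the slides; the Tanimoto coefficient, the 0.5 threshold, and the fingerprint values are all assumptions.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient on sets of on-bits (0.0 if both are empty)."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def leader_cluster(fingerprints: dict, threshold: float = 0.5) -> dict:
    """Leader clustering.

    `fingerprints` maps compound id -> set of on-bits.
    Returns a dict mapping each leader id -> list of member ids
    (the leader is the first member). The result depends on input order.
    """
    clusters = {}
    for compound_id, fp in fingerprints.items():
        for leader_id in clusters:
            if tanimoto(fp, fingerprints[leader_id]) >= threshold:
                clusters[leader_id].append(compound_id)
                break
        else:
            # No existing leader is similar enough: start a new cluster.
            clusters[compound_id] = [compound_id]
    return clusters


# Hypothetical fingerprints forming two families.
fps = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {10, 11, 12},
    "d": {10, 11, 13},
}
clusters = leader_cluster(fps, threshold=0.5)
print(clusters)
print(list(clusters))  # one representative (the leader) per family
```

The leader algorithm needs only one pass over the collection, which suits large sets, but the clusters change if the compounds are presented in a different order.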
3.7.4. Statistical Molecular Design
• SMD can be applied to rationally select
collection representatives, as illustrated for
building block selection in combinatorial
synthesis planning (55).
3.8. Assembling List of Compounds for
Acquisition or Virtual Screening
• Once provided with an output from one or
several methods for compound selection, the
now-selected collection representatives are
almost ready to be submitted for acquisition
or for virtual screening. The end user is
encouraged to allow nonleadlike molecules to
be reentered into the candidate pool.
• An additional random, perhaps nonleadlike
selection (up to 30%) can, and should, be
entered in the final list of compounds.
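The final assembly step can be sketched as appending a random draw from the nonleadlike pool to the selected representatives. The slides do not say what the 30% is measured against; this sketch assumes it is relative to the number of selected representatives. The function name, the percentage parameter, and the compound identifiers are hypothetical.

```python
import random


def assemble_final_list(selected, nonleadlike, extra_percent=30, seed=0):
    """Append a random nonleadlike subset to the selected representatives.

    The number of extras is at most `extra_percent` percent of
    len(selected) (an assumption), capped by the size of the pool.
    A fixed seed makes the draw reproducible.
    """
    chosen = set(selected)
    # Only consider compounds that are not already on the list.
    pool = [c for c in nonleadlike if c not in chosen]
    n_extra = min(len(pool), len(selected) * extra_percent // 100)
    extras = random.Random(seed).sample(pool, n_extra)
    return list(selected) + extras


selected = [f"lead_{i}" for i in range(10)]
nonleadlike = [f"other_{i}" for i in range(20)]
final = assemble_final_list(selected, nonleadlike)
print(len(final))  # 10 selected + 3 random extras
```

A cluster-based pick from the nonleadlike set, as suggested in the filtering step, could replace the purely random draw when additional chemotypes are the main goal.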
Summary
1. Assemble the collection starting from in-house and on-line databases.
2. Clean up the collection by removing “garbage,” verifying structural
integrity, and making sure that only unique structures are screened.
3. Perform property filtering to remove unwanted structures based on
substructures, property profiling, or various scoring schemes; the
collection can become the virtual screening set at this stage, or it can be
further subdivided in a target- and project-dependent manner.
4. Use similarity to given actives to seek compounds with related
properties.
5. Explore the possible stereoisomers, tautomers, and protonation states.
6. Generate the 3D structures in preparation for virtual screening, or for
computation of 3D descriptors.
7. Use clustering or SMD to select compound representatives for
acquisition.
8. Add a random subset to the final list of compounds. The final list can
now be submitted for compound acquisition or virtual screening.
THANK YOU !!!
