Saravanan Anandathiyagar
Project Background Paper, March 2002
Supervisor: Simon Colton

A Substructure Server

Abstract

Much of the reason for the high cost of medicines is rooted in the length and complexity of the development and approval process. At every stage of development, it is possible that a potential drug (a leader) will fail to gain approval on the basis that it produces erratic results or harmful side effects. Predictive toxicology aims to reduce the money and time spent by identifying, as early in the drug development process as possible, leaders that are likely to fail. Numerous machine learning techniques exist to identify such leaders. Here we present a possible solution based on the Find Maximally Specific Hypothesis (Find-S) algorithm. Given a set of positive and negative examples of data, this algorithm finds substructures that are statistically true of the majority of positive compounds, and statistically not true of the negative compounds. A discussion of the algorithm and its motivation is presented here.
Contents

Abstract
Contents
1. Introduction
   1.1. Motivation
   1.2. Summary of Report
2. Previous Research
   2.1. Structure-Activity Relationships
   2.2. Attribute-based representations
   2.3. Relational-based representations
   2.4. Inductive logic programming
3. The Find-S Technique
   3.1. Motivation
   3.2. General-to-specific ordering of hypotheses
   3.3. The Find-S algorithm
   3.4. Algorithm evaluation methods
   3.5. Issues with the Find-S technique
   3.6. Existing Prolog implementation
4. Implementation Considerations
   4.1. Representing structures
   4.2. Improvement of current implementation
   4.3. Extensions
5. References
1. Introduction

1.1. Motivation

Each year, drug companies release new and improved drugs, claiming that they produce better results with fewer side effects. However, the cost of such advances is not small. Developing a drug from the theoretical stage to its appearance on pharmacy shelves normally takes in the region of 10 to 15 years, at an average cost of over £500 million [1]. This outlay must be covered by the consumer for the company to remain in profit, and evidence of this can be seen, for example, in the regular rise of NHS prescription charges.

Much of the reason for the high cost of medicines is rooted in the length and complexity of the development and approval process. At every stage of development, it is possible that a potential drug (a leader) will fail to gain approval on the basis that it produces erratic results or harmful side effects. Even after promising lab tests, further experiments on animal specimens often return ideas to the drawing board. It is estimated that for every drug that reaches the clinical (human) trial stage, another 1000 have failed earlier testing.

Despite this, it is important to note that medicines still reduce overall medical care costs by avoiding even more expensive hospitalisation, surgery or other treatments. Drugs are the primary way of controlling the outcomes of chronic illness, so the development of new drugs is important both for patient care and for its positive long-term financial implications.

It is clear that reducing the number of drug leaders developed at an early stage will have a significant effect in limiting development costs. Determining early that a leader is unsuitable for further testing saves the investment that might otherwise have been spent on it, only for the same conclusion to be reached.
For this reason, the field of predictive toxicology was born. It is an effort on the part of biotechnology companies to predict in advance whether or not a drug will be toxic, using techniques drawn from statistics, artificial intelligence (AI) and machine learning. Negative effects of a drug can range from relatively minor problems such as headaches and stomach upsets to potentially life-threatening organ damage. While many approved drugs do produce side effects for some patients, the value of the treatment is judged to outweigh them. However, there are certain characteristics of chemical compounds that will limit their effectiveness as drugs. Predictive toxicology aims to find this toxicity while the drug is still in the planning stages. Ruling out a leader at this early stage saves it being synthesised and tested, and allows resources to be focused on more promising areas of research.

Machine learning programs in a variety of guises have been used to try to discover why certain chemicals are toxic and others are not. Essentially, they learn a concept that is true of the toxic drugs and false of the non-toxic drugs. These derived concepts are usually small (around five or six atoms) sub-structures of the larger drug molecule, where some of the atoms are fixed elements and others may vary.

The task in hand is to effectively and efficiently identify such sub-structures using the Find Maximally Specific Hypothesis (Find-S) machine learning algorithm. An implementation of the algorithm has been written in Prolog by S. Colton; our work here is based on extending this implementation and producing a web-based server application. A molecule is said to be positive if it contains the sub-structure in question; conversely, it is said to be negative if it does not. The application will return interesting substructures given positive and negative molecules, where each substructure is true of statistically significantly more positives than negatives.

1.2. Summary of Report

This report is an overview of the research undertaken, with an outline of how implementation of a Substructure Server may proceed.

Section 2 summarises the machine learning techniques used in the field of predictive toxicology, and introduces the concepts of attribute-based and relational-based structure-activity relationships.

Section 3 is a comprehensive overview of the Find-S algorithm, with an emphasis on how it may perform in a predictive toxicology setting. A fictional example is presented and analysed which demonstrates the key steps of the technique. Evaluation techniques applicable both to the algorithm itself and to the results it produces are outlined, as well as various considerations that should be addressed on implementation. S. Colton's existing Prolog implementation of the algorithm is also discussed.

Section 4 highlights some implementation considerations, suggesting a possible course of action towards building a substructure server available for public use.
2. Previous Research

As mentioned above, machine learning algorithms that find relevant sub-structures have been applied in the field of predictive toxicology. It is important to understand the approaches taken in previous work, and to use them as a basis for further study. The key features of the background study undertaken are summarised in this section.

2.1. Structure-Activity Relationships

A structure-activity relationship (SAR) models the relationship between activities and physicochemical properties of a set of compounds [2]. The goal of our work is essentially to form SARs from the given input molecules. The resultant SARs represent the substructures most likely to contribute to toxicity, as calculated by our algorithm.

A SAR is derived from two components:

• the learning algorithm employed during derivation, and
• the choice of representation used to describe the chemical structure of the compounds being considered.

The learning algorithm used will rule out possible choices of representation, as the latter has to be rich enough to support the algorithm's procedure. SARs can store different information about compounds; typically such information (attributes) could consist of any of the following chemical properties [5]:

• Partial atomic charges
• CMR
• Surface area
• pKa, pKb
• Volume
• Hansch parameters π, σ, F
• H-bond donors/acceptors
• Molecular grids
• ClogP
• Polarisability

The exact nature or meaning of each attribute type need not be discussed here. It is, however, important to note that there are any number of ways of representing a compound, using any combination of the attributes given above (and more).

2.2. Attribute-based representations

A large variety of learning techniques are in use that derive SARs of different forms. The majority of these are based on examining the types of attributes listed above.
A short summary of a few of these techniques is presented here.
2.2.1. Linear and partial least-squares regression

Linear regression was the first learning algorithm employed in predictive toxicology, as detailed by Hansch et al. [3]. "Training" the system involves providing suitable training examples, which are simply saved to memory without being interpreted or compared in any way. It is on this stored information (as explicitly provided by the user) that regression aims to approximate its target function. In the context of predictive toxicology, this would involve supplying examples of positive compounds as training data. The procedure, when run on a new compound, retrieves a set of similar compounds from the stored values and uses these to classify the new compound. The analysis of the compounds is based on chemical attributes as specified by the algorithm; Hansch used global chemical properties of the molecule (LogP and π).

Least-squares regression is another learning technique involving the relationship between chemical attributes. Visually, it essentially entails forming a 'line of best fit' for a set of training data plotted against two variables x and y, where x and y are two chemical attributes. For any new compound encountered, a plot is made of the same two attributes; if the point produced lies within a fixed bound of the line of best fit, then the new compound can be deemed positive. The system can be extended to include multiple independent variables, and to give each variable a different weight – a measure of how important each attribute is compared with the others.

It is important to note that both these techniques make no attempt to interpret the training data as it is fed to them; all the processing needed to determine suitability criteria for new compounds happens only once a new compound has been encountered.

2.2.2. Decision trees

Decision trees classify the training data by considering each <attribute, value> pair (tuple) for a given compound [4]. Each node in the tree specifies a test of a particular attribute, and each branch descending from that node corresponds to a possible value for that attribute. A compound is classified as positive or negative at the leaf nodes of the tree. New compounds are classified by comparing their attribute values to those stored from the training data.

An implementation of this algorithm needs to address the critical issue of which attribute(s) to perform each test on. This decision could crucially alter the classification schema, and is a problem inherent in trying to separate objects into discrete sets when their behaviour or identity is given by a number of attributes. Any two attribute values could contradict each other on a particular classification scheme, and it then becomes necessary to impose some ordering or priority system over the attributes.
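The decision-tree classification just described can be sketched as follows. This is a minimal illustration, not the classifier used in any of the cited work: the attributes, their discretised values and the tree itself are all invented for the purpose of the example.

```python
# A minimal sketch of attribute-based decision-tree classification.
# The attributes ("h_bond_donors", "logp") and the tree structure are
# hypothetical; they are not drawn from any real SAR data set.

def classify(tree, compound):
    """Walk the tree: each internal node tests one attribute, each
    branch corresponds to one of that attribute's values, and each
    leaf gives the final positive/negative classification."""
    while isinstance(tree, dict):
        attribute, branches = tree["test"], tree["branches"]
        tree = branches[compound[attribute]]
    return tree

# Hypothetical tree: first test the number of H-bond donors, then logP.
tree = {
    "test": "h_bond_donors",
    "branches": {
        "low": "negative",
        "high": {
            "test": "logp",
            "branches": {"low": "negative", "high": "positive"},
        },
    },
}

print(classify(tree, {"h_bond_donors": "high", "logp": "high"}))  # positive
print(classify(tree, {"h_bond_donors": "low", "logp": "high"}))   # negative
```

Note how the attribute tested at the root ("h_bond_donors" here) dominates the classification – precisely the attribute-ordering problem raised above.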
2.2.3. Neural networks

Artificial neural networks (ANNs) provide a general and practical method for learning functions from examples [4], and have widespread use in AI applications. Predictive toxicology lends itself to the use of ANNs because compound attributes can be treated as <attribute, value> tuples, in a manner similar to that used for decision trees above. A compound can be represented by a list of such tuples covering the full range of attributes.

The simplest form of ANN is based on perceptrons, which take the list of tuples and calculate a 'score' for the compound. This score is calculated from a combination of the input tuples and a weight associated with each attribute. The algorithm can learn the weights from the training data by considering the attributes of positive compounds, and can then classify unknown compounds as positive or negative, depending on whether the score calculated exceeds a defined threshold.

Practical ANN systems usually implement the more advanced backpropagation algorithm, which learns the weights for a network of neural nodes arranged in multiple layers. However, the principle is the same as that used in the perceptron algorithm, with the compound score being calculated in a non-linear manner taking more variables into account.

2.3. Relational-based representations

The techniques mentioned above for deriving SARs all share one key concept: they are all based on attributes of the object (in our case, the chemical compound being examined). These attributes can be considered to be global properties of the molecules; for example, the molecular grid attribute maps points in space, which are global properties of the coordinate system used.
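The perceptron scoring scheme described in the neural networks discussion above can be sketched as follows. The weights, attribute names and threshold are all invented for illustration; a trained system would learn the weights from the positive training compounds.

```python
# Sketch of perceptron classification: a compound is a list of
# <attribute, value> tuples, each attribute has a weight, and the
# compound is classified positive when the weighted sum of its values
# exceeds a threshold. All numbers here are hypothetical.

def perceptron_classify(compound, weights, threshold):
    score = sum(weights[attr] * value for attr, value in compound)
    return "positive" if score > threshold else "negative"

weights = {"logp": 0.8, "polarisability": 0.3, "surface_area": -0.5}
compound = [("logp", 2.0), ("polarisability", 1.0), ("surface_area", 1.5)]
print(perceptron_classify(compound, weights, threshold=1.0))  # positive
```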
The tuple of attributes that has been used to represent the properties of the molecule is not an ideal format; it is difficult to map atoms and their bonds efficiently onto a linear list. A more general way to describe objects is to use relations. In a relational description, the basic elements are substructures and their associations [2]. This allows the spatial arrangement of the atoms within the molecule to be represented more accurately, directly and efficiently.

2.4. Inductive logic programming

Fully relational descriptions were first used in SARs with the inductive logic programming (ILP) learning technique, as shown in [6]. ILP algorithms are designed to learn from training examples encoded as logical relations. ILP has been shown to significantly outperform the feature (attribute) based induction methods described above [7]. ILP for SARs can be based on knowledge of atoms and their bond connectivities within a molecule. Using this scheme has a number of benefits:
• It is simple and powerful, and can be applied to any SAR in general.
• It is particularly well suited to forming SARs dependent on the relationship between the atoms in space (shape).
• Chemists can easily understand and interpret the resultant SARs, as they are familiar with relating chemical properties to groups of atoms.

The formal difference between the descriptive properties of attribute and relational SARs corresponds to the difference between propositional and first-order logic [2]. ILP involves learning a set of "if-then" rules from a training set, which can then be applied to unseen examples. Sets of first-order Horn clauses can be constructed to represent the given data rules, and these can be interpreted in the logic programming language Prolog.

ILP differs from the attribute-based techniques in two key areas. ILP can learn first-order rules that contain variables, whereas the earlier algorithms can only accept finite ground terms for attribute values. Further, ILP sequentially examines the data set, learning one rule at a time to incrementally grow the final set of rules.

We stated above that relational SARs can be described by first-order predicate logic. The PROGOL algorithm was developed [8] to allow the bottom-up induction of Horn clauses, and is implemented in Prolog. PROGOL uses inverted entailment to generalise a set of positive examples (active compounds) with respect to some background knowledge – atom and bond structure data, given in the form of Prolog facts. PROGOL will construct a set of "if-then" rules which explain the positive (and negative) examples given. In the case of predictive toxicology, these rules generally specify a sub-molecular structure of around five or six atoms.
These structures are those that have been calculated to contribute to toxicity, based on their presence in the set of positive training examples and their absence from the set of negative training examples. These sub-structures can then be matched against components of unseen compounds in an attempt to predict toxicity.
3. The Find-S Technique

3.1. Motivation

As mentioned previously, the focus of this project is to use the Find-S algorithm, as described below, to identify the sub-structures discussed at the end of section 2. Within the scope of predictive toxicology it may appear that Find-S and ILP do the same thing; however, this is not the case. The Find-S technique differs from ILP in the motivation behind the process. ILP looks for concepts that are true for positive examples and false for negative examples, and produces a sub-molecular structure as a result. The Find-S procedure, on the other hand, is given a template (by the user) to guide its search, and the program looks for all possibilities of that general shape in the positive inputs.

3.2. General-to-specific ordering of hypotheses

Any given problem has a predefined space of potential hypotheses [4], which we shall denote H. Consider a target concept T, whose truth value (1 or 0) depends upon the values of three attributes, a1, a2 and a3. Each attribute can take a range of discrete values, some combinations of which will make T true, while others will make T false. We denote the value x of an attribute an by v(an) = x.

We let each hypothesis consist of a conjunction of constraints on the attributes, i.e. the list of attribute values for that particular instance of the problem. This list of attributes (of length three in this case) can be held in a vector. For each attribute an, the value v(an) will take one of the following forms:

• ? – indicating that any value is acceptable for this attribute;
• ∅ – indicating that no value is acceptable for this attribute;
• a single required value for the attribute – e.g. for an attribute 'day of week', acceptable values would be 'Monday', 'Tuesday', etc.
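This constraint vector representation can be sketched directly in code. The sketch below is illustrative only (the real implementation is in Prolog): '?' stands for the any-value constraint, and Python's None stands in for the empty-set constraint ∅.

```python
# Sketch of the hypothesis representation: a hypothesis is a tuple of
# per-attribute constraints, where "?" accepts any value, None (playing
# the role of the empty-set symbol) accepts no value, and a literal
# requires an exact match.

ANY = "?"
NONE = None  # stands in for the empty-set constraint

def satisfies(hypothesis, instance):
    """True iff every attribute constraint in the hypothesis is met
    by the corresponding attribute value of the instance."""
    return all(
        c == ANY or (c is not NONE and c == value)
        for c, value in zip(hypothesis, instance)
    )

most_general  = (ANY, ANY, ANY)     # <?, ?, ?>
most_specific = (NONE, NONE, NONE)  # the all-empty-set hypothesis

print(satisfies(most_general,  ("Monday", "x", "y")))         # True
print(satisfies(most_specific, ("Monday", "x", "y")))         # False
print(satisfies(("Monday", ANY, "y"), ("Monday", "q", "y")))  # True
```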
With this notation, the most general hypothesis for T is

<?, ?, ?>

which states that any assignment to the three attributes will satisfy the hypothesis. Conversely, the most specific hypothesis for T is

<∅, ∅, ∅>

which states that no assignment to the attributes will ever satisfy the hypothesis.

All hypotheses within H can be represented in this way, with the majority falling somewhere between the two extremes of generality above. Indeed, hypotheses can be ordered by their generality, from most general to most specific. For example, consider the following two possible hypotheses for T:

h1 = <x, ?, y>
h2 = <?, ?, y>

Considering the two sets of instances that are classified positive by the two hypotheses, we can say that any instance classified positive by h1 will also be classified positive by h2, as h2 imposes fewer constraints. We say that h2 is more general than h1.

Formally, for two hypotheses hj and hk, we define hj to be more general than or equal to hk (written hj ≥g hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

Further, we define hj to be (strictly) more general than hk (written hj >g hk) if and only if

(hj ≥g hk) ∧ ¬(hk ≥g hj)

3.3. The Find-S algorithm

The Find-S technique orders hypotheses according to their generality, as explained in the previous section. The algorithm starts with the most specific hypothesis h possible within H. For each positive example it encounters in the training set, it generalises h (if needed) so that h correctly classifies the encountered example as positive. After considering all positive training examples, the resultant h is output. This is the most specific hypothesis in H consistent with the examined positive examples.

The algorithm can be defined more formally as follows [4]:

1. Initialise h to the most specific hypothesis in H.
2. For each positive training instance x:
   for each constraint v(ai) in h,
   if v(ai) is satisfied by x, then do nothing;
   else replace v(ai) in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.

The procedure is run with a different starting positive each time until all positives have been analysed.

There is a question over how to measure how specific a particular hypothesis is. This is dependent on the representation scheme; in first-order logic, for example, a more specific hypothesis will have more ground terms (fewer variables) in the logic sentence describing it than a less specific hypothesis.

3.3.1. A simple example

An example to illustrate how the algorithm could be used in predictive toxicology is presented below. It has been adapted from [9], and is fabricated in that the derived structure is not a real indicator of toxicity; the example simply illustrates the algorithm's process.

Training Data

Consider a training set of seven drugs, four of which are known positives and the remaining three known negatives. The molecules P1, P2, P3 and P4 represent positive examples, and N1, N2 and N3 represent negative ones. The atom labels (α, β, µ and ν) are used in place of possible real elements (e.g. N, C, H) to reinforce the notion that the example is purely fabricated.

Figure 1: Training set for the Find-S example (structural diagrams of positive molecules P1–P4 and negative molecules N1–N3, built from atoms labelled α, β, µ and ν)
At this stage, the chemist (user) must suggest a possible template on which to base the search for toxicity-inducing substructures. It is thought that a substructure of the form ATOM–ATOM–ATOM (with – representing a bond) contributes to toxicity. It is now the task of the algorithm to find sub-molecules matching this structure which exist in as many positives as possible, but in as few negatives as possible.

The Algorithm Procedure

To solve the problem, we use the Find-S method with the aim of producing solutions of the form <A, B, C>, where A, B and C are taken from the set of chemical symbols present in the molecules, i.e. {α, β, µ, ν}. However, we also need to look for general solutions where an atom in a particular position is not fixed. We therefore add ? to the previous set, giving {α, β, µ, ν, ?}.

We start off with the most specific hypotheses possible. Any final concept learned will have to be true of at least one positive example. We use this to produce our first set of triples:

<α, β, µ> and <β, µ, ν>

These are the two substructures that exist in P1 and match the template specified. We now check whether each of these substructures is true in the next molecule (P2). If they are not, then we generalise the substructure such that it becomes true in P2. This generalisation is done by introducing as few variables as possible. In doing this, we find the least general generalisations, which guarantees that our final answers are as specific as possible. The expanded set of substructures is then tested on P3 and, following the same procedure, on P4.
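The generalisation step just described can be sketched in code. This is an illustrative Python sketch only (the project's implementation is in Prolog), and for readability the Greek atom labels are replaced by Latin names; a fuller implementation would also prune the result set to keep only the least general generalisations.

```python
# Sketch of the generalisation step for ATOM-ATOM-ATOM triples: where a
# hypothesis and a candidate triple disagree in a position, that
# position is replaced by the variable "?", giving the least general
# generalisation (LGG) of the pair.

def matches(hypothesis, triple):
    """True iff the hypothesis covers the triple ("?" matches anything)."""
    return all(h == "?" or h == t for h, t in zip(hypothesis, triple))

def lgg(hypothesis, triple):
    """Least general generalisation of a hypothesis against a triple."""
    return tuple(h if h == t else "?" for h, t in zip(hypothesis, triple))

def generalise(hypotheses, molecule_triples):
    """For each hypothesis not already matched by the molecule, add the
    LGGs obtained against each of the molecule's candidate triples.
    (A full implementation would keep only the least general of these.)"""
    result = set(hypotheses)
    for h in hypotheses:
        if not any(matches(h, t) for t in molecule_triples):
            result |= {lgg(h, t) for t in molecule_triples}
    return result

print(lgg(("alpha", "beta", "mu"), ("alpha", "beta", "nu")))  # ('alpha', 'beta', '?')
print(matches(("beta", "?", "nu"), ("beta", "alpha", "nu")))  # True
```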
A trace of the intermediate results produced is shown here. Each stage lists the hypotheses held after analysing that molecule (the original table showed substructures carried over from an earlier stage with a greyed-out background):

After P1: <α, β, µ>, <β, µ, ν>
After P2: the above, plus <α, β, ?>, <β, ?, ν>
After P3: the above, plus <?, β, µ>, <?, β, ?>, <α, ?, ?>, <β, ?, ?>, <?, ?, ν>
After P4: no change

Note that no new substructures are produced on analysis of P4 – all the substructures produced after analysis of P3 match components of P4 exactly, without the need for generalisation.

Evaluation of Results

The algorithm has now returned nine possible hypotheses for substructures that determine toxicity. These can be scored, based on:

• how many positive molecules contain the substructure derived, and
• how many negative molecules do not contain the substructure derived.

The calculated scores are given below (the original table also marked which of P1–P4 and N1–N3 each hypothesis classifies correctly):

  Hypothesis       Accuracy
  1  <α, β, µ>     43%
  2  <β, µ, ν>     57%
  3  <α, β, ?>     57%
  4  <β, ?, ν>     86%
  5  <?, β, µ>     57%
  6  <?, β, ?>     57%
  7  <α, ?, ?>     43%
  8  <β, ?, ?>     57%
  9  <?, ?, ν>     57%

It can be seen that the most accurate hypothesis derived is number four, <β, ?, ν>. This is statistically the most frequent substructure (of the form ATOM–ATOM–ATOM) that occurs in the positives but not in the negatives. This structure can then be used to predict the toxicity of unseen compounds; other compounds containing a match for hypothesis four are statistically likely to be toxic.

For a complete implementation of the algorithm, the procedure should be repeated with P2 as the initial positive, generalising on the others, and likewise with P3 and P4 as initial positives.

3.4. Algorithm evaluation methods

On obtaining a 'result' from the Find-S algorithm, i.e. a hypothesis (or set of hypotheses) representing a sub-molecule thought most likely to contribute to toxicity, it is desirable to have some certainty that the result obtained is indeed accurate. We want the promising results obtained with the training set to extend to unseen examples. There is no way to guarantee the accuracy of a hypothesis; however, there are accepted methods and measures through which a user can become more confident in the results obtained.

In our example above, the 'best' hypothesis had a (predicted) accuracy of 86%, calculated as the number of correctly classified positives and negatives over the total number of compounds analysed. However, this figure is based purely on the examples that the hypothesis has already seen; it is not a strong indicator of accuracy on unseen examples.

3.4.1. Cross validation

One possible way of addressing this is to reserve some examples from the training set, and then subsequently use these reserved examples as tests on the derived hypothesis.
The results of the hypothesis applied to the reserved examples can then be compared with their actual categorisation, which is known because they were provided as part of the training set. This cross validation is a standard machine learning technique, and the splitting of the initial example data into a training set and a test set can give the user more confidence that the derived hypothesis will be accurate and of use. Of course, it can have the opposite effect, with a user finding that the derived hypothesis in fact performs poorly on genuinely unseen examples.

3.4.2. K-fold cross validation
It is often of importance and interest to measure the performance of the learning algorithm itself, and not just of a specific hypothesis. A technique to achieve this is k-fold cross validation [4]. This involves partitioning the data into k disjoint subsets of equal size. There are then k training and testing rounds, with each subset successively acting as the test set and the remaining k−1 subsets as the training set. The average accuracy rate can then be calculated over the k independent test runs.

This technique is typically used when the number of data objects is in the region of a few hundred, and the size of each subset is at least thirty. This ensures that the tests provide reasonable results, as having too few test examples would produce skewed accuracy figures. As each round is performed independently, there is no guarantee that the hypothesis generated on one training round will be the same as that generated on another. It is for this reason that the overall accuracy figures generated are representative of the algorithm as a whole, not just of one particular result.

3.5. Issues with the Find-S technique

As with all machine learning techniques, Find-S has some factors that encourage its use, and others that make it less favourable. Some of these considerations are discussed here.

3.5.1. Guarantee of finding the most specific hypothesis

As the name of the algorithm suggests, the process is guaranteed to find the most specific hypothesis within the hypothesis space that is consistent with the positive training examples. This is a consequence of selecting the least general generalisations when analysing compounds. This property can be viewed as being both advantageous and disadvantageous.
It is sometimes useful for users to know as much information about the substructure as possible, and this may enable them to better understand the chemical reason for the molecule's toxicity. However, in the case of an example deriving multiple hypotheses consistent with the training data, the algorithm would still return the most specific, even though the others have the same statistical accuracy. Further, it is possible that the process derives several maximally specific consistent hypotheses [4]. To account for this possible case, we would need to extend the algorithm to allow backtracking at the choice points for generalisation. This would find target concepts along a different branch to that first explored.

1.11.2. Overfitting

Overfitting, often thought of as the problem of an algorithm memorising answers rather than deducing concepts and rules from them, is inherent in many machine learning techniques. A particular hypothesis is said to overfit the training examples when some other
hypothesis that fits the training examples less well actually performs better over the whole set of instances (i.e. including non-training instances). Overfitting can occur when the number of training examples used is too small and does not provide an illustrative sample of the true target function. It can also occur when there are errors in the example data, known as noise. Noise has a particularly detrimental effect on the Find-S algorithm, as explained below.

1.11.3. Noisy data

Any non-trivial set of data taken from the real world is subject to a degree of error in its representation. Mistakes can be made in analysing the data and categorising examples, or in translating information from one form to another, and repeated data may be inconsistent with itself. In machine learning terms, such errors in the data are termed noise. While certain algorithms are fairly robust to noise, the Find-S technique is inherently not so. This is because the algorithm effectively ignores all negative examples in the training data. Generalisations are made to include as many positive examples as possible, but no attempt is made to exclude negatives. This in itself is not a problem; if the data contains no errors, then the current hypothesis can never require a revision in response to a negative example [4]. However, the introduction of noise changes this situation: it may no longer be the case that the negative examples can simply be ignored. Find-S makes no effort to accommodate these possible inconsistencies in the data.

1.11.4. Parallelisability

The Find-S algorithm lends itself well to a parallel, distributed implementation, which would speed up computation.
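One hedged sketch of such a scheme, in which each worker derives a hypothesis from a different initial positive: a thread pool is used here for simplicity (a process pool would give true CPU parallelism), and `derive_hypothesis` is a simplified stand-in for one complete Find-S run, not the actual algorithm.

```python
# Sketch of a parallel Find-S scheme: each derivation from a distinct
# initial positive is independent, so the derivations can be farmed out
# to a pool of workers. derive_hypothesis is an illustrative stand-in
# for one complete Find-S run seeded with start_positive.
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def derive_hypothesis(start_positive, positives):
    # Generalise the seed against every positive (illustrative only).
    hypothesis = list(start_positive)
    for attrs in positives:
        for i, value in enumerate(attrs):
            if hypothesis[i] != value:
                hypothesis[i] = "?"
    return tuple(hypothesis)

def derive_all(positives, workers=4):
    # One independent derivation per possible start positive.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        job = partial(derive_hypothesis, positives=positives)
        return list(pool.map(job, positives))
```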
A parallel implementation could involve individual processors being allocated different initial positives; recall from above that the algorithm is only complete when hypotheses have been derived using each possible start positive. The derivation of any particular hypothesis from an initial positive can be run independently, and hence in parallel with the other derivations.

1.12. Existing PROLOG implementation

S. Colton has implemented an initial version of the Find-S algorithm in PROLOG. This relatively compact program (approximately 300 lines of code) identifies substructures from the sample data set used by King et al. [2]. The program is guided by substructure templates, of which a few have been hard coded. It has recreated some of the results produced on this data set by the ILP program PROGOL. The program can take parameters to specify the minimum number of ground terms that must appear in a resultant hypothesis (i.e. to limit variables), the minimum number of positive molecules for which a hypothesis should return TRUE, and the maximum number of negative molecules for which it may return TRUE.
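One plausible reading of these two coverage thresholds can be sketched as a filter on candidate hypotheses. The names, and the boolean `covers` predicate, are assumptions for illustration, not the PROLOG program's actual interface:

```python
# Hedged sketch of the coverage thresholds described above: a candidate
# hypothesis is kept only if it covers at least min_pos positive
# molecules, and wrongly covers at most max_neg negative ones.

def acceptable(covers, positives, negatives, min_pos, max_neg):
    true_pos = sum(1 for molecule in positives if covers(molecule))
    false_pos = sum(1 for molecule in negatives if covers(molecule))
    return true_pos >= min_pos and false_pos <= max_neg
```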
An important point for discussion here is the representation of the background and structural data. Information about the molecules is stored as a series of facts in a PROLOG database. The representation is identical to that suggested in the section on inductive logic programming, and involves storing information about the atoms and the bonding between them. The data stored for even a single molecule is extensive; however, these PROLOG facts can be generated automatically, as mentioned in section 4.1.
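The flavour of this atom-and-bond representation can be sketched outside PROLOG as plain tuples. The field layout below (element symbols, partial charges, numeric bond type codes) is an assumption modelled loosely on the data of King et al. [2], not the program's exact format:

```python
# Atom facts: (molecule_id, atom_id, element, partial_charge).
atoms = [
    ("d1", "d1_1", "c", -0.117),
    ("d1", "d1_2", "c", -0.117),
    ("d1", "d1_3", "o",  0.812),
]

# Bond facts: (molecule_id, atom_id_1, atom_id_2, bond_type).
bonds = [
    ("d1", "d1_1", "d1_2", 7),   # 7 = aromatic (illustrative coding)
    ("d1", "d1_2", "d1_3", 2),   # 2 = double bond (illustrative coding)
]

def neighbours(atom_id, bond_facts):
    """Atom ids bonded to atom_id, read straight off the bond facts."""
    found = []
    for _molecule, a1, a2, _bond_type in bond_facts:
        if a1 == atom_id:
            found.append(a2)
        elif a2 == atom_id:
            found.append(a1)
    return found
```

Queries such as `neighbours("d1_2", bonds)` then recover the connectivity of the molecule directly from the stored facts, much as the PROLOG program queries its database.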
4. Implementation Considerations

The Find-S algorithm has been discussed at length as it represents the core component of a system to identify substructures. However, the initial remit was to create a substructure server, whereby users would be able to identify potentially interesting substructures from their positive and negative examples. As such, other considerations need to be examined, and these are summarised here.

1.13. Representing structures

There is a conflict between the natural user representation of chemical structures and the representation that is useful to the implemented algorithm. In a sense, the users' view of structures must be parsed into the computer view (first order logic) at some stage, either manually by the user, or by the implemented software as pre-processing for the Find-S algorithm. It is clearly more desirable from the users' position that this conversion is automated. The feasibility of this is briefly discussed here.

Chemists are often concerned with modelling compounds, and the industry standard modelling software is QUANTA [9]. King et al. in [2] used QUANTA editing tools to automatically map a visual representation of a molecule into first order logic. After some suitable pre-processing, this mapped representation could be read by their PROGOL program as a series of facts. Another molecular simulation program, CHARMM [10], stores information about the molecule being simulated in data files. These data files use standard naming and referencing techniques, as described by the Protein Data Bank [11]. The structure of these flat text files is conducive to translation into other formats, given the development of suitable schemas.

1.14. Improvement of the current implementation

S. Colton's current implementation of the Find-S algorithm can serve as a basis for further work.
The algorithm could be recoded in a modern object oriented language, which would facilitate parallelising it and packaging it as a web-based application. One key improvement would be the introduction of new search templates. These templates guide the algorithm, restricting its search to sub-molecules matching the specified template. Currently only a small number of templates are implemented; it is desirable that more be made available to the user.

1.15. Extensions

As advanced work in this area, further extensions to those suggested above are possible. Implementing the algorithm in parallel is one such extension. This would speed up the potentially highly complex and time-consuming derivation of hypotheses.
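Returning to the search templates of section 1.14, their role can be sketched as a simple admissibility test: a candidate sub-molecule is considered only if it fits a template. The list-of-element-slots encoding below, with `None` as a wildcard, is a deliberate simplification of the real templates:

```python
# A template is sketched here as a list of element slots; None matches
# any element. Candidates failing the test are pruned from the search.

def matches_template(elements, template):
    """True if a chain of element symbols fits the template."""
    if len(elements) != len(template):
        return False
    return all(slot is None or slot == element
               for element, slot in zip(elements, template))

def candidates(sub_molecules, template):
    # Restrict the search to sub-molecules matching the template.
    return [m for m in sub_molecules if matches_template(m, template)]
```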
There is also scope for the generated hypotheses to be presented in different formats. While an answer returned in first order logic may be strictly accurate, it is unlikely to be of much use to a user with little or no knowledge of computational logic. Molecular visualisation programs such as RASMOL and the later PROTEIN EXPLORER [12] exist that can take as input data in a format similar to that produced by QUANTA or CHARMM. It would be desirable for the user to be able to view the resultant hypotheses, with the sub-molecule derived by the algorithm presented visually.
5. References

[1] Ellis, L., Aetna InteliHealth Drug Resource Centre, From Laboratory To Pharmacy: How Drugs Are Developed, 2002. http://www.intelihealth.com/IH/ihtIH/WSIHW000/8124/31116/346361.html?d=dmtContent
[2] King, R. D., Muggleton, S. H., Srinivasan, A. & Sternberg, M. J. E., Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming (1996), Proceedings of the National Academy of Sciences (USA) 93, 438-442
[3] Hansch, C., Maloney, P. P., Fujita, T. & Muir, R. M., Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients (1962), Nature (London) 194, 178-180
[4] Mitchell, T. M., Machine Learning, International Edition (1997), McGraw-Hill
[5] Glen, B., Molecular Modelling and Molecular Informatics, University of Cambridge – Centre for Molecular Informatics, www-ucc.ch.cam.ac.uk/colloquia/rcg-lectures/A4
[6] Muggleton, S., Inductive Logic Programming (1991), New Generation Computing 8, 295-318
[7] Srinivasan, A., Muggleton, S. H., Sternberg, M. J. E. & King, R. D., Theories for mutagenicity: a study in first-order and feature-based induction (1996), Artificial Intelligence 85(1,2), 277-299
[8] Muggleton, S., Inverse Entailment and Progol (1995), New Generation Computing 13, 245-286
[9] Colton, S. G., Lecture 11 – Overview of Machine Learning, Imperial College London, 2003. http://www2.doc.ic.ac.uk/~sgc/teaching/341.html
[9] QUANTA software, Accelrys Inc. http://www.accelrys.com/quanta/
[10] Chemistry at HARvard Macromolecular Mechanics (CHARMM). http://www.ch.embnet.org/MD_tutorial/pages/CHARMM.Part1.html
[11] The Protein Data Bank. http://www.rcsb.org/pdb
[12] RasMol Home Page. http://www.umass.edu/microbio/rasmol/
