Artificial intelligence techniques in bioinformatics
1. The application of artificial intelligence techniques in bioinformatics
Presented by: Pallavi Vashistha
2. Outline
• Bioinformatics Today
• Artificial Intelligence application
• Examples:
Symbolic machine learning
Nearest neighbour approach
Clustering
Identification trees
• Major Challenges and Research Issues
3. History of Bioinformatics
Year   Organism                  MBP (millions of base pairs)
1995   Haemophilus influenzae    1.8
1996   Baker's yeast             12.1
1997   E. coli                   4.7
2000   Pseudomonas aeruginosa    6.3
2000   A. thaliana               100
2000   D. melanogaster           180
2001   Human genome              3,000
2002   House mouse               2,500
5. Artificial Intelligence application
There are several important problems where AI
approaches are particularly promising:
• Prediction of Protein Structure
• Semiautomatic drug design
• Knowledge acquisition from genetic data
6. Artificial Intelligence application
How to classify biological sequences
• SVM (support vector machine), Neural Nets,
Decision Trees, Rules
How to cluster biological entities
• Bi-clustering, K-means, hierarchical
How to select features
• LDA (Linear Discriminant Analysis), PCA (Principal
Components Analysis), SVM-RFE (recursive feature
elimination)
7. Nearest neighbour approach
Decision tree:
• each node is connected to a set of possible answers,
• each non-leaf node is connected to a test which splits
its set of possible answers into subsets corresponding
to different test results,
• each branch carries a particular test result’s subset
to another node.
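The three properties above can be sketched as a small data structure. This is a minimal illustration only: the `Node` class, the `descend` helper and the width test are hypothetical names and values, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class Node:
    # Every node is connected to a set of possible answers.
    answers: Set[str]
    # Only non-leaf nodes carry a test; it maps a sample to a test result.
    test: Optional[Callable[[dict], str]] = None
    # Each branch carries one test result's subset to another node.
    branches: Dict[str, "Node"] = field(default_factory=dict)

    def is_leaf(self) -> bool:
        return self.test is None


def descend(node: Node, sample: dict) -> Set[str]:
    """Follow test results from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.branches[node.test(sample)]
    return node.answers


# Hypothetical tree: one width test splits {"red", "blue"} into two subsets.
root = Node(
    answers={"red", "blue"},
    test=lambda s: "wide" if s["width"] > 3 else "narrow",
    branches={"wide": Node({"red"}), "narrow": Node({"blue"})},
)
```

Calling `descend(root, {"width": 5})` walks the "wide" branch and returns the answers stored at that leaf.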
8. Nearest neighbour approach
Example: To see how decision trees are useful for nearest neighbour
calculations, consider 8 blocks of known width, height and colour
(Winston, 1992). A new block then appears of known size but unknown
colour. On the basis of existing information, can we make an informed
guess as to what the colour of the new block is?
Solution: To answer this question, we need to assume a consistency
heuristic, as follows. Find the most similar case, as measured by known
properties, for which the property is known; then guess that the unknown
property is the same as the known property. This is the basis of all
nearest neighbour calculations.
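The consistency heuristic can be sketched in a few lines. The eight block measurements and colours below are invented for illustration (the slide's own block data is not given), and Euclidean distance is an assumed similarity measure.

```python
import math

# Hypothetical blocks: (width, height, colour). Values are illustrative only.
blocks = [
    (1.0, 2.0, "red"),
    (2.0, 6.0, "orange"),
    (2.0, 4.0, "yellow"),
    (4.0, 5.0, "purple"),
    (5.0, 6.0, "red"),
    (6.0, 1.0, "violet"),
    (4.0, 2.0, "blue"),
    (6.0, 4.0, "green"),
]


def guess_colour(width: float, height: float) -> str:
    """Return the colour of the most similar known block.

    Similarity is measured on the known properties (width, height);
    the unknown property (colour) is copied from the nearest case.
    """
    nearest = min(blocks, key=lambda b: math.dist((width, height), (b[0], b[1])))
    return nearest[2]
```

For a new block of width 1.0 and height 3.9, the closest known block is the one at (2.0, 4.0), so the guess is that block's colour.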
12. Clustering
• Clustering follows the principles of nearest neighbour
calculations but attempts to look at all the attributes
(positions) of biosequences rather than just one
attribute (position) for identifying similarities.
• This is achieved typically by averaging the amount of
similarity between two biosequences across all
attributes.
• For example, imagine that we have a table of
information concerning four organisms with five
characteristics:
13.
• Given this table, can we calculate how similar each organism is to every other
organism?
• The nearest neighbour approach described earlier would work through the
attributes ('characteristics') one at a time. For short biosequences this may be
feasible, but for biosequences with hundreds of attributes (e.g. DNA bases) this is
not desirable, since we could probably classify all the samples with just the first
few attributes.
14. Clustering can be demonstrated in the following way:
• The first step is to calculate a simple matching coefficient for
every pair of organisms in the table across all attributes.
• For instance, the matching coefficient for A and B is the
number of identical characteristics divided by the total
number of characteristics:
• A and B = 0.8 ((1+0+1+1+1)/5 = 4/5 = 0.8). Similarly,
• A and C = 0.4 ((0+0+0+1+1)/5 = 2/5 = 0.4)
• A and D = 0.2 ((0+0+0+0+1)/5 = 1/5 = 0.2)
• B and C = 0.6 ((0+1+0+1+1)/5 = 3/5 = 0.6)
• B and D = 0.4 ((0+1+0+0+1)/5 = 2/5 = 0.4)
• C and D = 0.8 ((1+1+1+0+1)/5 = 4/5 = 0.8)
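The pairwise calculation can be sketched as follows. The organism table itself is not reproduced in the text, so the binary table below is a hypothetical stand-in, chosen only because it yields the same match patterns and coefficients as listed above.

```python
from itertools import combinations

# Hypothetical binary characteristics (assumption: not the slide's real table).
table = {
    "A": (1, 1, 1, 1, 1),
    "B": (1, 0, 1, 1, 1),
    "C": (0, 0, 0, 1, 1),
    "D": (0, 0, 0, 0, 1),
}


def simple_matching(x, y):
    """Number of identical characteristics divided by the total number."""
    matches = sum(a == b for a, b in zip(x, y))
    return matches / len(x)


# One coefficient per unordered pair of organisms.
coefficients = {
    (p, q): simple_matching(table[p], table[q])
    for p, q in combinations(table, 2)
}

for pair, value in coefficients.items():
    print(pair, value)
```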
15.
• We then find the highest matching coefficient to form the first 'cluster' of
similar organisms. Since we have two candidates here (AB and CD both have
0.8), we randomly choose one cluster to start the process: AB.
• The steps are then repeated, using AB as one 'organism' and taking partial
matches into account.
• The average matching coefficient for AB and C = 0.5
((0+0.5+0+1+1)/5 = 2.5/5 = 0.5), where the 0.5 in the second position
refers to C sharing its second feature with B but not A.
• The matching coefficient for AB and D = 0.3 ((0+0.5+0+0+1)/5 = 1.5/5 = 0.3)
and for C and D = 0.8 (as before).
• Since C and D have the highest coefficient, they form the second cluster.
16. Finally, we calculate the average matching coefficient for the new 'clusters' of
organisms, taking AB as one organism and CD as another organism = 0.4
((0+0.5+0+0.5+1)/5 = 2/5 = 0.4),
again taking partial matches into account. We can then construct a similarity tree
from these coefficients: A and B join at 0.8, C and D join at 0.8, and the two
clusters AB and CD join at 0.4.
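The whole merging procedure can be sketched as a simple agglomerative loop. Two assumptions are made here: the binary table is a hypothetical stand-in that yields the coefficients quoted in the slides, and the similarity between two clusters is taken as the mean of the pairwise coefficients between their members, which gives the same numbers as the partial-match sums (0.5, 0.3 and 0.4). Ties are broken by taking the first pair found rather than at random.

```python
from itertools import combinations, product

# Hypothetical binary characteristics (assumption: not the slide's real table).
table = {
    "A": (1, 1, 1, 1, 1),
    "B": (1, 0, 1, 1, 1),
    "C": (0, 0, 0, 1, 1),
    "D": (0, 0, 0, 0, 1),
}


def match(x, y):
    """Simple matching coefficient between two organisms."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def cluster_similarity(c1, c2):
    """Mean pairwise coefficient between the members of two clusters."""
    pairs = list(product(c1, c2))
    return sum(match(table[p], table[q]) for p, q in pairs) / len(pairs)


def agglomerate(names):
    """Repeatedly merge the two most similar clusters; return the merge history."""
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            s = cluster_similarity(c1, c2)
            if best is None or s > best[0]:  # strict '>' keeps the first of any tie
                best = (s, c1, c2)
        s, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, round(s, 3)))
    return merges


merges = agglomerate(["A", "B", "C", "D"])
for left, right, score in merges:
    print("".join(left), "+", "".join(right), "at", score)
```

The merge history is effectively the similarity tree read from the leaves upwards: each entry records which two clusters joined and at what coefficient.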
17. Identification tree
The task now is to determine which of the attributes contribute towards someone
being sunburned or not. First, we need to introduce a disorder formula and
associated log values to rank attributes in terms of their influence on who is and
who isn’t sunburned.
18. In the disorder formula, n_b is the number of samples in branch b, n_t is the total
number of samples in all branches, and n_bc is the number of samples in branch b of class c.
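The formula these symbols belong to does not appear in the text. Assuming it is the average disorder measure usually given for identification trees (Winston, 1992), it can be written as:

```latex
\[
\text{Average disorder}
  = \sum_{b} \left( \frac{n_b}{n_t} \right)
    \times
    \left( \sum_{c} - \frac{n_{bc}}{n_b} \log_2 \frac{n_{bc}}{n_b} \right)
\]
```

The inner sum is the disorder of a single branch b over its classes c; the outer sum weights each branch by its share of all samples.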
• The idea is to divide samples into subsets in which as many of the samples have
the same classification as possible (as homogeneous subsets as possible). The
higher the disorder value, the less homogeneous the classification.
• We now work through each attribute in turn, identifying which of the samples fall
within the branches (attribute values) of that attribute, and noting into which
class each of the samples falls.
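Ranking attributes by disorder can be sketched as below. The code assumes the usual average disorder measure: for each branch, the entropy of its class proportions, weighted by the branch's share of all samples. The eight samples are hypothetical, chosen to be consistent with the hair colour and lotion rules derived later in the deck; they are not the deck's own table.

```python
import math
from collections import Counter, defaultdict

# Hypothetical samples (assumption: illustrative, not the slide's data).
samples = [
    {"hair": "blond", "lotion": "no", "result": "sunburned"},
    {"hair": "blond", "lotion": "yes", "result": "none"},
    {"hair": "brown", "lotion": "yes", "result": "none"},
    {"hair": "blond", "lotion": "no", "result": "sunburned"},
    {"hair": "red", "lotion": "no", "result": "sunburned"},
    {"hair": "brown", "lotion": "no", "result": "none"},
    {"hair": "brown", "lotion": "no", "result": "none"},
    {"hair": "blond", "lotion": "yes", "result": "none"},
]


def average_disorder(samples, attribute, target="result"):
    """Weighted sum over branches of the entropy of the class proportions."""
    branches = defaultdict(list)
    for s in samples:
        branches[s[attribute]].append(s[target])  # group classes by branch value
    n_t = len(samples)
    total = 0.0
    for classes in branches.values():
        n_b = len(classes)
        entropy = -sum(
            (n_bc / n_b) * math.log2(n_bc / n_b)
            for n_bc in Counter(classes).values()
        )
        total += (n_b / n_t) * entropy
    return total


hair = average_disorder(samples, "hair")
lotion = average_disorder(samples, "lotion")
print(f"hair: {hair:.2f}, lotion: {lotion:.2f}")
```

On this illustrative data, hair colour has the lower disorder, so it would be chosen as the first test; only the blond branch remains mixed and needs a further test.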
24. Given the full identification tree, we can then derive rules by following
all paths from the root to the leaf nodes, as follows:
• (a) If a person's hair colour is brown, then the person is not
sunburned.
• (b) If a person's hair colour is red, then the person is
sunburned.
• (c) If a person's hair colour is blond and that person has used
sun tan lotion, then the person is not sunburned.
• (d) If a person's hair colour is blond and that person has not
used sun tan lotion, then the person is sunburned.
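Rules (a) to (d) translate directly into code, one branch per root-to-leaf path. The function name and argument names below are hypothetical, chosen for illustration.

```python
def is_sunburned(hair_colour: str, used_lotion: bool) -> bool:
    """Apply rules (a)-(d); each branch is one path through the tree."""
    if hair_colour == "brown":   # rule (a)
        return False
    if hair_colour == "red":     # rule (b)
        return True
    if hair_colour == "blond":   # rules (c) and (d): lotion decides
        return not used_lotion
    # The rules only cover the hair colours seen in the samples.
    raise ValueError(f"no rule for hair colour {hair_colour!r}")
```

Note that lotion use is consulted only on the blond path, mirroring the tree: the brown and red branches are already homogeneous after the first test.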
25. Major Challenges and Research Issues
• Requires individuals with knowledge of both
disciplines
• Requires collaboration of individuals from diverse
disciplines
26. Major Challenges and Research Issues
• Data generation in biology/bioinformatics is
outpacing methods of data analysis
• Data interpretation and generation of
hypotheses require intelligence
• AI offers established methods for knowledge
representation and “intelligent” data
interpretation