Novel algorithms for detection of unknown chemical molecules with specific biological activities

1 / 47
NOVEL ALGORITHMS FOR DETECTION OF
UNKNOWN CHEMICAL MOLECULES WITH
SPECIFIC BIOLOGICAL ACTIVITIES
By
Ahmed Hassan Ahmed
Supervisors
Prof. Dr. Effat A. Saied
Prof. Dr. Aboul Ella Hassanien

2 / 47
Agenda
• Introduction
– Chemo-informatics
– Drug discovery and its cycle
– Molecular representation
• Problem Statement
• The proposed approaches
• Quantitative structure-activity relationships approach
• Graph algorithms-based approach
• Coding System
• A new atoms similarity algorithm
• A new paths of stars algorithm
• Conclusion and future work

3 / 47
Introduction
Chemo-informatics
Chemistry
Problems
Information
Problems
Cheminformatics is the use of
computer and informational
techniques applied to a range of
problems in the field of
chemistry.
Chemoinformatics / cheminformatics
can be broadly defined as the field of
solving chemical problems with
computers.

4 / 47
Introduction
Drug Discovery and its cycle
• Drug discovery is the process through which potential new
medicines are identified.
https://en.wikipedia.org/wiki/Drug_discovery

5 / 47
Introduction
Molecular Activity and Similarity
• Similarity principle:
• Structurally similar
molecules tend to have
similar properties
• Properties: biological activity,
solubility, color and so on.

6 / 47
Introduction
Molecular Representation
• The structure of a molecule can
be represented by a graph
• Graph = collection of nodes and
edges, nodes and edges have
properties (atomic number,
bond order).
• The molecular graph can be
represented by a connection table,
which describes the relationships
between nodes.
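
As a minimal sketch of the representation described above, a connection table can be held as a list of atoms plus a list of bonds. The molecule (the heavy atoms of ethanol), the variable names, and the helper `build_adjacency` are illustrative assumptions, not taken from the thesis.

```python
# Hypothetical connection table for the heavy atoms of ethanol (C-C-O).
# Nodes carry a property (element symbol); edges carry a property (bond order).
atoms = {1: "C", 2: "C", 3: "O"}
bonds = [(1, 2, 1), (2, 3, 1)]  # (atom id, atom id, bond order)


def build_adjacency(atoms, bonds):
    """Turn a connection table into an adjacency list of (neighbor, bond order)."""
    adjacency = {atom_id: [] for atom_id in atoms}
    for a, b, order in bonds:
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    return adjacency


adjacency = build_adjacency(atoms, bonds)
for atom_id, neighbors in adjacency.items():
    print(atom_id, atoms[atom_id], neighbors)
```

The adjacency list is only one convenient in-memory form; the connection table itself is the pair of atom and bond lists.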

7 / 47
Problem Statement
• The development of a new
chemical drug is still a
challenging, time-consuming
and costly process.
• This thesis aims to provide
new approaches to reduce the
time and the cost of chemical
compounds classification.
Thousands of candidate compounds → hundreds of candidate compounds → approved drug

8 / 47
The Proposed Approaches
• Molecular descriptors and quantitative structure-activity
relationships (QSAR model)
• Predicting biological activity of 2,4,6-trisubstituted 1,3,5-triazines using random
forest, in: Proceedings of the 5th International Conference on Innovations in Bio-
Inspired Computing and Applications (IBICA 2014), 2014, pp. 101–110.
• Graph Algorithms
• A new atoms similarity algorithm
• Predicting activity approach based on new atoms similarity kernel function. Journal of
Molecular Graphics and Modelling, Volume 60, July 2015, Pages 55 – 62. (IF 1.754)
• A new paths of stars algorithm
• Two-class support vector machine with new kernel function based on paths of features for
predicting chemical activity. Information Sciences, Volumes 403–404, September
2017, Pages 42-54. (IF 4.832)

9 / 47
Quantitative Structure-activity
Relationships Approach
• QSARs are statistical models
correlating one or more pieces of
response data about chemicals
(the biological activity of a
chemical compound) with the
information numerically encoded
in the form of descriptors
(structural descriptors).

10 / 47
Quantitative Structure-activity Relationships Approach
Molecular Descriptors
• Molecular descriptors are numerical values that characterize properties
of molecules.
Molecular
descriptors
2.54
• Topological descriptors.
• Topological indices are global features that derive information from the
adjacency matrix of a molecular graph.
• An example of a topological descriptor is the Wiener index.
• The basic implementation of this topology-based descriptor uses the
information contained in the shortest-distance matrix M.
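
A short sketch of the Wiener index follows. It uses the standard definition (the sum of shortest-path distances over all unordered pairs of atoms in the hydrogen-suppressed graph). The two example graphs and the function names are illustrative assumptions.

```python
from collections import deque


def shortest_distances(adjacency, source):
    """Breadth-first search: bond-count distance from source to every atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered atom pairs."""
    total = 0
    for source in adjacency:
        total += sum(shortest_distances(adjacency, source).values())
    return total // 2  # every pair was counted twice


# Carbon skeletons only (hydrogens suppressed), as adjacency lists.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same atoms but different indices, which is why such a descriptor carries structural information.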

11 / 47
QSAR For Predicting Biological Activity Of 2,4,6-trisubstituted
1,3,5-triazines
• A QSAR model is presented for predicting
the biological activity of 2,4,6-trisubstituted
1,3,5-triazines as cannabinoid receptor
(CB2) agonists using random forest (RF),
decision tree (DT), and support vector
machine (SVM).
• The endocannabinoid system plays an
important role in the nervous system.
• A vast variety of diseases and conditions
have been associated with alterations in this
neurotransmitter system.

12 / 47
Dataset Characteristics And Descriptors
• An investigation to find analogs of
2,4,6-trisubstituted 1,3,5-triazines and
test them as CB2 agonists.
• A dataset containing 58 compounds
and their CB2 agonist activity was
presented.
* Yrjölä, S., Kalliokoski, T., Laitinen, T., Poso, A., Parkkari, T., Nevalainen, T., Discovery
of novel cannabinoid receptor ligands by a virtual screening approach: Further
development of 2,4,6-trisubstituted 1,3,5-triazines as CB2 agonists. Eur. J. Pharm.
Sci. 48(1-2), 9–20 (2013)

13 / 47
Model Design
1- Molecular descriptors
• Twenty molecular descriptors were computed for each chemical
compound.
2- Redundant information
• A pairwise correlation analysis was performed on the descriptors.
• For any two descriptors with r ≥ 0.97, one of the two was excluded to
reduce redundant information.
3- Training and test
• The remaining descriptors were used to train RF to predict the activity
of each compound; for comparison, DT and SVM were trained on
the same data.
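
The redundancy step can be sketched as below, assuming a greedy pass that keeps a descriptor only if its Pearson correlation with every descriptor already kept is below the threshold. The descriptor names and values are made-up toy data, and the choice of which member of a correlated pair to drop (the later one) is an assumption; the slides do not say.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sd_x * sd_y)


def drop_redundant(descriptors, threshold=0.97):
    """Keep a descriptor only if r < threshold against every kept descriptor."""
    kept = []
    for name, values in descriptors.items():
        if all(pearson(values, descriptors[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table: one list of values per descriptor.
toy = {
    "mw": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mw_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with "mw"
    "logp": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_redundant(toy))  # ['mw', 'logp']
```

The surviving columns would then be passed to whichever classifier is being trained.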

14 / 47
Results
Random forest has the best prediction accuracy, with an overall accuracy of 100%.
Decision tree has an overall prediction accuracy of 93%.
Support vector machine has the worst prediction accuracy, with an overall accuracy
of 67%.

15 / 47
Approach Limitation
• Time-consuming
• Computation resources
• Does not provide sub-structure
information
• Which descriptors to use?
(Molecular Descriptors for
Chemo-informatics: Volume I
contains more than 3300
descriptors)

16 / 47
Graph Algorithms-Based Approach
• A new encoding system is presented.
• The encoding system is used to find relationships
between chemical compounds.
• Two new algorithms based on kernel functions are
presented to solve the activity prediction problem for
chemical compounds.
• The proposed algorithms were compared with many
other classification methods, and the results show
accuracy competitive with these methods.

17 / 47
Treelets
• The proposed algorithms were inspired by the enumeration of subtrees of
graphs with up to 6 nodes.
• The resulting patterns are called treelets.
* B. Gaüzère, L. Brun, D. Villemin, Two new graphs kernels in chemoinformatics,
Pattern Recogn. Lett. 33 (15) (2012) 2038–2047.

18 / 47
Relation Between Treelets and Stars Shape
One star contains 6
treelets
Path of two stars contains 4
treelets

19 / 47
Atoms Coding System
• To encode atoms we create a table
with 5 rows and 118 columns (as the
atoms can be labeled by each of the
118 chemical elements).
• The numbers in the first row represent
central atom type.
• The numbers in each of the other four
rows represent one type of chemical
bond (single, aromatic, double, or
triple).
• Numbers in that table are prime
numbers.

20 / 47
Coding Example
• The given atom is encoded as the
following:
• Mapping each atom to a
corresponding prime number
from the given table.
• The product of these prime
numbers represents the code of
the atom which equals 2 × 3 ×5
× 17 = 510.
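
The coding idea can be sketched as follows. The actual table from the slides is not reproduced here, so this sketch assumes a hypothetical layout that fills the 5 × 118 table row by row with the first 590 primes; it therefore does not reproduce the value 510 from the example, only the mechanism. The names `TABLE`, `BOND_ROWS`, and `atom_code` are illustrative.

```python
N_ELEMENTS = 118
# Row 0 holds central-atom primes; rows 1-4 hold one bond type each.
BOND_ROWS = {"single": 1, "aromatic": 2, "double": 3, "triple": 4}


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


# Assumed layout: consecutive primes, row by row (not the thesis's table).
_PRIMES = first_primes(5 * N_ELEMENTS)
TABLE = [_PRIMES[r * N_ELEMENTS:(r + 1) * N_ELEMENTS] for r in range(5)]


def atom_code(center_z, neighbors):
    """Multiply the central atom's prime by one prime per (bond type, neighbor)."""
    code = TABLE[0][center_z - 1]
    for bond, neighbor_z in neighbors:
        code *= TABLE[BOND_ROWS[bond]][neighbor_z - 1]
    return code


# A carbon (Z=6) with a single-bonded hydrogen (Z=1) and a double-bonded oxygen (Z=8).
a = atom_code(6, [("single", 1), ("double", 8)])
b = atom_code(6, [("double", 8), ("single", 1)])  # same neighbors, other order
c = atom_code(6, [("single", 8), ("double", 1)])  # bond types swapped
print(a == b, a == c)  # True False
```

Because every table entry is a distinct prime, the product does not depend on the order of the neighbors, and changing any neighbor or bond type changes the code.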

21 / 47
Unique Codes
• If two atoms have the same code, they must be
of the same type and have the same neighbors,
because a product of primes has a unique
factorization.
• These atoms not only have the same neighbors
but are also connected to their neighbors
by the same types of chemical bonds.

22 / 47
Paths Of The Feature-coding System
• The path of features is an ordered set of consecutive features.
• Suppose there are n stars in a graph with star codes C1,C2, …,Cn.
• A path of stars of length m in the graph, which consists of an ordered set of
star codes, is given by

23 / 47
A New Atoms Similarity Algorithm

24 / 47
Atoms Relationship Matrix
• S is a square matrix whose
size equals the number of
unique star subgraphs in the
dataset.
• Each element at position i, j
in S represents the number
of common connected
subgraphs between star
subgraphs i and j.

25 / 47
Union Graph
• Let gi be a chemical compound; it has a corresponding list of
atom codes Ci, where Ci is defined by Eq. (5):
• For any two chemical compounds gi, gj we define a union atom codes
list C^U_{i,j} between their atom codes lists Ci and Cj by Eq. (6):

26 / 47
Similarity Vector
• We denote by g^s_{i,j} the similarity between two chemical compounds gi and gj.
• The similarity vector v^s_i for a given chemical compound gi, which belongs to a
dataset D of chemical compounds, is the vector that contains the similarity
between gi and each gj ∈ D.
• Here h is the size of the dataset D.

27 / 47
Kernel Function
• Two kernel functions are proposed.
• The first one uses the similarity vectors as features for a polynomial kernel
function of order two; it is called the atom similarity kernel Ks.
• The second kernel function uses the outputs of Eq. (7) as kernel matrix entries;
it is called the direct kernel Kd.
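
A rough sketch of the first kernel follows, assuming the common polynomial form (gamma · ⟨u, v⟩ + coef0)² with gamma = 1 and coef0 = 1. The slides state only "polynomial kernel of order two", so these constants are assumptions, and the pairwise similarity values below are made-up toy numbers standing in for the outputs of Eq. (7), which is not reproduced here.

```python
def poly2_kernel(u, v, gamma=1.0, coef0=1.0):
    """Order-two polynomial kernel on two similarity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** 2


def kernel_matrix(vectors):
    """Gram matrix: kernel value for every pair of compounds."""
    return [[poly2_kernel(u, v) for v in vectors] for u in vectors]


# Hypothetical pairwise similarities for a dataset of h = 3 compounds.
# Row i is the similarity vector of compound gi against every gj in D.
similarity = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]

K = kernel_matrix(similarity)
print(K[0][1])  # 4.0
```

Under the same assumptions, the direct kernel would instead use the `similarity` matrix itself as the Gram matrix. Either matrix could then be handed to an SVM implementation that accepts precomputed kernels.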

29 / 47
Datasets Characteristics
• Two experimental evaluations were carried
out using two chemical compound datasets:
• The monoamine oxidase (MAO) dataset,
which is composed of 68 molecules: 38
molecules inhibit monoamine oxidase
(antidepressant drugs) and 30 do not.
• The AIDS Antiviral Screen Database of
Active Compounds, which is composed of
2000 chemical compounds that have been
screened as active or inactive against HIV.
                      MAO     AIDS
Size                  68      2000
No. atoms (mean)      18.4    15.7
Atoms degree (mean)   2.1     2.1
Smallest              11      2
Largest               27      95

30 / 47
Prediction Accuracy for (MAO) Dataset

31 / 47
Prediction Accuracy for (AIDS) Dataset

32 / 47
Complexity and Limitation
• For one graph, the proposed algorithm needs O(nd),
whereas the treelets kernel needs O(nd^5).
• Accuracy is affected by the number and type of features.

33 / 47
A New Paths Of Stars
Algorithm

34 / 47
A New Paths Of Stars Algorithm
Path Of Stars Main Algorithm

35 / 47
Star Relationship Matrices
• The relationship between stars is
represented by two matrices:
• The first matrix is the common sub-stars
matrix, denoted by MC.
• The second matrix is the difference
sub-stars matrix, denoted by MD.

36 / 47
Star Relationship Matrices
• The first matrix, MC, represents the greatest common connected
subgraph between two stars.
• The second matrix MD contains the greatest connected subgraph that
exists in Si and not in Sj.

37 / 47
Star Relationship Matrices Example
MC MD

38 / 47
Paths of Feature Relationship Matrices

39 / 47
Paths of Feature Relationship Matrices Example

40 / 47
Relationship Vectors and Kernel Functions
• Similarity vectors
• Distance vectors
• The first kernel Ksim uses the similarity vectors
• The second kernel Kdis uses the distance vectors

42 / 47
Prediction Accuracy for (MAO) Dataset

43 / 47
Prediction Accuracy for (AIDS) Dataset

44 / 47
Features Reduction
MAO
• In the proposed algorithm, only 76
paths of stars are counted.
• The proposed algorithm, combined
with the distance kernel Kdis, has
an accuracy of 94%.
• This accuracy is similar to the accuracy
of the treelets kernel function, which
counts 153 different treelets.
AIDS
• In the proposed algorithm, only 4046
paths of stars are counted.
• The proposed algorithm, combined
with the distance kernel Kdis,
has an accuracy of 99.5%.
• This accuracy is greater than the
accuracy of the treelets kernel function,
which counts 4875 different treelets.

45 / 47
Complexity
• For one graph, the proposed algorithm needs O(nd^2), whereas the treelets
kernel needs O(nd^5).

46 / 47
Conclusion & Future Works
• Two approaches were proposed in this thesis:
• The quantitative structure-activity relationship (QSAR) approach, which was tested with three
classifiers: RF, DT, and SVM.
• The approach was tested for predicting the biological activity of CB2 receptor agonists, and the result shows
that RF has the best prediction accuracy, followed by DT; the lowest accuracy was obtained with
SVM.
• The graph algorithms approach
• A new coding system based on prime numbers.
• A new atoms similarity algorithm is proposed with complexity O(nd) and tested on:
• The MAO dataset, with accuracy 91%, improved by feature selection to 98.5%.
• The AIDS dataset, with accuracy 99.2%, improved by feature selection to 99.4%.
• A new paths of stars algorithm is proposed with complexity O(nd^2) and tested on:
• The MAO dataset, with accuracy 94% and the features reduced from 153 to 76.
• The AIDS dataset, with accuracy 99.5% and the features reduced from 4875 to 4046.
• The proposed algorithms can be further upgraded to improve their performance and
accuracy.
• This modification can be done by using additional types of features in the paths rather than
star-type features only.
• These features can be chosen to be 3D instead of 2D.

47 / 47
Publications
Two-Class Support Vector Machine with New Kernel
Function Based on Paths of Features for Predicting
Chemical Activity, Information Sciences, Volumes
403–404, September 2017, Pages 42–54 (IF = 4.838 &
SJR = 2.513)
http://www.sciencedirect.com/science/article/pii/S0020025517306448
Predicting activity approach based on new
atoms similarity kernel function, Journal of
Molecular Graphics and Modelling, Vol. 60,
pp. 55–62, 2015 (IF = 1.674 & SJR = 0.480)
http://www.sciencedirect.com/science/article/pii/S1093326315300061
