Machine Learning Applications in Systems Biology
Natasha Alves, M.A.Sc. Candidate, ECE
Manuscript received November 1, 2004

Abstract: Recent advances in high-throughput technologies have led to
an immense flow of biological data. Extracting the information hidden
in the ever-expanding biological databases has been an obstacle in the
progress of systems biology. Machine Learning has proved to be an
efficient and inexpensive approach to organizing data; developing new
tools to analyze data; and discovering new knowledge from data. This
paper introduces Machine Learning techniques like inductive logic
programming, clustering, Bayesian networks, and decision trees in the
context of their applications in systems biology. The shortcomings of
these Machine Learning techniques are also addressed.

Index Terms: Artificial Intelligence, Bayesian Networks, Clustering,
Decision Trees, Inductive Logic Programming, Machine Learning, Systems
Biology.

I. INTRODUCTION
Systems Biology is an in-depth, systems-level analysis of biological
systems grounded on the molecular level [1]. It is different from other
methods of biological study where the focus is on the characteristics
of isolated parts of a cell or organism. Systems biology examines the
structure and dynamics of cellular and organism functions, and their
interconnections and interrelationships. One ultimate goal of systems
biology is to use the knowledge of the complete genome sequence and all
proteins encoded by that genome to reconstruct the biological systems
that are implied [2].
The development of systems biology is driven by technology.
Sophisticated computational techniques are needed to analyze biological
systems because of the complexity and dynamics involved. Machine
Learning, which is an automatic and intelligent learning technique, has
long been used to discover meaningful associations between proteins,
and for scientific hypothesis formation [3].
The aim of this paper is to introduce Machine Learning techniques in
the context of their application in systems biology.

II. CHALLENGES IN SYSTEMS BIOLOGY
Much of our failure to fully understand biological systems has been due
to their size and complexity. Systems biology emphasizes large-scale
discovery of the interactions of genes, proteins, and other cell
elements. It is confronted with dynamic biological responses, a huge
number of interactions, and inherent redundancy in many pathways and
feedback systems.
A lot of useful and important information about biological systems is
hidden in high volumes of experimental data. For instance, there are
37 billion bases of DNA in about 32 million sequence records in GenBank
alone (Feb. 2004) [12]. Analyzing high volumes of data to understand
biological systems demands tedious experimentation and modern
computational technology. This is the grand challenge for systems
biology in this era.
An intelligent approach is needed to extract the hidden information
from the data and to cope with the rapid rate of data deposition.

III. MACHINE LEARNING
Machine Learning (ML) is the capability of computer algorithms to
improve automatically through experience (i.e. the computer programs
itself by seeing examples of the behavior we want). ML approaches are
ideally suited for domains characterized by the presence of large
amounts of data, noisy patterns and the absence of general theories
[4]. The fundamental idea behind these approaches is to learn the
theory automatically from the data through a process of inference and
model fitting. A system that can learn from experience and improve its
performance automatically could serve as a tool for analyzing
biological systems.
The main goal of ML is to induce general functions from a specific
training data set. The learning agent is given a set of training
examples, and it defines the hypothesis for them. The agent must search
through the hypothesis space and locate the best hypothesis when given
the test set [5].
Because ML is concerned with learning from data examples, it often uses
a probabilistic approach.

IV. OVERCOMING THE CHALLENGES IN SYSTEMS BIOLOGY
ML approaches have gained popularity in systems biology. The
characteristics of ML that make it well suited for systems biology are:
1. Many problems in biological systems are not well defined, but have
   a lot of experimental data. ML is useful when the structure of the
   task is not well understood but the task can be characterized by a
   data set with strong statistical regularity. While input/output
   pairs can be easily specified, the relationship between the inputs
   and outputs is often unknown (e.g. the protein folding mechanism).
   ML approaches can extract relationships and correlations hidden under
   large volumes of data (data mining). They can thus extract the
   information encoded in biological databases and use the available
   data to predict meaningful biological properties.
2. ML approaches can adjust their internal structure to produce correct
   outputs for a large number of sample inputs. They can thus constrain
   their input/output function to approximate the implicit relationship
   in the training examples [6].
3. ML approaches adapt themselves to new information (training
   examples). This is important in systems biology because new data are
   generated every day. The newly generated data might update the
   initial learning hypotheses.
ML thus provides efficient approaches to analyze biological data.

V. MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY
A variety of ML techniques can be used to address many of the problems
in systems biology. In a systems biology context, ML is used to
discover meaningful knowledge from existing biological databases and to
present that knowledge in an understandable pattern. The tasks of ML in
systems biology can be divided into seven categories as shown in
Table I [5]. These techniques, operating individually or in
combination, can meet the various challenges in systems biology.

TABLE I
MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY [5]
No.  Application          Description
1    Classification       Predicting an item's class.
2    Forecasting          Predicting a parameter value.
3    Clustering           Finding groups of items.
4    Description          Describing a group.
5    Deviation Detection  Finding changes.
6    Link Analysis        Finding relationships.
7    Visualization        Presenting data visually to facilitate human
                          discovery.

Machine learning approaches to protein structure prediction and gene
pathway discovery are examined in the following sections.

A. Protein Structure Prediction
Proteins are the essence of life. The secondary structure of a protein
consists of α-helices, β-strands and coils. The folding of these
secondary structure elements forms the unique 3D structure of a
protein. A lot of useful information is contained in this 3D structure.
Predicting protein structure, however, remains a central problem in
bioinformatics. It is the bottleneck between sequencing efforts and
drug design. ML approaches like Inductive Logic Programming can be used
to predict protein structure.

1) Inductive Logic Programming
Inductive logic programming (ILP) is a research area formed at the
intersection of ML and Logic Programming. ILP systems develop predicate
descriptions from observations and background knowledge. There are
three main elements in an ILP learning system: observations, background
knowledge, and hypothesis [5]. Each of these elements of ILP is a logic
program. Fig. 1 shows the general scheme for ILP methods. Observations
and background knowledge are combined by an ILP program to form a
hypothesis. A set of IF-THEN rules can then be derived from the
hypothesis. For example:
Hypothesis:
   fold('Four-helical up-and-down bundle',P) :-
      helix(P,H1), length(H1,hi), position(P,H1,Pos),
      interval(1 =< Pos =< 3), adjacent(P,H1,H2),
      helix(P,H2).
Rule: The protein P has fold class 'Four-helical up-and-down bundle' if
it contains a long helix H1 at a secondary structure position between 1
and 3, and H1 is followed by a second helix H2 [5].
The rules are tested on additional data. If experimentation leads to
high confidence in the validity of the hypothesis, the hypothesis is
added to the background knowledge.

Figure 1. Scheme for ILP Methods [7]

ILP has been used for protein structure prediction. Muggleton et al.
implemented ILP by separating the data set of proteins into groups of
the same type of domain structure (e.g. α-type domains). This allowed
the system to have a more homogeneous data set, thus allowing better
prediction [7]. The ILP program used in this method was Golem. The
basic algorithm was as follows:
1. Take a random sample of pairs of residues from the training set.
   This represents a set of pairs of residues chosen randomly from the
   set of all residues in all proteins represented.
2. Compute all the common properties for each pair of residues.
3. Convert the common properties into a rule that is true for the
   residue pair under consideration.
4. Choose the rule for the best residue pair. For example, choose the
   rule that predicts the most true α-helix residues while predicting
   fewer than a pre-defined threshold of non-α-helix residues from the
   training set.
5. Take another sample of unpredicted residue pairs.
6. Form rules which express the common properties of the best pair
   together with each of the individual residue pairs in the sample.
7. Repeat steps 4-6 until no improvement in prediction is produced.
The algorithm uses the best rule to eliminate a set of predicted
residues from the training set. The reduced training set is then used
to build up further rules. The process terminates when no further rules
can be found.
Golem produced an accuracy of about 81% when applied to 16 proteins
with α-type domains.
The disadvantage of ILP is the lack of probability in its rules.
Biological systems are characterized by a high degree of uncertainty;
thus, the hypotheses will have a higher descriptive power if they
incorporate a certain degree of probability [5].
To date, ML methods cannot, by themselves, completely describe a new
protein's structure; however, they can provide valuable information
regarding numerous structural attributes.

B. Gene Pathway Discovery
Systems biology seeks to discover causal relationships among a large
number of genes and other cellular constituents. From a system-level
point of view, the various interactions and control loops, which form a
genetic network, represent the basis upon which the vast complexity and
flexibility of life processes emerge.
ML techniques like clustering, Bayesian networks and decision trees can
be used to discover gene regulation pathways.

1) Gene Clustering
Clustering is a discovery approach that organizes and identifies
subsets of data and groups them into classes. Each class represents
data with similar attributes. A derivative clustering algorithm can
also be used to predict and explain complex data.
Clustering algorithms are used to discover groups of genes that show
similar expression patterns under different experimental conditions. By
this procedure, different families of cell-cycle regulated genes in
baker's yeast, Saccharomyces cerevisiae, have been identified [8].
Gene clustering has several drawbacks. Firstly, the assignment of genes
to single clusters by most clustering methods potentially prevents the
exposure of complex interrelationships among genes. Secondly,
clustering does not always provide causal information. Genes sharing
similar expression profiles may not always share a function. Even when
similar expression levels correspond to similar functions, the
functional relationships among genes in a cluster cannot be determined
from the cluster data alone [9]. Conversely, a gene may be suppressed
to allow another to be expressed; thus, functionally related genes may
be clustered separately, blurring the existing relationship.
A system named GEEVE, introduced by Yoo and Cooper, uses gene
expression data to learn models of gene regulation and thus discover
causal gene pathways [10]. The GEEVE system, shown in Fig. 2, consists
of two modules: the causal Bayesian network update module, and the
decision tree generation and evaluation module.

Figure 2: The GEEVE system [10]

2) Causal Bayesian Networks
A Bayesian network is a directed, acyclic graph of nodes representing
variables and arcs representing dependencies among the variables. A
Bayesian network encodes the joint probability distribution over all
the variables. The joint distribution of a Bayesian network with N
variables can be factored as follows:

   P(x_1, x_2, \ldots, x_N \mid K) = \prod_{i=1}^{N} P(x_i \mid \pi_i, K),   (1)

where x_i is the state of variable X_i, \pi_i is a joint state of the
parents of X_i, and K denotes background knowledge [10].
Bayesian networks are capable of handling incomplete data sets, and are
able to learn and predict the missing data. They also provide models of
causal influence. These properties make Bayesian networks a promising
tool for analyzing gene expression patterns.
In the context of genetic pathway inference, each node of a Bayesian
network is assigned to a gene, and can assume the different expression
levels of this gene throughout the training data. Each edge between the
nodes (genes) denotes a regulatory relationship between them. If the
edge is directed, as shown in Fig. 3, it denotes that one gene controls
the other. Fig. 4 shows the feature graph trained for a genetic
sub-network of baker's yeast.
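To make the factorization in (1) concrete, the following minimal Python sketch evaluates the joint probability of a small network as a product of per-node conditionals. Everything specific in it is an assumption made for illustration: the three genes G1, G2 and G3, the structure in which G1 regulates the other two, the two expression states, and all probability values are invented and are not taken from [10]. The background knowledge K is left implicit in the tables.

```python
from itertools import product

# Hypothetical network structure: G1 regulates G2 and G3 (illustrative only).
PARENTS = {"G1": (), "G2": ("G1",), "G3": ("G1",)}

# Invented conditional probabilities: P(gene = "high" | states of its parents).
P_HIGH = {
    "G1": {(): 0.3},
    "G2": {("low",): 0.2, ("high",): 0.9},
    "G3": {("low",): 0.6, ("high",): 0.1},
}

STATES = ("low", "high")


def conditional(gene, assignment):
    """Return P(x_i | pi_i) for one gene under a full assignment."""
    parent_states = tuple(assignment[p] for p in PARENTS[gene])
    p_high = P_HIGH[gene][parent_states]
    return p_high if assignment[gene] == "high" else 1.0 - p_high


def joint_probability(assignment):
    """Multiply the per-node conditionals, following the factorization (1)."""
    prob = 1.0
    for gene in PARENTS:
        prob *= conditional(gene, assignment)
    return prob


def all_assignments():
    """Enumerate every joint state of the three genes."""
    genes = list(PARENTS)
    for states in product(STATES, repeat=len(genes)):
        yield dict(zip(genes, states))


if __name__ == "__main__":
    example = {"G1": "high", "G2": "high", "G3": "low"}
    print(joint_probability(example))
    print(sum(joint_probability(a) for a in all_assignments()))
```

For the example assignment the product is 0.3 x 0.9 x 0.9, about 0.243, and the joint probabilities of all eight assignments sum to 1 up to floating-point rounding, which is a quick sanity check that the tables define a valid distribution. A real gene network would have many more nodes and would learn both the structure and the tables from expression data rather than fixing them by hand.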
Figure 3: The structure of a causal Bayesian network that represents a
portion of a hypothetical gene regulation pathway [10]

Figure 4: Genetic sub-networks of baker's yeast [11]

While Bayesian networks produce better results than rule-based learning
methods, there is no clear explanation of the learning process. It is
therefore hard to understand the results and to interpret them as
useful knowledge [5].

3) Decision Trees
The decision tree is a simple inductive learning system that uses
discrete-valued functions to estimate and classify the provided
training set. The system is represented by a tree whose internal nodes
are tests (boolean decisions) and whose leaf nodes are classes. The
tree can make predictions about the probability of a particular case
belonging to a particular class.
Decision trees can be used to model gene perturbation in experiments.
The GEEVE system, for example, builds and evaluates a decision tree
based on pair-wise gene relationships. Thus, the effects on gene X when
gene Y is perturbed can be modeled [11].
The drawbacks of decision trees are over-fitting of data and
overlapping in the classes. These and other factors make decision trees
difficult to optimize.

VI. CONCLUSION
Since ML primarily deals with the extraction of knowledge from data,
redundancy of data is an important issue facing ML. The quality of
biological data is usually compromised by experimental errors, wrong
interpretation by biologists, non-standardized experimental techniques,
etc. The uncertainty associated with experiment-based research is very
high.
Despite these challenges, ML techniques have contributed to the success
of systems biology in recent years. ML has helped accelerate research
in several areas of systems biology including protein structure
prediction, inference of genetic and molecular networks, and
gene-protein interactions.
The author believes that systems biology will continue to benefit from
ML techniques in coming years.

REFERENCES
[1] Kitano, H., "Looking beyond the details: a rise in system-oriented
    approaches in genetics and molecular biology", Curr. Genet.,
    Vol. 41(1), 2002, pp. 1-10.
[2] Lathrop, R., "Intelligent Systems in Biology: Why the Excitement?",
    IEEE Intelligent Systems, Vol. 16(6), 2001, pp. 8-13.
[3] Luke, S., Hamahashi, S., Kyoda, K., Ueda, H., "Biology: see it
    again-for the first time", IEEE Intelligent Systems, Vol. 13(5),
    1998, pp. 6-8.
[4] Hu, Y., Kibler, D., "Combinatorial motif analysis and hypothesis
    generation on a genomic scale", Bioinformatics, Vol. 16(3), 2000,
    pp. 222-32.
[5] Tan, A., Gilbert, D., "Machine Learning and its Application to
    Bioinformatics: An Overview",
    www.brc.dcs.gla.ac.uk/~actan/publications.html,
    Retrieved: Oct. 27, 2004.
[6] Nilsson, N., "Introduction to Machine Learning", unpublished,
    http://robotics.stanford.edu/people/nilsson/mlbook.html, 1996,
    Retrieved: Oct. 27, 2004.
[7] Muggleton, S., King, R., Sternberg, M., "Using logic for protein
    structure prediction", Proceedings of the 25th Hawaii Int. Conf. on
    System Sciences, IEEE Computer Society Press, 1992.
[8] Spellman, P.T., "Comprehensive Identification of Cell
    Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by
    Microarray Hybridization", Molecular Biology of the Cell, 1998,
    pp. 3273-3297.
[9] Shatkay, H., Edwards, S., Boguski, M., "Information retrieval meets
    gene analysis", IEEE Intelligent Systems, Vol. 17(2), 2002,
    pp. 45-53.
[10] Yoo, C., Cooper, G., "An Evaluation of a System that Recommends
    Microarray Experiments to Perform to Discover Gene-Regulation
    Pathways", Journal of Artificial Intelligence in Medicine,
    Vol. 31(2), 2004, pp. 169-182.
[11] Stetter, M., "Large-Scale Computational Modeling of Genetic
    Regulatory Networks", Artificial Intelligence Review 20, 2003,
    pp. 75-93.
[12] National Center for Biotechnology Information: GenBank Overview,
    www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html,
    Retrieved: Oct. 27, 2004.