Machine Learning Applications in Systems Biology

Natasha Alves, M.A.Sc. Candidate, ECE

Manuscript received November 1, 2004

Abstract: Recent advances in high-throughput technologies have led to an immense flow of biological data. Extracting the information hidden in the ever-expanding biological databases has been an obstacle in the progress of systems biology. Machine Learning has proved to be an efficient and inexpensive approach to organizing data, developing new tools to analyze data, and discovering new knowledge from data. This paper introduces Machine Learning techniques like inductive logic programming, clustering, Bayesian networks, and decision trees in the context of their applications in systems biology. The shortcomings of these Machine Learning techniques are also addressed.

Index Terms: Artificial Intelligence, Bayesian Networks, Clustering, Decision Trees, Inductive Logic Programming, Machine Learning, Systems Biology.

I. INTRODUCTION

Systems Biology is an in-depth, systems-level analysis of biological systems grounded on the molecular level [1]. It is different from other methods of biological study where the focus is on the characteristics of isolated parts of a cell or organism. Systems biology examines the structure and dynamics of cellular and organism functions, and their interconnections and interrelationships. One ultimate goal of systems biology is to use the knowledge of the complete genome sequence and all proteins encoded by that genome to reconstruct the biological systems that are implied [2].

The development of systems biology is driven by technology. Sophisticated computational techniques are needed to analyze biological systems because of the complexity and dynamics involved. Machine Learning, which is an automatic and intelligent learning technique, has long been used to discover meaningful associations between proteins, and for scientific hypothesis formation [3].

The aim of this paper is to introduce Machine Learning techniques in the context of their application in systems biology.

II. CHALLENGES IN SYSTEMS BIOLOGY

Much of our failure to fully understand biological systems has been due to their size and complexity. Systems biology emphasizes large-scale discovery of the interactions of genes, proteins, and other cell elements. It is confronted with dynamic biological responses, a huge number of interactions, and inherent redundancy in many pathways and feedback systems.

A lot of useful and important information about biological systems is hidden in high volumes of experimental data. For instance, there are 37 billion bases of DNA in about 32 million sequence records in GenBank alone (Feb. 2004) [12]. Analyzing high volumes of data to understand biological systems demands tedious experimentation and modern computational technology. This is the grand challenge for systems biology in this era.

An intelligent approach is needed to extract the hidden information from the data and to cope with the rapid rate of data deposition.

III. MACHINE LEARNING

Machine Learning (ML) is the capability of computer algorithms to improve automatically through experience (i.e. the computer programs itself by seeing examples of the behavior we want). ML approaches are ideally suited for domains characterized by the presence of large amounts of data, noisy patterns and the absence of general theories [4]. The fundamental idea behind these approaches is to learn the theory automatically from the data through a process of inference and model fitting. A system that can learn from experience and improve its performance automatically could serve as a tool for analyzing biological systems.

The main goal of ML is to induce general functions from a specific training data set. The learning agent is given a set of training examples, and it defines the hypothesis for them. The agent must search through the hypothesis space and locate the best hypothesis when given the test set [5].

Because ML is concerned with learning from data examples, it often uses a probabilistic approach.

IV. OVERCOMING THE CHALLENGES IN SYSTEMS BIOLOGY

ML approaches have gained popularity in systems biology. The characteristics of ML that make it well suited for systems biology are:
  1. Many problems in biological systems are not well defined, but have a lot of experimental data. ML is useful when the structure of the task is not well understood but the task can be characterized by a data set with strong statistical regularity. While input/output pairs can be easily specified, the relationship between the inputs and outputs is often unknown (e.g. the protein folding mechanism). ML approaches can extract relationships and correlations hidden under
     large volumes of data (data mining). They can thus extract the information encoded in biological databases and use the available data to predict meaningful biological properties.
  2. ML approaches can adjust their internal structure to produce correct outputs for a large number of sample inputs. They can thus constrain their input/output function to approximate the implicit relationship in the training examples [6].
  3. ML approaches adapt themselves to new information (training examples). This is important in systems biology because new data are generated every day. The newly generated data might update the initial learning hypotheses.

ML thus provides efficient approaches to analyze biological data.

V. MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY

A variety of ML techniques can be applied to many of the problems in systems biology. In a systems biology context, ML is used to discover meaningful knowledge from existing biological databases and to present that knowledge in an understandable pattern. The tasks of ML in systems biology can be divided into seven categories as shown in Table I [5]. These techniques, operating individually or in combination, can meet the various challenges in systems biology.

                              TABLE I
        MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY [5]

      Application            Description
 1    Classification         Predicting an item's class.
 2    Forecasting            Predicting a parameter value.
 3    Clustering             Finding groups of items.
 4    Description            Describing a group.
 5    Deviation Detection    Finding changes.
 6    Link Analysis          Finding relationships.
 7    Visualization          Presenting data visually to facilitate human discovery.

Machine learning approaches to protein structure prediction and gene pathway discovery are examined in the following sections.

A. Protein Structure Prediction

Proteins are the essence of life. The secondary structure of a protein consists of α-helices, β-strands and coils. The folding of these secondary structure elements forms the unique 3D structure of a protein. A lot of useful information is contained in this 3D structure. However, predicting protein structure remains a central problem in bioinformatics. It is the bottleneck between sequencing efforts and drug design. ML approaches like Inductive Logic Programming can be used to predict protein structure.

1) Inductive Logic Programming

Inductive logic programming (ILP) is a research area formed at the intersection of ML and Logic Programming. ILP systems develop predicate descriptions from observations and background knowledge. There are three main elements in an ILP learning system: observations, background knowledge, and hypothesis [5]. Each of these elements of ILP is a logic program. Fig. 1 shows the general scheme for ILP methods. Observations and background knowledge are combined by an ILP program to form a hypothesis. A set of IF-THEN rules can then be derived from the hypothesis. For example:

   Hypothesis: fold('Four-helical up-and-down bundle',P) :-
      helix(P,H1), length(H1,hi), position(P,H1,Pos),
      interval(1 =< Pos =< 3), adjacent(P,H1,H2),
      helix(P,H2).

   Rule: The protein P has fold class 'Four-helical up-and-down bundle' if it contains a long helix H1 at a secondary structure position between 1 and 3, and H1 is followed by a second helix H2. [5]

The rules are tested on additional data. If experimentation leads to high confidence in the validity of the hypothesis, the hypothesis is added to the background knowledge.

Figure 1. Scheme for ILP Methods [7]

ILP has been used for protein structure prediction. Muggleton et al. implemented ILP by separating the data set of proteins into groups of the same type of domain structure (e.g. α-type domains). This gave the system a more homogeneous data set, thus allowing better prediction [7]. The ILP program used in this method was Golem. The basic algorithm was as follows:
  1. Take a random sample of pairs of residues from the training set. This represents a set of pairs of residues chosen randomly from the set of all residues in all proteins represented.
  2. Compute all the common properties for each pair of residues.
  3. Convert the common properties into a rule that is true for the residue pair under consideration.
  4. Choose the rule for the best residue pair. For example, choose the rule that predicts the most true α-helix residues while predicting less than a pre-defined threshold of non-α-helix residues from the training set.
  5. Take another sample of unpredicted residue pairs.
  6. Form rules which express the common properties of the best pair together with each of the individual residue pairs in the sample.
  7. Repeat steps 4-6 until no improvement in prediction is produced.

The algorithm uses the best rule to eliminate a set of predicted residues from the training set. The reduced training set is then used to build up further rules. The process terminates when no further rules can be found.

Golem produced an accuracy of about 81% when applied to 16 proteins with α-type domains.

The disadvantage of ILP is the lack of probability in its rules. Biological systems are characterized by a high degree of uncertainty; thus, the hypotheses will have a higher descriptive power if they incorporate a certain degree of probability [5].

To date, ML methods cannot, by themselves, completely describe a new protein's structure; however, they can provide valuable information regarding numerous structural attributes.

B. Gene Pathway Discovery

Systems biology seeks to discover causal relationships among a large number of genes and other cellular constituents. From a system-level point of view, the various interactions and control loops, which form a genetic network, represent the basis upon which the vast complexity and flexibility of life processes emerge.

ML techniques like clustering, Bayesian networks and decision trees can be used to discover gene regulation pathways.

1) Gene Clustering

Clustering is a discovery approach that organizes and identifies subsets of data and groups them into classes. Each class represents data with similar attributes. A derivative clustering algorithm can also be used to predict and explain complex data.

Clustering algorithms are used to discover groups of genes that show similar expression patterns under different experimental conditions. By this procedure, different families of cell-cycle regulated genes in the baker's yeast, Saccharomyces cerevisiae, have been identified [8].

Gene clustering has several drawbacks. Firstly, the assignment of genes to single clusters by most clustering methods potentially prevents the exposure of complex interrelationships among genes. Secondly, clustering does not always provide causal information. Genes sharing similar expression profiles may not always share a function. Even when similar expression levels correspond to similar functions, the functional relationships among genes in a cluster cannot be determined from the cluster data alone [9]. Conversely, a gene may be suppressed to allow another to be expressed; thus, functionally related genes may be clustered separately, blurring the existing relationship.

A system named GEEVE, introduced by Yoo and Cooper, uses gene expression data to learn the models of gene regulation and thus discover causal gene pathways [10]. The GEEVE system, shown in Fig. 2, consists of two modules: the causal Bayesian network update module, and the decision tree generation and evaluation module.

Figure 2: The GEEVE system [10]

2) Causal Bayesian Networks

A Bayesian network is a directed, acyclic graph of nodes representing variables and arcs representing dependencies among the variables. A Bayesian network encodes the joint probability distribution over all the variables. The joint distribution of a Bayesian network with N variables can be factored as follows:

$$P(x_1, x_2, \ldots, x_N \mid K) = \prod_{i=1}^{N} P(x_i \mid \pi_i, K), \tag{1}$$

where $x_i$ is the state of variable $X_i$, $\pi_i$ is a joint state of the parents of $X_i$, and $K$ denotes background knowledge [10].

Bayesian networks are capable of handling incomplete data sets, and are able to learn and predict the missing data. They also provide models of causal influence. These properties make Bayesian networks a promising tool for analyzing gene expression patterns.

In the context of genetic pathway inference, each node of a Bayesian network is assigned to a gene, and can assume the different expression levels of this gene throughout the training data. Each edge between the nodes (genes) denotes a regulatory relationship between them. If the edge is directed, as shown in Fig. 3, it denotes that one gene controls the other. Fig. 4 shows the feature graph trained for a genetic sub-network of the baker's yeast.
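As a rough illustration of how a Bayesian network factors a joint distribution into one conditional term per variable given its parents, the following minimal Python sketch evaluates that product for a toy network. The three gene names, the two expression levels ("high"/"low"), the network structure, and all probability values are invented for illustration and are not taken from [10]; the background knowledge K is treated as implicit in the fixed structure and tables.

```python
from itertools import product

# Hypothetical three-gene network: G1 -> G2 and G1 -> G3.
# Structure and numbers are illustrative assumptions only.
parents = {"G1": (), "G2": ("G1",), "G3": ("G1",)}

# cpt[gene][parent_states] = P(gene = "high" | parent_states)
cpt = {
    "G1": {(): 0.3},
    "G2": {("high",): 0.9, ("low",): 0.2},
    "G3": {("high",): 0.1, ("low",): 0.6},
}


def conditional(gene, state, assignment):
    """P(gene = state | states of the gene's parents in `assignment`)."""
    parent_states = tuple(assignment[p] for p in parents[gene])
    p_high = cpt[gene][parent_states]
    return p_high if state == "high" else 1.0 - p_high


def joint(assignment):
    """Joint probability as the product of per-variable conditionals."""
    prob = 1.0
    for gene, state in assignment.items():
        prob *= conditional(gene, state, assignment)
    return prob


# P(G1=high) * P(G2=high | G1=high) * P(G3=low | G1=high)
example = {"G1": "high", "G2": "high", "G3": "low"}
p_example = joint(example)

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(
    joint(dict(zip(parents, states)))
    for states in product(("high", "low"), repeat=len(parents))
)
```

Because each gene only needs a table indexed by its parents' states, the number of parameters grows with the size of the largest parent set rather than with the total number of genes, which is one reason the factored form is attractive for expression data.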
Figure 3: The structure of a causal Bayesian network that represents a portion of a hypothetical gene regulation pathway [10]

Figure 4: Genetic sub-networks of the baker's yeast. [11]

While Bayesian networks produce better results than rule-based learning methods, there is no clear explanation of the learning process. It is therefore hard to understand the results and to interpret them as useful knowledge [5].

3) Decision Trees

The decision tree is a simple inductive learning system that uses discrete-valued functions to estimate and classify the provided training set. The system is represented by a tree whose internal nodes are tests (boolean decisions) and whose leaf nodes are classes. The tree can make predictions about the probability of a particular case belonging to a particular class.

Decision trees can be used to model gene perturbation in experiments. The GEEVE system, for example, builds and evaluates a decision tree based on pair-wise gene relationships. Thus, the effects on gene X when gene Y is perturbed can be modeled [11].

The drawbacks of decision trees are over-fitting of data and overlapping in the classes. These and other factors make decision trees difficult to optimize.

VI. CONCLUSION

Since ML primarily deals with the extraction of knowledge from data, redundancy of data is an important issue facing ML. The quality of biological data is usually compromised by experimental errors, wrong interpretation by biologists, non-standardized experimental techniques, etc. The uncertainty associated with experiment-based research is very high.

Despite these challenges, ML techniques have contributed to the success of systems biology in recent years. ML has helped accelerate research in several areas of systems biology including protein structure prediction, inference of genetic and molecular networks, and gene-protein interactions.

The author believes that systems biology will continue to benefit from ML techniques in coming years.

REFERENCES

[1] Kitano, H., "Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology", Curr. Genet., Vol. 41(1), 2002, pp. 1-10.
[2] Lathrop, R., "Intelligent Systems in Biology: Why the Excitement?", IEEE Intelligent Systems, Vol. 16(6), 2001, pp. 8-13.
[3] Luke, S., Hamahashi, S., Kyoda, K., Ueda, H., "Biology: see it again-for the first time", IEEE Intelligent Systems, Vol. 13(5), 1998, pp. 6-8.
[4] Hu, Y., Kibler, D., "Combinatorial motif analysis and hypothesis generation on a genomic scale", Bioinformatics, Vol. 16(3), 2000, pp. 222-32.
[5] Tan, A., Gilbert, G., "Machine Learning and its Application to Bioinformatics: An Overview", www.brc.dcs.gla.ac.uk/~actan/publications.html, Retrieved: Oct. 27, 2004.
[6] Nilsson, N., "Introduction to Machine Learning", unpublished, http://robotics.stanford.edu/people/nilsson/mlbook.html, 1996, Retrieved: Oct. 27, 2004.
[7] Muggleton, S., King, R., Sternberg, M., "Using logic for protein structure prediction", Proceedings of the 25th Hawaii Int. Conf. on System Sciences, IEEE Computer Society Press, 1992.
[8] Spellman, P.T., "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization", Molecular Biology of the Cell, 1998, pp. 3273-3297.
[9] Shatkay, H., Edwards, S., Boguski, M., "Information retrieval meets gene analysis", IEEE Intelligent Systems, Vol. 17(2), 2002, pp. 45-53.
[10] Yoo, C., Cooper, G., "An Evaluation of a System that Recommends Microarray Experiments to Perform to Discover Gene-Regulation Pathways", Journal of Artificial Intelligence in Medicine, Vol. 31(2), 2004, pp. 169-182.
[11] Stetter, M., "Large-Scale Computational Modeling of Genetic Regulatory Networks", Artificial Intelligence Review 20, 2003, pp. 75-93.
[12] National Center for Biotechnology Information: GenBank Overview, www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html, Retrieved: Oct. 27, 2004.

More Related Content

What's hot

Berlin center for genome based bioinformatics koch05
Berlin center for genome based bioinformatics   koch05Berlin center for genome based bioinformatics   koch05
Berlin center for genome based bioinformatics koch05Slava Karpov
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningjaumebp
 
The physics behind systems biology
The physics behind systems biologyThe physics behind systems biology
The physics behind systems biologyImam Rosadi
 
Automatically Generating Wikipedia Articles: A Structure-Aware Approach
Automatically Generating Wikipedia Articles:  A Structure-Aware ApproachAutomatically Generating Wikipedia Articles:  A Structure-Aware Approach
Automatically Generating Wikipedia Articles: A Structure-Aware ApproachGeorge Ang
 
Computational of Bioinformatics
Computational of BioinformaticsComputational of Bioinformatics
Computational of Bioinformaticsijtsrd
 
Performance Evaluation of Neural Classifiers Through Confusion Matrices To Di...
Performance Evaluation of Neural Classifiers Through Confusion Matrices To Di...Performance Evaluation of Neural Classifiers Through Confusion Matrices To Di...
Performance Evaluation of Neural Classifiers Through Confusion Matrices To Di...Waqas Tariq
 
Poster Semantic data integration proof of concept
Poster Semantic data integration proof of conceptPoster Semantic data integration proof of concept
Poster Semantic data integration proof of conceptNicolas Bertrand
 
Applying Soft Computing Techniques in Information Retrieval
Applying Soft Computing Techniques in Information RetrievalApplying Soft Computing Techniques in Information Retrieval
Applying Soft Computing Techniques in Information RetrievalIJAEMSJORNAL
 
USING ARTIFICIAL NEURAL NETWORK IN DIAGNOSIS OF THYROID DISEASE: A CASE STUDY
USING ARTIFICIAL NEURAL NETWORK IN DIAGNOSIS OF THYROID DISEASE: A CASE STUDYUSING ARTIFICIAL NEURAL NETWORK IN DIAGNOSIS OF THYROID DISEASE: A CASE STUDY
USING ARTIFICIAL NEURAL NETWORK IN DIAGNOSIS OF THYROID DISEASE: A CASE STUDYijcsa
 
A scenario based approach for dealing with
A scenario based approach for dealing withA scenario based approach for dealing with
A scenario based approach for dealing withijcsa
 
Bibliography (Microsoft Word, 61k)
Bibliography (Microsoft Word, 61k)Bibliography (Microsoft Word, 61k)
Bibliography (Microsoft Word, 61k)butest
 
IRJET- Image Classification using Deep Learning Neural Networks for Brain...
IRJET-  	  Image Classification using Deep Learning Neural Networks for Brain...IRJET-  	  Image Classification using Deep Learning Neural Networks for Brain...
IRJET- Image Classification using Deep Learning Neural Networks for Brain...IRJET Journal
 
NanoAgents: Molecular Docking Using Multi-Agent Technology
NanoAgents: Molecular Docking Using Multi-Agent TechnologyNanoAgents: Molecular Docking Using Multi-Agent Technology
NanoAgents: Molecular Docking Using Multi-Agent TechnologyCSCJournals
 
Application and Implementation of different deep learning
Application and Implementation of different deep learningApplication and Implementation of different deep learning
Application and Implementation of different deep learningJIEJackyZOUChou
 
Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learni...
Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learni...Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learni...
Intuidex - To be or not to be iid by William M. Pottenger (NYC Machine Learni...Hakka Labs
 
Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integratio...
Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integratio...Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integratio...
Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integratio...Dmitry Grapov
 

What's hot (20)

Berlin center for genome based bioinformatics koch05
Berlin center for genome based bioinformatics   koch05Berlin center for genome based bioinformatics   koch05
Berlin center for genome based bioinformatics koch05
 
Knowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learningKnowledge extraction and visualisation using rule-based machine learning
Knowledge extraction and visualisation using rule-based machine learning
 
The physics behind systems biology
The physics behind systems biologyThe physics behind systems biology
NatashaBME1450.doc

  • 1. 1 Machine Learning Applications in Systems Biology Natasha Alves, M.A.Sc. Candidate, ECE inherent redundancy in many pathways and feedback Abstract— Recent advances in high-throughput systems. technologies have led to an immense flow of biological A lot of useful and important information about biological data. Extracting the information hidden in the ever- systems is hidden in high volumes of experimental data. For expanding biological databases has been an obstacle in the instance, there are 37 billion bases of DNA in 32,000 progress of systems biology. Machine Learning has sequence records in GenBank alone (Feb. 2004)[12]. proved to be an efficient and inexpensive approach to Analyzing high volumes of data to understand biological organizing data; developing new tools to analyze data; systems demands tedious experimentation and modern and discovering new knowledge from data. This paper computational technology. This is the grand challenge for introduces Machine Learning techniques like inductive systems biology in this era. logic programming, clustering, Bayesian networks, and An intelligent approach is needed to extract the hidden decision trees in the context of their applications in information from the data and to cope with the rapid rate of systems biology. The shortcomings of these Machine data deposition. Learning techniques are also addressed. Index Terms—Artificial Intelligence, Bayesian Networks, Clustering, Decision Trees, Inductive Logic Programming, III.MACHINE LEARNING Machine Learning, Systems Biology. Machine Learning (ML) is the capability of computer algorithms to improve automatically through experience (i.e. the computer programs itself by seeing examples of the I.INTRODUCTION behavior we want). ML approaches are ideally suited for Systems Biology is an in-depth, systems-level analysis of domains characterized by the presence of large amounts of biological systems grounded on the molecular level [1]. 
It is data, noisy patterns and the absence of general theories [4]. different from other methods of biological study where the The fundamental idea behind these approaches is to learn the focus is on the characteristics of isolated parts of a cell or theory automatically from the data through a process of organism. Systems biology examines the structure and inference and model fitting. A system that can learn from dynamics of cellular and organism functions, and their experience and improve its performance automatically could interconnections and interrelationships. One ultimate goal of serve as a tool for solving biological systems. systems biology is to use the knowledge of the complete The main goal of ML is to induce general functions from a genome sequence and all proteins encoded by that genome to specific training data set. The learning agent is given a set of reconstruct the biological systems that are implied [2]. training examples, and it defines the hypothesis for them. The development of systems biology is driven by The agent must search through the hypothesis space and technology. Sophisticated computational techniques are locate the best hypothesis when given the test set [5]. needed to analyze biological systems because of the Because ML is concerned with learning from data complexity and dynamics involved. Machine Learning, examples, it often uses a probabilistic approach. which is an automatic and intelligent learning technique, has for long been used to discover meaningful associations between proteins, and for scientific hypothesis formation [3]. IV.OVERCOMING THE CHALLENGES IN SYSTEMS BIOLOGY The aim of this paper is to introduce Machine Learning ML approaches have gained popularity in systems biology. techniques in the context of their application in systems The characteristics of ML that make it well suited for biology. systems biology are: 1. Many problems in biological systems are not well defined, but have a lot of experimental data. 
ML is II.CHALLENGES IN SYSTEMS BIOLOGY useful when the structure of the task is not well Much of our failure to fully understand biological systems understood but the task can be characterized by a data has been due to their size and complexity. Systems biology set with strong statistic regularity. While input/output emphasizes on large-scale discovery of the interactions of pairs can be easily specified, the relationship between genes, proteins, and other cell elements. It is confronted with the inputs and outputs are often unknown (e.g. the dynamic biological responses, a huge number of interactions, protein folding mechanism). ML approaches can extract relationships and correlations hidden under  Manuscript received November 1, 2004
  • 2. 2 large volumes of data (data mining). It could thus 1)Inductive Logic Programming extract the information encoded in biological Inductive logic programming (ILP) is a research area databases and use the available data to predict formed at the intersection of ML and Logic Programming. meaningful biological properties. ILP systems develop predicate descriptions from 2. ML approaches can adjust their internal structure to observations and background knowledge. There are three produce correct outputs for a large number of sample main elements in an ILP learning system: observations, inputs. They can thus constrain their input/output background knowledge, and hypothesis [5]. Each of these function to approximate the implicit relationship in elements of ILP is a logic program. Fig.1 shows the general the training examples [6]. scheme for ILP methods. Observations and background 3. ML approaches adapt themselves to new information knowledge are combined by an ILP program to form a (training examples). This is important in systems hypothesis. A set of IF – THEN rules can then be derived biology because new data are generated every day. from the hypothesis. For example: The newly generated data might update the initial Hypothesis: fold('Four-helical up-and-down bundle',P) :- learning hypotheses. helix(P,H1), length(H1,hi), position(P,H1,Pos), ML thus provides efficient approaches to analyze interval(1 =< Pos =< 3), adjacent(P,H1,H2), biological data. helix(P,H2). Rule: The protein P has fold class ‘Four-helical up-and- down bundle’ if it contains a long helix H1 at a secondary V.MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY structure position between 1 and 3, and H1 is followed by a A variety of ML techniques can be used to solve most of second helix H2. [5] the problems in systems biology. In a systems biology The rules are tested on additional data. 
If experimentation context, ML is used to discover meaningful knowledge from leads to high confidence in the hypothesis validity the existing biological databases and to present that knowledge hypothesis is added to the background knowledge. in an understandable pattern. The tasks of ML in systems biology can be divided into seven categories as shown in Table 1 [5]. These techniques, operating individually or in combination, can meet the various challenges in systems biology. TABLE I MACHINE LEARNING APPLICATIONS IN SYSTEMS BIOLOGY [5] Application Description 1 Classification Predicting an item’s class. 2 Forecasting Predicting a parameter value. 3 Clustering Finding groups of items. 4 Description Describing a group. Figure 1. Scheme for ILP Methods [7] 5 Deviation Finding changes. Detection ILP has been used for protein structure prediction. 6 Link Analysis Finding relationships. Muggleton et al. implemented ILP by separating the data set 7 Visualization Presenting data visually to of proteins into groups of the same type of domain structure facilitate human discovery. (ex. α-type domains). This allowed the system to have a more homogenous data set, thus allowing better prediction Machine learning approaches to protein structure [7]. The ILP program used in this method was Golem. The prediction and gene pathway discovery in are examined in basic algorithm was as follows: the following sections. 1. Take a random sample of pairs of residues from the training set. This represents a set of pairs of residues A.Protein Structure Prediction chosen randomly from the set of all residues in all Proteins are the essence of life. The secondary structure of proteins represented. protein consists of α-helices, β-strands and coils. The folding 2. Compute all the common properties for each pair of of these secondary structure elements forms the unique 3D residues. structure of a protein. A lot of useful information is contained 3. 
Convert the common properties into a rule that is true in this 3D structure. However, predicting proteins’ structure for the residue pair under consideration. is a central problem in bioinformatics. It is the bottleneck 4. Choose the rule for the best residue pair. For example, between sequencing efforts and drug design. ML approaches choose the rule that predicts the most true α-helix like Inductive Logic Programming can be used to predict residues while predicting less than a pre-defined protein structure. threshold of non-α-helix residues from the training set. 5. Take another sample of unpredicted residue pairs.
  • 3. 3 6. Form rules which express the common properties of regulation and thus discover causal gene pathways [10]. The the best pair together with each of the individual GEEVE system, shown in Fig.2, consists of two modules: the residue pairs in the sample. causal Bayesian network update module, and the decision 7. Repeat steps 4-6 until no improvement in prediction is tree generation and evaluation module. produced. The algorithm uses the best rule to eliminate a set of predicted residues from the training set. The reduced training set is then used to build up further rules. The process terminates when no further rules can be found. Golem produced an accuracy of about 81% when applied to 16 proteins with α-type domains. The disadvantage of ILP is the lack of probability in its rules. Biological systems are characterized by a high degree of uncertainty; thus, the hypotheses will have a higher descriptive power if they incorporate a certain degree of probability [5]. To date, ML methods cannot, by themselves, completely describe a new protein’s structure; however, they can provide valuable information regarding numerous structural attributes. B.Gene Pathway Discovery Systems biology seeks to discover causal relationships among a large number of genes and other cellular Figure 2: The GEEVE system [10] 2)Causal Bayesian Networks constituents. From a system-level point of view, the various A Bayesian network is a directed, acyclic graph of nodes interactions and control loops, which form a genetic network, representing variables and arcs representing dependencies the represent the basis upon which the vast complexity and variables. A Bayesian network encodes the joint probability flexibility of life processes emerges. distribution over all the variables. The joint distribution of a ML techniques like clustering, Bayesian networks and Bayesian network with N variables can be factored as decision trees can be used to discover gene regulation follows: pathways. 
P(x1, x2,…., xN| K) = , (1) where xi is the state of variable Xi, πi is a joint state of the 1)Gene Clustering parents of Xi, and K denotes background knowledge [10]. Clustering is a discovery approach that organizes and Bayesian networks are capable of handling incomplete identifies subsets of data and groups them into classes. Each data sets, and are able to learn and predict the missing data. class represents data with similar attributes. A derivative They also provide models of causal influence. These clustering algorithm can also be used to predict and explain properties make Bayesian networks a promising tool for complex data. analyzing gene expression patterns. Clustering algorithms are used to discover groups of genes In the context of genetic pathway inference, each node of a that show similar expression patterns under different Bayesian network is assigned to a gene, and can assume the experimental conditions. By this procedure, different families different expression levels of this gene throughout the of cell-cycle regulated genes in the bakers’ yeast, training data. Each edge between the nodes (genes) denotes a Saccharomyces Cerevisiae, have been identified [8]. regulatory relationship between them. If the edge is directed, Gene clustering has several drawbacks. Firstly, the as shown in Fig 3, it denotes that one gene controls the other. assignment of genes to single clusters by most clustering Fig.4 shows the feature graph trained for a genetic sub- methods potentially prevents the exposure of complex network of the bakers’ yeast. interrelationships among genes. Secondly, clustering does not always provide causal information. Genes sharing similar expression profiles may not always share a function. Even when similar expression levels correspond to similar functions, the functional relationships among genes in a cluster cannot be determined from the cluster data alone [9]. 
In contrast, a gene may be suppressed to allow another to be expressed; thus, functionally related genes may be clustered separately, blurring the existing relationship. A system named GEEVE, introduced by Yoo and Cooper, uses gene expression data to learn the models of gene
Figure 3: The structure of a causal Bayesian network that represents a portion of a hypothetical gene regulation pathway [10]

Figure 4: Genetic sub-networks of baker's yeast [11]

While Bayesian networks produce better results than rule-based learning methods, there is no clear explanation of the learning process. It is therefore hard to understand the results and to interpret them as useful knowledge [5].

3)Decision Trees
The decision tree is a simple inductive learning system that uses discrete-valued functions to estimate and classify the provided training set. The system is represented by a tree whose internal nodes are tests (boolean decisions) and whose leaf nodes are classes. The tree can make predictions about the probability of a particular case belonging to a particular class.

Decision trees can be used to model gene perturbation in experiments. The GEEVE system, for example, builds and evaluates a decision tree based on pair-wise gene relationships. Thus, the effects on gene X when gene Y is perturbed can be modeled [11].

The drawbacks of decision trees are over-fitting of data and overlapping in the classes. These and other factors make decision trees difficult to optimize.

VI.CONCLUSION
Since ML primarily deals with the extraction of knowledge from data, redundancy of data is an important issue facing ML. The quality of biological data is usually compromised by experimental errors, wrong interpretation by biologists, non-standardized experimental techniques, etc. The uncertainty associated with experiment-based research is very high.

Despite these challenges, ML techniques have prompted the success of systems biology in recent years. ML has helped accelerate research in several areas of systems biology including protein structure prediction, inference of genetic and molecular networks, and gene-protein interactions. The author believes that systems biology will continue to benefit from ML techniques in coming years.

REFERENCES
[1] Kitano, H., "Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology", Curr. Genet., Vol. 41(1), 2002, pp. 1-10.
[2] Lathrop, R., "Intelligent Systems in Biology: Why the Excitement?", IEEE Intelligent Systems, Vol. 16(6), 2001, pp. 8-13.
[3] Luke, S., Hamahashi, S., Kyoda, K., Ueda, H., "Biology: see it again-for the first time", IEEE Intelligent Systems, Vol. 13(5), 1998, pp. 6-8.
[4] Hu, Y., Kibler, D., "Combinatorial motif analysis and hypothesis generation on a genomic scale", Bioinformatics, Vol. 16(3), 2000, pp. 222-32.
[5] Tan, A., Gilbert, G., "Machine Learning and its Application to Bioinformatics: An Overview", www.brc.dcs.gla.ac.uk/~actan/publications.html, Retrieved: Oct. 27, 2004.
[6] Nilsson, N., "Introduction to Machine Learning", unpublished, http://robotics.stanford.edu/people/nilsson/mlbook.html, 1996, Retrieved: Oct. 27, 2004.
[7] Muggleton, S., King, R., Sternberg, M., "Using logic for protein structure prediction", Proceedings of the 25th Hawaii Int. Conf. on System Sciences, IEEE Computer Society Press, 1992.
[8] Spellman, P.T., "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization", Molecular Biology of the Cell, 1998, pp. 3273-3297.
[9] Shatkay, H., Edwards, S., Boguski, M., "Information retrieval meets gene analysis", IEEE Intelligent Systems, Vol. 17(2), 2002, pp. 45-53.
[10] Yoo, C., Cooper, G., "An Evaluation of a System that Recommends Microarray Experiments to Perform to Discover Gene-Regulation Pathways", Journal of Artificial Intelligence in Medicine, Vol. 31(2), 2004, pp. 169-182.
[11] Stetter, M., "Large-Scale Computational Modeling of Genetic Regulatory Networks", Artificial Intelligence Review 20, 2003, pp. 75-93.
[12] National Center for Biotechnology Information: GenBank Overview, www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html, Retrieved: Oct. 27, 2004.