Artificial Intelligence in Systems Biology

Hai Huang, Master Student, IBBME

Manuscript received October 21, 2003. H. Huang is with the Institute of Biomaterials and Biomedical Engineering, University of Toronto, Canada (corresponding author; e-mail: hai.huang@utoronto.ca).

Abstract: Systems biology extends the perspective from individual biological components to the system level. This development requires advanced modelling skills and data processing techniques, and artificial intelligence (AI) could help meet this demand. AI has shown its power in modelling multiple alignments of genes and in phylogenetic likelihood inference. Its active learning algorithms could accelerate the evolution of systems biology.

Index Terms: Systems biology, system structure, system dynamic, artificial intelligence, knowledge and reasoning, machine learning.

I. INTRODUCTION

Systems biology is a system-level understanding of biology [1], which was first introduced about 50 years ago. Compared to traditional biology, it is still in its infancy. As an emerging science, however, it has recently shown its potential to shape the direction of molecular, genomic, and pharmacological research. Further advancement in systems biology is not free of obstacles, and technologies from other scientific fields are needed to assist its breakthroughs. Artificial Intelligence (AI), as one of these assisting tools, has demonstrated its potential in overcoming the difficulties faced by systems biology. Some concepts of AI have already been applied in systems biology, while others are beginning to be utilized.

The purpose of this paper is to provide an overview of systems biology and AI. The application of AI in systems biology is introduced, and the likely future relationship between AI and systems biology is discussed.

II. CHALLENGES IN SYSTEMS BIOLOGY

The understanding of system-level biology is derived from insight into four key elements: system structure, system dynamic, control method, and design method [1]. Progress has been made in each of these areas since the emergence of systems biology, but every step forward has been difficult.

System structure is not a list of isolated components of a cell or organism; it is more about the relationships between these components [1]. From a biological viewpoint, however, describing those relationships clearly is very challenging. For example, the similarity of DNA between different species has a profound impact on evolutionary biology, yet in searching for this similarity J. P. Huelsenbeck et al. found the task very frustrating because the similarity was hidden in a flood of data [2]. Finding the clues in such voluminous data demands great experience, knowledge and patience, and human error will inevitably have a strong adverse influence on the outcome.

To understand system dynamic is to find out “how a system behaves over time and under various conditions” [1]. A biological system is far more complex than a mechanical system; sometimes the same chemical messenger can carry several signals simultaneously on different time scales [3]. This makes it hard to understand the roles of the different parallel processes and feedback mechanisms.

Based on knowledge of the structure and dynamic of a particular system, control and design methods can be used to control the state and modify the properties of the system [1]. For example, monitoring and controlling side effects are major issues in the development of new drugs, especially gene-protein target drugs [4]. Difficulties arise here because the target genes produce large amounts of proteins, some functions of which are unknown. Yet to control the therapeutic effects and design the drugs, it is essential to identify the unknown functions and eliminate the undesired ones.

III. ARTIFICIAL INTELLIGENCE

Facing all these challenges, systems biology adopts many techniques from other fields, such as systems engineering, information technology, and control theory. AI is a relatively new science to be incorporated in the development of systems biology.

AI emerged as a new science category in the 1950s, at the same time as the term “systems biology” was coined. It refers to thinking and acting as a human being, or at least thinking and acting rationally, rather than just imitating what a human being does [5]. If a system can only mimic a person’s actions, it is just a manipulator, not an actual artificial intelligence.

The Turing Test, proposed in 1950, was the first landmark of AI. In the Turing Test, an interrogator is connected to a person or a machine via a terminal, which prevents the interrogator from seeing the counterpart. The interrogator's task is to find out whether the counterpart is a machine by asking questions only [6]. If the machine can “fool” the interrogator, it is considered an intelligent entity. The Turing Test raised the possibility that a machine could act as a human being.

Another well-known milestone of AI was Deep Blue. In May 1997, IBM's Deep Blue supercomputer played a match against the World Chess Champion, Garry Kasparov, and won [7]. It revealed that machines were able to compete with human beings to some degree.
Generally speaking, the scope of AI covers all human activities, such as observing the environment, judging successful behaviour, seeking the proper method, and adjusting knowledge while interacting with the target. It can be classified into four categories: problem solving; knowledge and reasoning; machine learning; and autonomous planning, communicating, perceiving and acting [5].

Problem solving is the basis of AI. It takes a topological view of a problem. In the AI perspective, a problem like “How can one thing go from state A to state B?” can usually be solved by searching an existing database subject to constraints and conditions. This search can be target oriented, start-point oriented, or bidirectional.

Knowledge and reasoning is about understanding and identifying successful behaviour in a complex environment. It is the key component of AI, and it plays a crucial role in dealing with partially observable environments. Based on logic, probability, and statistical theory, two important models were developed. One is the Bayesian network and the other is the Hidden Markov Model, both of which are dominant in current AI. They are discussed in detail in Section IV of this paper.

Machine learning enables a system to adjust itself to the environment. Whether supervised or unsupervised, passive or active, machine learning aims to improve the system’s ability to act in the future. It is now the most important trend in the development of AI.

Autonomous planning, communicating, perceiving and acting carry the thinking part of AI into its acting part. They are the applications of problem solving, knowledge and reasoning, and machine learning.

Together, these four aspects make AI a very good tool to reduce human errors, improve efficiency, save time, and decrease costs, and thus allow it to be applied in overcoming the difficulties faced by systems biology.

IV. APPLICATION OF AI IN SYSTEMS BIOLOGY

Systems biology and AI developed in parallel as two distinct disciplines before the 1980s. In the past twenty years, however, rapid technological development has created opportunities for AI to be applied in systems biology. Advances in computer science and information technology give AI more powerful computer platforms as its tools. At the same time, new theoretical concepts and approaches in computer science enhance the theoretical development of AI. On the other hand, new technologies such as gene microarrays have been brought into systems biology. These technologies make it possible to digitize experimental results and improve the repeatability of tests. They also produce an enormous amount of data, so new methods are in high demand to process these data. AI has been acting as a useful tool in these situations.

Of the four components of AI, problem solving is the basis of the other three. Its application in systems biology is therefore embedded in the application of the other three aspects of AI. Being the best studied element of AI, knowledge and reasoning has a relatively wide application in systems biology today. Examples of its application at two different levels of systems biology are discussed in the following paragraphs. Some pioneering studies are also being done on the application of machine learning in systems biology. However, the application of autonomous planning, communicating, perceiving and acting has not yet been seen.

A. Bayesian Inference of Phylogeny

The idea that species are related is not new. More than a century ago, Darwin became one of the pioneers in the area of evolutionary biology. These pioneers intended to reveal a systematic structure from a biological point of view. In line with the current trend in biology, phylogeny has become more of a bioinformatics science: the large amount of molecular data transforms the question of the history of life into a statistical and computational problem. Many different inferential methods have been introduced into phylogenetic analysis to find the relationships between different biological classes. Among them, Bayesian inference, an important AI theory and application, is relatively new in this field, but it is a powerful tool for addressing a number of long-standing and complex questions in evolutionary biology. Table 1 lists some applications of Bayesian inference in phylogeny.

Table 1 Bayesian approach to problems in phylogeny

Problem | Bayesian approach
Inferring phylogeny | Find tree with maximum posterior probability; evaluate features in common among the sampled trees
Evaluating uncertainty in phylogenies | Evaluate clade probabilities; form credible set containing trees whose cumulative probability sums to 0.95
Detecting selection | Model substitution process on the codon and calculate probability of being in purifying or positively selected class; sample substitutions and count number of synonymous and nonsynonymous changes
Comparative analyses | Perform analysis on many trees, and weight results by the probability that each tree is correct
Divergence times | Use fossils as a calibration; infer divergence times by using a strict or relaxed molecular clock
Testing molecular clock | Calculate Bayes factor for the clock versus no clock branch length restrictions

Bayesian inference computes the posterior probability distribution for a set of query variables over a Bayesian network, which is able to represent the dependencies among variables and give a concise specification of any full joint probability distribution [5]. As a part of the knowledge and reasoning category of AI, this inference aims to identify the correct relationships between different elements. The basic expression of Bayesian theory is:

Pr[Tree | Data] = (Pr[Data | Tree] × Pr[Tree]) / Pr[Data]

In phylogeny, this expression is used to combine the prior probability of a phylogeny (Pr[Tree]) with the likelihood (Pr[Data | Tree]) to produce a posterior probability distribution on trees (Pr[Tree | Data]). Inferences about the history of the group are based on the posterior probability of
trees. The tree with the highest posterior probability might be chosen as the best estimate of phylogeny [2].

Huelsenbeck et al. implemented this approach with a numerical method for Bayesian inference, MCMC (Markov chain Monte Carlo). There were two important practical problems associated with the application of MCMC. One was the modelling assumption: a poorly fitted assumption would lead to a wrong inference. Their analyses assumed the general time reversible (GTR) model of DNA substitution, which allowed each nucleotide change to have its own rate and the nucleotide bases to have different frequencies. It allowed rates to vary across sites either by treating the rate as random or by dividing the sites into several codon positions. The other problem was to determine how long to run a chain to obtain a good approximation of the posterior probabilities of trees. In some cases the MCMC algorithm would fail to converge. Eventually they identified convergence by trial and error.

Based on a variant of MCMC called Metropolis-coupled MCMC, Huelsenbeck et al. designed a computer program [2]. They applied this program to four large phylogenetic data sets. The smallest data set included 106 wingless sequences sampled from insects, and the largest included 357 atpB sequences sampled from plants. Figure 1 shows the posterior probability of each clade, conditioned on the observed DNA sequences, for two chains, each starting from a different random tree. The posterior probabilities of an individual clade found in the different chains are highly correlated. There is no obvious correlation across clades. This result indicated that Bayesian inference could be a precise method in phylogenetic analysis.

Figure 1 Convergence of independent Markov chains

Huelsenbeck and his colleagues pointed out that Bayesian inference could be used as an important method in the study of molecular evolution, especially in the field of substitution patterns. Their next step was to construct a large tree or network for a better understanding of the evolution of genomes in the context of phylogeny.

Huelsenbeck’s study applies an important AI theory, Bayesian inference, to find the relationships between different species. This approach demonstrates that AI is able to recognize and build the structure of a complex biological system, such as an evolutionary tree.

B. Hidden Markov Model (HMM) in Biopolymers

The Hidden Markov Model builds on the Markov chain theory of the Russian mathematician A. A. Markov. It is a very influential modelling method in the knowledge and reasoning category of AI. An HMM is a temporal probabilistic model in which the state of the process is described by a single discrete random variable, whose possible values are the possible states of the world [5]. The structure of the HMM allows a simple and elegant implementation of the basic inference algorithms. HMMs are used to search for patterns and to detect phenomena in uncharacterized data. They were first used in speech recognition in the 1970s and 1980s. From the late 1990s, some genomic researchers started to use HMMs as an analysis tool. In the year 2002, M. Amitai et al. tried to use HMMs in the study of gene-finding and functional annotation [8].

A particular challenge of gene-finding and functional annotation is how to describe multiple alignments. Multiple alignments show the dynamic property of protein sequences. Finding multiple alignments can be done in a laboratory with real “wet” experiments, which are very expensive and time consuming.

Figure 2 An example of a multiple alignment.

To save cost and time, Amitai wanted to find the solutions from the existing public data (mainly genomic DNA, messenger RNA, and their corresponding protein sequences), which were available in large amounts. In GenBank alone, there were approximately 28,507,990,166 bases in 22,318,883 sequence records as of January 2003 [10]. It was impossible for a human being to find the hidden relationships among these billions of bases, so the HMM, as an AI model, was utilized. There were three reasons for using HMMs in modelling proteins and genes. First, HMMs had the advantage of precise probabilistic modelling. Second, the experience gained from the same tools in speech recognition could be utilized [8]. Third, some computer programs were well developed to build and apply HMMs. Among these programs, a few focused on the sequence analysis of proteins, such as HMMer and SAM [12, 13].

Figure 2 is a real example taken from the PDGF (platelet-derived growth factor) family. At position 17, half of the sequences have an amino acid, which is proline or arginine with equal probability. The other half have no amino acid at position 17 (called a deleted position) [8]. Although the statistical population is relatively small, knowledge of protein evolution suggests that a new member of the same family will behave similarly at the same position [11]. This similarity meets the assumption of the HMM. Figure 3 shows part of an HMM constructed from the multiple alignment. Here state M16 corresponds to position 16. From this state, there is a 50% probability of moving to D17 (deleted position) and a 50% probability of moving to M17. M17 is a clustered state that emits P (proline) with 50% probability and R (arginine) with 50% probability. The protein is then aligned to the HMM according to these probabilities. This model identifies the similarity with other proteins and predicts the multiple alignments for members of the same family. It draws a dynamic picture of the protein sequence.

Figure 3 Part of the HMM for the multiple alignment in Figure 2

In this example the HMM, a widely used model in AI, describes the multiple alignments in PDGF. It shows AI’s ability to find and understand the dynamics of a biological system.

C. Preliminary Application of Machine Learning

A very important aspect of AI is learning like a human being. This technique will be of great help in the modelling procedure. The model and the modelling assumptions are usually crucial in systems biology research. Several possible models may fit one topic, and finding the best choice is very difficult in most cases. Because no proper method was available to detect the validity of a model, Huelsenbeck and his colleagues had to use trial and error to determine convergence. In this case, a self-learning convergence module could have been built into their program to improve its efficiency.

Some pioneering studies are in progress. C. Yoo and G. F. Cooper introduced a system named GEEVE, which can automatically pick the best model to find a causal pathway in genes [9]. This system tries to recommend a model based on previous results and adjusts the recommendation according to recent evaluations [9].

V. CONCLUSION

AI technologies have proven beneficial to the development of systems biology. Problem solving is the basis of AI, and its importance is reflected in the application of all the other aspects of AI in systems biology. Knowledge and reasoning is currently the most widely applied. It helps in the identification of system structure, as in the example of Bayesian inference of phylogeny. It also shows its value in understanding system dynamics, as exemplified by the Hidden Markov Model in biopolymers. Machine learning is beginning to demonstrate its power in assisting control and design methods. In addition, applying AI to tasks such as building a knowledge base, choosing models, analyzing data and evaluating results is likely to be the trend of AI implementation in systems biology in the near future. The more complex application of autonomous planning, communicating, perceiving and acting is likely to happen after machine learning is well adopted in systems biology.

REFERENCES

[1] H. Kitano, “Systems biology: a brief overview,” Science, 2002, vol. 295, pp. 1662-1664.
[2] J. P. Huelsenbeck, F. Ronquist, and R. Nielsen, “Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology,” Science, 2002, vol. 294, pp. 2310-2318.
[3] N. C. Spitzer and T. J. Sejnowski, “Biological Information Processing: Bits of Progress,” Science, 2000, vol. 277, pp. 1060-1063.
[4] A. Renner and A. Aszodi, “High-throughput Functional Annotation of Novel Gene Products using Document Clustering,” Pacific Symposium on Biocomputing 2000, pp. 54-68.
[5] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Pearson Education, New Jersey, USA, 2003.
[6] A. M. Turing, “Computing Machinery and Intelligence,” Mind: A Quarterly Review of Psychology and Philosophy, 1950. Available online: http://www.abelard.org/turpap/turpap.htm
[7] IBM, “Deep Blue,” 1997. Available online: http://www.research.ibm.com/deepblue/
[8] M. Amitai, “Hidden Models in Biopolymers,” Science, 2001, vol. 282, pp. 1436-1440.
[9] C. Yoo and G. F. Cooper, “An Evaluation of a System that Recommends Microarray Experiments to Perform to Discover Gene-Regulation Pathways,” unpublished.
[10] NCBI, “What is GenBank?,” 2003. Available online: http://www.ncbi.nlm.nih.gov/Genbank/
[11] J. Sjolander et al., Comput. Appl. Biosci., 1996, vol. 12, pp. 327.
[12] S. Eddy, “HMMer: Profile HMMs for protein sequence analysis,” 2003. Available online: http://hmmer.wustl.edu/
[13] UCSC, “Sequence Alignment and Modelling System,” 2003. Available online: http://www.cse.ucsc.edu/research/compbio/sam.html
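As a rough illustration of the convergence question raised in Section IV (Huelsenbeck et al. compared chains started from different random trees and judged convergence by trial and error [2]), the following Python sketch automates that comparison on a toy problem. It runs two plain Metropolis chains, not the Metropolis-coupled MCMC program of [2], and stops once their estimated posterior probabilities agree. The four tree labels, the priors, the likelihoods, the batch size and the tolerance are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical toy problem: four candidate trees with made-up numbers.
PRIOR = {"T1": 0.25, "T2": 0.25, "T3": 0.25, "T4": 0.25}       # Pr[Tree]
LIKELIHOOD = {"T1": 0.02, "T2": 0.10, "T3": 0.06, "T4": 0.02}  # Pr[Data | Tree]


def exact_posterior():
    """Pr[Tree | Data] = Pr[Data | Tree] * Pr[Tree] / Pr[Data], by enumeration."""
    joint = {t: PRIOR[t] * LIKELIHOOD[t] for t in PRIOR}
    evidence = sum(joint.values())  # Pr[Data]
    return {t: joint[t] / evidence for t in joint}


def metropolis_batch(state, counts, steps, rng):
    """Advance one Metropolis chain by `steps`, updating visit counts in place."""
    trees = list(PRIOR)
    for _ in range(steps):
        proposal = rng.choice(trees)  # uniform, hence symmetric, proposal
        ratio = (PRIOR[proposal] * LIKELIHOOD[proposal]) / (
            PRIOR[state] * LIKELIHOOD[state]
        )
        if rng.random() < ratio:  # accept with probability min(1, ratio)
            state = proposal
        counts[state] += 1
    return state


def frequencies(counts):
    """Convert visit counts into estimated posterior probabilities."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def run_until_agreement(seed=0, batch=2000, tol=0.02, max_batches=50):
    """Run two chains from different starting trees until their estimates agree."""
    rng_a, rng_b = random.Random(seed), random.Random(seed + 1)
    state_a, state_b = "T1", "T4"  # different starting points
    counts_a = dict.fromkeys(PRIOR, 0)
    counts_b = dict.fromkeys(PRIOR, 0)
    for n in range(1, max_batches + 1):
        state_a = metropolis_batch(state_a, counts_a, batch, rng_a)
        state_b = metropolis_batch(state_b, counts_b, batch, rng_b)
        freq_a, freq_b = frequencies(counts_a), frequencies(counts_b)
        gap = max(abs(freq_a[t] - freq_b[t]) for t in PRIOR)
        if gap < tol:  # the two chains agree on every tree
            break
    return freq_a, freq_b, n * batch


if __name__ == "__main__":
    est_a, est_b, steps = run_until_agreement()
    print("exact  :", exact_posterior())
    print("chain A:", est_a)
    print("chain B:", est_b)
    print("steps per chain:", steps)
```

With these made-up numbers the exact posterior can be enumerated (0.1, 0.5, 0.3 and 0.1), so the chain estimates can be checked directly. For real phylogenies the exact posterior is unavailable, which is why agreement between independent chains serves as a practical, though not conclusive, sign of convergence.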