Microarray Analysis with Bayesian Networks

Using BayesiaLab for Cancer Type Classification



Stefan Conrady, stefan.conrady@conradyscience.com

Dr. Lionel Jouffe, jouffe@bayesia.com

March 15, 2011




Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting
Introduction to Bayesian Networks




Table of Contents

Introduction
    About the Authors
        Stefan Conrady
        Lionel Jouffe

Case Study & Tutorial
  Background
    Database
    Notation

  Classification Model
    Data Import
    Supervised Learning: Augmented Markov Blanket
    Performance Evaluation
    Network Complexity

  Inference
    Target Dynamic Profile
    Bayes Factor
    Target Interpretation Tree
    Conclusion

  Appendix
    Markov Blanket
    Comparison of Classification Performance with Golub et al. (1999)

  References

  Contact Information
        Conrady Applied Science, LLC
        Bayesia SAS

  Copyright




www.conradyscience.com | www.bayesia.com
Microarray Analysis with Bayesian Networks and BayesiaLab




Introduction

In our recent white paper about breast cancer classification, we have used Bayesian networks and BayesiaLab for feature
identification and prediction of class membership. This study was based on 569 cases and 10 attribute variables, which
allowed estimating a classification model with a very high prediction accuracy.

In this new study, we turn to the field of cancer classification by means of microarray analysis. Microarray analysis is a
technique for gene expression profiling of cell samples. Expression profiles indicate which genes are currently active
among thousands of genes. The activation of certain genes can indicate the type and the current state of a cell.

In our case, we want to use the expression profiles of cell samples from cancer patients to distinguish between different
types of leukemia. Leukemia is a type of cancer of the blood or bone marrow characterized by an abnormal increase of
white blood cells. Clinically and pathologically, leukemia can be divided into a number of groups, of which we will
examine two types of acute leukemia, namely acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia
(AML).

The correct classification of the subgroup of leukemia is critical for the selection of the most efficient therapy, which
may include chemotherapy and radiation, and for minimizing side effects. In general, the progress in correct cancer
classification in recent years has been crucial for improving the overall treatment success.

One of the challenges in microarray analysis is the sheer number of genes that could potentially be predictors in a
classification model. At the same time, the number of observations tends to be small. So, it is not uncommon to have
thousands of predictors while only having a few dozen samples. It is precisely the opposite of what one would hope
to have for a traditional statistical analysis.

As a result, many new statistical techniques have emerged in recent decades and one of them is described in detail in
Golub et al. (1999). This study demonstrates that cancer classification is feasible on the basis of gene expression data
alone. Since its publication, it has been widely cited and further disseminated, e.g. in Slonim et al. (2000) and Dudoit et
al. (2002). Also, the underlying dataset has been made publicly available to any interested researcher by the Broad
Institute.[1] Given the seminal nature of the Golub study and its excellent pedagogical qualities, we have chosen it as our
reference point for a new case study and BayesiaLab tutorial.

Our objective is to show that our modeling approach with Bayesian networks (as the framework) and BayesiaLab (as
the software tool) can quickly and effectively generate models of equal or better classification performance compared to
models documented in literature, while only requiring a minimum of specification effort from the research analyst.

We expect that this new approach will allow researchers to focus a greater portion of their efforts on the subject matter
of their studies, e.g. the biological interpretation, and less on the technicalities of statistical models. Furthermore, the
sheer speed of model creation facilitates a much faster and broader review of existing research data, perhaps leading to
new insights. For instance, users of BayesiaLab should be able to replicate all modeling steps in this case study within a
few hours.

Finally, we should emphasize that our case study is focused exclusively on the modeling aspect of this subject matter,
without providing any medical or biological interpretations. It is not our objective to make a contribution to the medical
literature, but rather to showcase a new computational method on the basis of real-world data from the field of
biostatistics. Hence, any medical references in this paper are paraphrased from existing research to provide context for
the reader.

[1] http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43


About the Authors

Stefan Conrady
Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting
firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied
Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America.

Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product
planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to
establishing his own firm, he was heading the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe
Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science
and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing
BayesiaLab since 1999 and it has emerged as the leading software package for knowledge discovery, data mining and
knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as
in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is
highlighted by Bayesia’s strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.







Case Study & Tutorial

Background
To provide the correct medical context for our white paper, we quote Golub et al. (1999), who conducted the original
research and described a new cancer classification approach based on global gene expression analysis:

           The challenge of cancer treatment has been to target specific therapies to pathogenetically distinct tumor types,
           to maximize efficacy and minimize toxicity. Improvements in cancer classification have thus been central to
           advances in cancer treatment. Cancer classification has been based primarily on morphological appearance of
           the tumor, but this has serious limitations. Tumors with similar histopathological appearance can follow
           significantly different clinical courses and show different responses to therapy.

           For […] tumors, important subclasses are likely to exist but have yet to be defined by molecular markers. For
           example, prostate cancers of identical grade can have widely variable clinical courses, from indolence over
           decades to explosive growth causing rapid patient death. Cancer classification has been difficult in part because
           it has historically relied on specific biological insights, rather than systematic and unbiased approaches for
           recognizing tumor subtypes.

           Although the distinction between AML[2] and ALL[3] has been well established, no single test is currently
           sufficient to establish the diagnosis. Rather, current clinical practice involves an experienced hematopathologist’s
           interpretation of the tumor’s morphology,[4] histochemistry, immunophenotyping, and cytogenetic analysis, each
           performed in a separate, highly specialized laboratory. Although usually accurate, leukemia classification
           remains imperfect and errors do occur.

We will use the Golub dataset and create a new classification model within the framework of Bayesian networks as an
alternative to models already documented in the relevant literature. An extensive survey of existing models is provided
in Dudoit et al. (2002).


Database
The Golub study is based on two datasets containing an initially available training set of 38 samples and a second set of
34 samples, which subsequently served as a test set. These datasets contain measurements corresponding to ALL and
AML samples taken from bone marrow and peripheral blood of leukemia patients. For details about the experimental
method and microarray analysis protocol, readers are referred to the original paper.

For all sample cases in the test and training set, we have 7,129 variables, each representing the expression level of an
individual gene. The expression levels are recorded as continuous numerical values. The dependent variable is Leukemia
Type, which can have one of two categorical states, “AML” and “ALL.” Furthermore, the variable Data Type indicates
whether a sample belongs to the Test or the Training set. Finally, Case ID serves as the Row Identifier.




[2] AML: Acute Myelogenous Leukemia
[3] ALL: Acute Lymphoblastic Leukemia
[4] An image-based morphology analysis of cell samples was presented in our recent white paper, Breast Cancer
Diagnostics with Bayesian Networks.





With the benefit of having both datasets available simultaneously at the time of our study, we will take advantage of the
additional observations in Golub’s test set.[5] Otherwise, we use these data sets in their original format without any
further transformation, so our results can be compared to results of earlier studies.


Notation
To clearly distinguish between natural language, software-specific functions and study-specific variable names, the
following notation is used:

• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.

• Attribute/variable/node names are italicized.



Classification Model

Data Import
Our modeling process begins with importing the database from a CSV format into BayesiaLab. The Data Import Wizard
guides the analyst through the required steps.




It is common practice for gene expression variable names to be recorded as row headers, while cases are identified by
column headers, which is just the opposite of the typical arrangement of research data. To accommodate this alternative
format, we can check the Transpose box and thus achieve compatibility with the Data Import Wizard.
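For readers who prepare such files outside BayesiaLab, the effect of the Transpose option can be sketched in a few lines of Python. This is only an illustration of the row/column swap, not BayesiaLab's import code; the miniature table and its expression values are made up, and only the variable names are taken from the text.

```python
import csv
import io

# Hypothetical miniature of a gene-by-sample file: variables in rows, cases in columns.
# The numeric values are invented for illustration.
raw = """Case ID,1,2,3
Leukemia Type,ALL,ALL,AML
AFFX-BioB-5_at,-214,-139,-76
Z78285_f_at,-37,-14,-41
"""

rows = list(csv.reader(io.StringIO(raw)))

# zip(*rows) turns the i-th column into the i-th row, i.e. it transposes the table.
transposed = [list(column) for column in zip(*rows)]

# After transposing, the first row holds the variable names and each
# remaining row is one case.
header, cases = transposed[0], transposed[1:]
for case in cases:
    print(dict(zip(header, case)))
```

After the swap, each case occupies one row and each gene one column, which is the arrangement most analysis tools expect.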

For the next step, we need to identify Case ID as Row Identifier. Given their values, BayesiaLab will automatically
recognize Leukemia Type as discrete and all gene variables (AFFX-BioB-5_at through Z78285_f_at) as continuous. As we
will use the training and test set combined, we declare the variable Data Type as Not Distributed.




[5] We provide a direct comparison of model performance, based on 38 samples only, in the appendix.






The following step in the Data Import Wizard normally defines missing values processing, which is not required in our
case. So, the following screen can be skipped entirely by clicking Next.




The following step, however, is critical in the import process. It defines the Discretization and Aggregation process,
which is necessary because, in BayesiaLab, Bayesian networks are only defined for discrete states.

As the exclusive objective of this model is classification, we will choose the Decision Tree algorithm, which will
discretize each variable for an optimum information gain with respect to the target variable Leukemia Type. This is
particularly helpful, as we do not have any a-priori knowledge about the relevance or the meaning of any of the
expression levels.
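The idea of discretizing for information gain can be illustrated with a minimal Python sketch. It is not BayesiaLab's Decision Tree algorithm, whose internals the text does not describe; it only searches for the single threshold on one variable that maximizes the information gain with respect to a binary target. The function names and the sample values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) of the best binary split.

    Returns (None, 0.0) when no split improves on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best


# Hypothetical expression levels of one gene across six samples.
levels = [120, 150, 180, 900, 1100, 1300]
types = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
threshold, gain = best_threshold(levels, types)
print(threshold, gain)
```

When no threshold yields a positive gain, the sketch returns `(None, 0.0)`; a supervised discretization has nothing to work with in that situation, which is why a fallback method is needed for some variables.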








Theoretically, we could define the discretization algorithm for each variable individually, but with over 7,000 variables
this is obviously not practical. So we will click Select All Continuous and then pick Decision Tree from the drop-down
menu.




Furthermore, we need to define the maximum number of interval levels for the discretization. Given the very small
number of observations in the dataset, we recommend no more than 3 discretization levels, although the algorithm may
subsequently further reduce the number of intervals to 2 for individual variables.




Upon clicking Finish, BayesiaLab will proceed with the import process and report its progress via a status bar.




However, we will soon receive a warning that the “Decision Tree 3” discretization was not possible for one of the
variables. This means that this variable could not be discretized in a way that would optimize the information gain for the
target variable. As a fallback option, we will choose the K-Means Discretization algorithm with 3 intervals. As this
situation might apply to other variables as well, we can check Remember My Choice.
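As an illustration of what an unsupervised, K-Means-style discretization does, the following Python sketch clusters the values of one variable into three groups and places the interval boundaries midway between neighbouring cluster centers. This is a generic one-dimensional version of Lloyd's algorithm under our own assumptions (deterministic initialization, midpoint thresholds); the text does not specify how BayesiaLab implements K-Means Discretization, and the sample values are made up.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Lloyd's algorithm on a list of numbers; returns the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break  # converged
        centers = updated
    return sorted(centers)


def thresholds(centers):
    """Interval boundaries are the midpoints between neighbouring centers."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


# Hypothetical expression levels forming three loose groups.
levels = [10, 12, 14, 100, 104, 108, 500, 510, 520]
centers = kmeans_1d(levels)
print(centers, thresholds(centers))
```

Unlike the supervised approach, this method never looks at Leukemia Type, so it always produces intervals, even for variables that carry no information about the target.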




The analyst can observe the status of the discretization via a progress bar.




Upon completion, we have the option of displaying the Import Report, which will display the final discretization type
for every single variable in the database.







The following screenshot shows 7 out of 7,130 variables with the associated intervals and the type of discretization
obtained (red indicates Decision Tree Discretization, blue indicates K-Means Discretization).




Upon closing the report, we will see a representation of the newly imported database in the form of a fully unconnected
Bayesian network. Each variable is now represented as a blue node in the Graph Panel of BayesiaLab. Given the large
number of variables, we will only show a small portion of them in the screenshot below.








It is also good practice to “spot check” some of the newly created nodes. We can do that by simply double-clicking on
any node in the network. The Node Editor will open up and allow the analyst to review and edit any of the properties, if
necessary.




For instance, we may have some a-priori expert knowledge that one particular variable can only assume either one of
two states, e.g. high/low, and that there is a specific threshold, which separates those two states. The Node Editor allows
us to change the discretization thresholds via point-and-click directly in the probability density chart of the variable. Just
for illustration purposes, we have arbitrarily created a discretization threshold at the median value in the following
screenshot. We have the ability to return to the Node Editor at any time and re-discretize variables (or change their other
attributes) as needed.




Supervised Learning: Augmented Markov Blanket
As the starting point for the generation of our Bayesian network model, we will define Leukemia Type as the Target
Variable. This can be done by right-clicking the Leukemia Type node and selecting Set As Target Node. Double-clicking the
node while pressing the “T” key will do the same.




Beyond the ability to predict class membership of future samples based on their expression levels, we are very interested
in finding a manageable subset of variables that can be used as predictors, i.e. we want to perform a feature selection.

The Markov Blanket algorithm is suitable for this kind of application and its speed is particularly helpful when dealing
with thousands of variables.[6] Furthermore, BayesiaLab offers the Augmented Markov Blanket, which starts with the
Markov Blanket structure and then uses an unsupervised search to find the probabilistic relations that hold among the
variables belonging to the Markov Blanket.[7] This unsupervised search requires additional computation time but
generally results in an improved predictive performance of the model.

[6] See appendix for a definition of the Markov Blanket.

The learning process can be started by selecting Learning>Target Node Characterization>Augmented Markov Blanket
from the menu.




After a few seconds, we will see the result of the machine learning process. Our Target Node Leukemia Type is now
connected to all variables in its Markov Blanket.




To show these connections, we need to zoom out, and at this level the individual nodes are barely visible dots on the
screen. However, the connections in the Markov Blanket are very prominent and we can see that only 55 variables out
of over 7,000 were selected as predictors.


[7] Intuitively, the “augmented” part of the network plays the same role as the interaction terms in a regression.





Performance Evaluation
In order to see whether this selection proves to be adequate for classification purposes, we switch into the Validation
Mode by pressing the F5 key and start the performance evaluation.

As we do not have a separate test and training set, we will need to use Cross-Validation for evaluation purposes.
Cross-Validation is a technique for assessing how the predictions of a model will generalize to an independent data set.
One round of Cross-Validation involves partitioning a sample of data into complementary subsets, estimating the model
on one subset, and then validating the analysis on the other subset. To reduce variability, multiple rounds of
Cross-Validation are performed using different partitions, and the validation results are averaged over the rounds.
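The partitioning logic described above can be sketched in Python as follows. This is a generic k-fold scheme, not BayesiaLab's implementation; the number of folds (10), the shuffling seed and the function name are our own assumptions, since the text does not state which settings were used.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k complementary partitions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 72 samples as in the combined data set; k = 10 is an arbitrary choice here.
splits = list(k_fold_indices(72, 10))
for train, test in splits:
    assert not set(train) & set(test)      # the two subsets are complementary
    assert len(train) + len(test) == 72    # every sample is used in each round
print([len(test) for _, test in splits])
```

Each sample appears in exactly one test fold, so every case is predicted once by a model that was estimated without it.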

To start the process, we select Tools>Cross Validation>Targeted from the menu.




The performance report shows that, of the 47 ALL cases, 45 were correctly classified and, of the 25 AML cases, 24
were correctly identified. This yields a total precision of almost 96%.
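The reported figure follows directly from the counts given above. Written out, with the per-class rates added for reference:

```latex
\[
\text{total precision} = \frac{45 + 24}{47 + 25} = \frac{69}{72} \approx 95.8\%,
\qquad
\frac{45}{47} \approx 95.7\% \ \text{(ALL)},
\qquad
\frac{24}{25} = 96.0\% \ \text{(AML)}
\]
```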




Network Complexity
Beyond precision, parsimony is a key objective in most modeling tasks. This need for simplicity is particularly obvious in
the presence of thousands of potential predictors. Golub’s initial model used as predictors the 50 genes that were
most correlated with the AML-ALL distinction. Our Augmented Markov Blanket model found 55 predictors, and, as it
turns out, 14 out of the 55 predictors are in common with Golub’s list of the 50 most correlated variables.

Although the selection of 55 predictors would perhaps be sufficient for a practical application, we can investigate
whether a selection of fewer predictors can still yield reliable results. BayesiaLab allows us to manage network
complexity via the Structural Coefficient (SC). By default, the value of SC is set to 1; however, we can increase its value to
force a simpler network structure, which, in the case of our model, would mean fewer predictors. We will set SC
arbitrarily to 2 and use the Augmented Markov Blanket learning again.
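The text does not spell out how SC enters the learning score. One common way to think about it, which we state here purely as an assumption for illustration, is as a weight on the structural part of a minimum description length (MDL) score. The symbols B (network), D (data), DL (description length) and α (the Structural Coefficient) are introduced only for this sketch:

```latex
\[
\mathrm{MDL}(B, D) = \alpha \cdot \mathrm{DL}(B) + \mathrm{DL}(D \mid B), \qquad \alpha = \mathrm{SC}
\]
```

Under this reading, raising α above 1 makes every additional arc more expensive relative to the fit it buys, so the learning algorithm retains fewer arcs and, in a Markov Blanket model, fewer predictors.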

The resulting network now only contains 16 predictors:




To evaluate the performance of this network we will repeat the Cross-Validation.








Interestingly, despite the smaller number of predictors, the classification performance has actually improved slightly and
now stands at 97%.

Given this performance of a simpler model, we may wish to increase the SC further and see whether we will still have a
reasonable model. Outside a case study and in a real-world situation, the analyst would have to make the determination
as to how far to take this. Our pursuit of an even simpler structure only serves demonstration purposes.

We now set SC=3 and once again learn the Augmented Markov Blanket. The number of predictors is now narrowed
down to 5.








Repeating the Cross-Validation shows that the precision remains at the same level as before, i.e. at 97%.







At first glance it might be tempting to choose this very simple model with only five predictors for practical application.
However, practical considerations may actually suggest otherwise. Given the delicate nature of microarray analysis,
missing values, measurement errors, etc. are to be expected. A model with more predictors, including redundant
predictors, will be less sensitive to noise and thus more robust overall.



Inference
With the small number of predictors in our most recent network, we can inspect them more closely and review their
states in the Monitor Panel. We can display their Monitors by first selecting the nodes in the Graph Panel and then
right-clicking and choosing Monitor from the contextual menu.




We can now see their states, the thresholds between the states and their marginal distribution.




So far, we have only made inference about the state of Leukemia Type based on the states of the predictors, i.e.
P(Leukemia Type | gene).[8] We can use a key property of Bayesian networks here, namely omnidirectional inference. This
allows us to compute the posterior probability of the states of the predictors, given the state of Leukemia Type:
P(gene | Leukemia Type).
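The two directions of inference are tied together by Bayes' rule. Written for the quantities named above (this is the general identity, not a description of the specific propagation algorithm BayesiaLab uses):

```latex
\[
P(\text{gene} \mid \text{Leukemia Type})
  = \frac{P(\text{Leukemia Type} \mid \text{gene}) \, P(\text{gene})}{P(\text{Leukemia Type})}
\]
```

Because the network encodes the joint distribution of all its variables, either conditional can be computed from the same model, with no need to estimate a second one.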




[8] gene = (e1, e2, e3, …, en), i.e. a vector consisting of n expression levels. Here, n is the number of predictors.





For Leukemia Type=ALL, we obtain the following posterior distributions:




Conversely, for Leukemia Type=AML, we obtain these posterior distributions:




The very manageable number of predictors certainly makes it easy to further examine their specific roles in the
classification model. BayesiaLab provides a number of tools that assist with interpreting the variables.


Target Dynamic Profile
The Target Dynamic Profile function is typically used to search for the states of predictor variables that optimize the
desired state of the target variable. In our case, however, we are interested in those states of the predictor variables
that provide the greatest amount of information for distinguishing between the states AML and ALL of the Leukemia
Type variable.

The function can be invoked by selecting Analysis>Reports>Target Dynamic Profile from the menu.







In the following dialogue we select Probability, Criterion Maximization, Take Into Account the Joint Probability and
Hard Evidence.




The first result shows ALL vs. AML. This means that observing the lowest state (1/3) of M23197 would change the
probability of ALL from 65% (a priori) to 97.5% (a posteriori).

Conversely, the probability of AML increases from 34% (a priori) to 91.5% (a posteriori), given that the higher state
(2/2) of X95735 is observed. A second piece of evidence, the higher state (2/2) of U46499, would further increase the AML
probability to 99%.








Bayes Factor
BayesiaLab offers an additional metric for interpreting the impact of observing a specific piece of evidence, namely by
providing values of the Bayes Factor for each variable.

We define the Bayes Factor here as

$$K = \log_2\left(\frac{P(H \mid E)}{P(H)}\right),$$

which, from a statistical perspective, can be seen as the strength of the observed evidence E with respect to the
hypothesis H. From an information theory perspective, the Bayes Factor quantifies the change in the uncertainty about
the state of the variable once we have the evidence E.

For instance, after observing the evidence of M23197 = 1/3, which is the first of three states (see green bar in the
associated Monitor), we can observe the impact on all the other variables by selecting Analysis>Report>Evidence Analysis.




The rightmost column shows the Bayes Factor. For the state ALL of variable Leukemia Type it is 0.5832 and for state
AML it is -3.7942. This means that this observation provides much stronger evidence against AML than it provides
evidence for ALL.
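Using the definition K = log2(P(H|E) / P(H)), these two values can be turned back into posterior probabilities. The short Python sketch below does this as a plausibility check. The prior of 0.65 for ALL is the rounded figure quoted in the text, and 0.35 for AML is simply its complement, which is our assumption; the results are therefore approximate.

```python
import math


def bayes_factor(prior, posterior):
    """K = log2(P(H|E) / P(H)), following the definition used in the text."""
    return math.log2(posterior / prior)


def posterior_from_factor(prior, k):
    """Invert the definition: P(H|E) = P(H) * 2**K."""
    return prior * 2 ** k


# Rounded prior for ALL as quoted in the text; the AML prior is assumed
# to be its complement.
prior_all, prior_aml = 0.65, 0.35

post_all = posterior_from_factor(prior_all, 0.5832)
post_aml = posterior_from_factor(prior_aml, -3.7942)
print(round(post_all, 3), round(post_aml, 3), round(post_all + post_aml, 3))
```

The two posteriors come out near 0.97 and 0.03 and sum to approximately 1, as they must for a binary target. The asymmetry between the two factors also becomes clear: ALL can only rise from about 0.65 toward 1, which caps its factor below log2(1/0.65) ≈ 0.62, whereas AML can fall by orders of magnitude.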

As we noted earlier, inference within Bayesian networks is always omnidirectional, so we obtain Bayes Factors for all
other variables in the network.


Target Interpretation Tree
So, observing the specific states of any of these genes should update one’s belief about the likely state of Leukemia Type.
BayesiaLab can also represent this sequence of “belief updating given evidence” in the form of a tree. We can generate
such a tree by selecting Analysis>Target’s Interpretation Tree.




To show this Target Interpretation Tree, we have limited the number of pieces of evidence to three and omitted the rest,
so the tree can fit on one page and still remain legible.

By reading the tree from left to right, we can see how each piece of evidence updates the probabilities of AML versus
ALL. For instance, starting at the root and observing the middle value (2/3) for M23197, we would follow the center
path and see that our new conditional probability for ALL is now 40.6%, i.e. it is now lower than the original marginal
probability of 65%. Observing the next evidence, U46499, can change the picture again. Given the lower value (1/2) for
U46499, the conditional probability for ALL would increase to 98.5%, and so on.




Conclusion
We have demonstrated that, with Bayesian networks as the framework and BayesiaLab as a software tool, we can
rapidly generate a reliable classification model on the basis of gene expression data.

Beyond the good classification performance, the Augmented Markov Blanket model provides an effective means of
feature selection among thousands of potential predictors, with a minimum of specification effort. Furthermore, the
extreme speed of feature selection allows the researcher to move quickly from data acquisition to interpretation.

Finally, the user-friendly interface of BayesiaLab makes the required workflow easily accessible to any subject matter
expert, without requiring them to write any program code, which is typically necessary in this domain.







Appendix

Markov Blanket
In many cases, the Markov Blanket algorithm is a good starting point for any predictive model, whether used for scoring
or classification. This algorithm is extremely fast and can even be applied to databases with thousands of variables and
millions of records.

The Markov Blanket for a node A is the set of nodes composed of A’s parents, its children, and its children’s other
parents (=spouses).




The Markov Blanket of the node A contains all the variables which, if we know their states, will shield the node A from
the rest of the network. This means that the Markov Blanket of a node is the only knowledge needed to predict the
behavior of that node. Learning a Markov Blanket selects relevant predictor variables, which is particularly helpful
when there is a large number of variables in the database. In fact, this can also serve as a highly efficient variable
selection method in preparation for other types of modeling, outside the Bayesian network framework.
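The definition given above (parents, children, and the children's other parents) translates directly into code once a network structure is known. The following Python sketch reads the Markov Blanket off a given directed acyclic graph; it does not learn the structure from data, which is what the Markov Blanket algorithm in BayesiaLab does. The graph and its node names are hypothetical.

```python
def markov_blanket(parents, node):
    """Return the Markov Blanket of `node`.

    `parents` maps every node of a DAG to the set of its parents.
    """
    children = {n for n, ps in parents.items() if node in ps}
    # Spouses: the other parents of the node's children.
    spouses = {p for child in children for p in parents[child]} - {node}
    return set(parents.get(node, set())) | children | spouses


# Hypothetical DAG with arcs X -> P, P -> A, A -> C, S -> C, C -> G.
dag = {
    "A": {"P"},
    "C": {"A", "S"},
    "G": {"C"},
    "P": {"X"},
    "S": set(),
    "X": set(),
}
print(sorted(markov_blanket(dag, "A")))
```

In this example the blanket of A consists of its parent P, its child C and the spouse S. The grandparent X and the grandchild G are excluded: once P, C and S are known, they carry no further information about A.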







Comparison of Classification Performance with Golub et al. (1999)
In order to facilitate a direct comparison of our approach with the performance of the Golub model, we now constrain
our dataset to the initial 38 cases, i.e. Golub’s training set. Based on this smaller dataset, we learn the Augmented
Markov Blanket, which selects 20 variables as predictors.




Subsequently, we follow Golub’s approach of applying a leave-one-out cross-validation for performance evaluation.
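For readers unfamiliar with the procedure, leave-one-out cross-validation can be sketched as follows: each case is held out once, the model is learned from the remaining cases, and the held-out case is then classified. The sketch below is a generic illustration only. The nearest-class-mean classifier is a simple stand-in and not the Augmented Markov Blanket; all function names and the seven data points are hypothetical.

```python
def loo_cross_validation(samples, labels, fit, predict):
    """Hold out each case once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct, len(samples)

# Stand-in classifier (assumption): nearest class mean on a single feature.
def fit_centroids(xs, ys):
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def predict_centroid(centroids, x):
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Made-up expression values for illustration only.
toy_x = [0.1, 0.3, 0.2, 0.9, 1.1, 1.0, 0.65]
toy_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML", "ALL"]

correct, total = loo_cross_validation(toy_x, toy_y, fit_centroids, predict_centroid)
print(f"{correct} of {total} held-out cases classified correctly")
```

With 38 cases, this procedure learns 38 models, each from 37 cases, which is practical only because the learning step is fast.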








We obtain a 97% precision, correctly classifying 37 out of the 38 cases. The Golub model correctly identified 36 cases
and declared the remaining two cases as uncertain. Within our model, we cannot detect any “borderline” cases in terms
of classification strength, which means that one case was indeed misclassified.
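As a quick check of the arithmetic behind these figures, the proportions can be computed directly. Treating Golub's two uncertain cases as not correctly classified is an assumption made here for the sake of comparison:

```latex
\frac{37}{38} \approx 0.974 \quad \text{(this model)}, \qquad
\frac{36}{38} \approx 0.947 \quad \text{(Golub model, uncertain cases counted as not classified)}
```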







References
Conrady, Stefan, and Lionel Jouffe. “Breast Cancer Diagnostics with Bayesian Networks.” Conrady Applied Science,
    LLC, March 5, 2011. http://www.conradyscience.com/index.php/wbcd.
Dudoit, S., J. Fridlyand, and T. P. Speed. “Comparison of Discrimination Methods for the Classification of Tumors Using
    Gene Expression Data.” Journal of the American Statistical Association 97, no. 457 (2002): 77–87.
Friedman, N., M. Linial, I. Nachman, and D. Pe’er. “Using Bayesian Networks to Analyze Expression Data.” Journal of
    Computational Biology 7, no. 3-4 (2000): 601–620.
Gentleman, Robert. “Reproducible Research: A Bioinformatics Case Study.” Statistical Applications in Genetics and
    Molecular Biology 4, no. 1 (2005). http://www.bepress.com/sagmb/vol4/iss1/art2.
Golub, T. R., et al. “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression
    Monitoring.” Science 286, no. 5439 (1999): 531–537.
Husmeier, Dirk. “Bayesian Networks for Analysing Gene Expression Data.” Bayesian Networks for Analysing Gene
    Expression Data, August 2001. http://www.bioss.ac.uk/~dirk/essays/GeneExpression/bayes_net.html.
“Markov Blanket.” Wikipedia. http://en.wikipedia.org/wiki/Markov_blanket.
Slonim, Donna K., Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, and Eric S. Lander. “Class Prediction and
    Discovery Using Gene Expression Data.” (2000): 263–272.
Torgo, Luis. Data Mining with R: Learning with Case Studies. 1st ed. Chapman and Hall/CRC, 2010.







Contact Information

Conrady Applied Science, LLC
312 Hamlet’s End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com




Copyright
© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.

Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:

• You may print or download this document for your personal and noncommercial use only.

• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady
  Applied Science, LLC and Bayesia SAS as the source of the material.

• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may
  you transmit it or store it in any other website or other form of electronic retrieval system.





Microarray Analysis with BayesiaLab

  • 1. Microarray Analysis with Bayesian Networks Using BayesiaLab for Cancer Type Classi cation Stefan Conrady, stefan.conrady@conradyscience.com Dr. Lionel Jouffe, jouffe@bayesia.com March 15, 2011 Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting
  • 2. Introduction to Bayesian Networks Table of Contents Introduction About the Authors 2 Stefan Conrady 2 Lionel Jouffe 2 Case Study & Tutorial Background 3 Database 3 Notation 4 Classi cation Model 4 Data Import 4 Supervised Learning: Augmented Markov Blanket 10 Performance Evaluation 12 Network Complexity 12 Inference 16 Target Dynamic Pro le 17 Bayes Factor 19 Target Interpretation Tree 20 Conclusion 21 Appendix 23 Markov Blanket 23 Comparison of Classi cation Performance with Golub et al. (1999) 24 References 26 Contact Information 27 Conrady Applied Science, LLC 27 Bayesia SAS 27 Copyright 27 www.conradyscience.com | www.bayesia.com i
  • 3. Microarray Analysis with Bayesian Networks and BayesiaLab Introduction In our recent white paper about breast cancer classi cation, we have used Bayesian networks and BayesiaLab for feature identi cation and prediction of class membership. This study was based on 569 cases and 10 attribute variables, which allowed estimating a classi cation model with a very high prediction accuracy. In this new study, we turn to the eld of cancer classi cation by means of microarray analysis. Microarray analysis is a technique for gene expression pro ling of cell samples. Expression pro les indicate which genes are currently active among thousands of genes. The activation of certain genes can indicate the type and the current state of a cell. In our case, we want to use the expression pro les of cell samples from cancer patients to distinguish between different types of leukemia. Leukemia is a type of cancer of the blood or bone marrow characterized by an abnormal increase of white blood cells. Clinically and pathologically, leukemia can be divided into a number of groups, of which we will ex- amine two types of acute leukemia, namely acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML). The correct classi cation of the subgroup of leukemia is critical for the selection of the most ef cient therapy, which may include chemotherapy and radiation, and for minimizing side effects. In general, the progress in correct cancer clas- si cation in recent years has been crucial for improving the overall treatment success. One of the challenges in microarray analysis is the sheer number of genes, which could potentially be predictors in a classi cation model. At the same time, the number of observations tends to be small. So, it is not uncommon to have thousands of predictors while only having a few dozens of samples. It is precisely the opposite of what one would hope to have for a traditional statistical analysis. 
As a result, many new statistical techniques have emerged in recent decades and one of them is described in detail in Golub et al. (1999). This study demonstrates that cancer classi cation is feasible on the basis of gene expression data alone. Since its publication, it has been widely cited and further disseminated, e.g in Slonim et. al (2000) and Dudoit et al. (2002). Also, the underlying dataset has been made publicly available to any interested researcher by the Broad Institute.1 Given the seminal nature of the Golub study and its excellent pedagogical qualities, we have chosen it as our reference point for a new case study and BayesiaLab tutorial. Our objective is to show that our modeling approach with Bayesian networks (as the framework) and BayesiaLab (as the software tool) can quickly and effectively generate models of equal or better classi cation performance compared to models documented in literature, while only requiring a minimum of speci cation effort from the research analyst. We expect that this new approach will allow researchers to focus a greater portion of their efforts on the subject matter of their studies, e.g. the biological interpretation, and less on the technicalities of statistical models. Furthermore, the sheer speed of model creation facilitates a much faster and broader review of existing research data, perhaps leading to new insights. For instance, users of BayesiaLab should be able to replicate all modeling steps in this case study within a few hours. Finally, we should emphasize that our case study is focused exclusively on the modeling aspect of this subject matter, without providing any medical or biological interpretations. It is not our objective to make a contribution to the medical 1 http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43 www.conradyscience.com | www.bayesia.com 1
  • 4. Microarray Analysis with Bayesian Networks and BayesiaLab literature, but rather to showcase a new computational method on the basis real-world data from the eld of biostatis- tics. Hence, any medical references in this paper are paraphrased from existing research to provide context for the reader. About the Authors Stefan Conrady Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting rm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America. Stefan Conrady studied Electrical Engineering and has extensive management experience in the elds of product plan- ning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to estab- lishing his own rm, he was heading the Analytics & Forecasting group at Nissan North America. Lionel Jouffe Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the eld of Arti cial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999 and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is high- lighted by Bayesia’s strategic partnership with Procter & Gamble, who has deployed BayesiaLab globally since 2007. www.conradyscience.com | www.bayesia.com 2
  • 5. Microarray Analysis with Bayesian Networks and BayesiaLab Case Study & Tutorial Background To provide the correct medical context for our white paper, we quote Golub et al. (1999), who conducted the original research and described a new cancer classi cation approach based on global gene expression analysis: The challenge of cancer treatment has been to target speci c therapies to pathogenetically distinct tumor types, to maximize ef cacy and minimize toxicity. Improvements in cancer classi cation have thus been central to advances in cancer treatment. Cancer classi cation has been based primarily on morphological appearance of the tumor, but this has serious limitations. Tumors with similar histopathological appearance can follow sig- ni cantly different clinical courses and show different responses to therapy. For […] tumors, important subclasses are likely to exist but have yet to be de ned by molecular markers. For example, prostate cancers of identical grade can have widely variable clinical courses, from indolence over dec- ades to explosive growth causing rapid patient death. Cancer classi cation has been dif cult in part because it has historically relied on speci c biological insights, rather than systematic and unbiased approaches for recog- nizing tumor subtypes. Although the distinction between AML2 and ALL3 has been well established, no single test is currently suf - cient to establish the diagnosis. Rather, current clinical practice involves an experienced hematopathologist’s interpretation of the tumor’s morphology,4 histochemistry, immunophenotyping, and cytogenetic analysis, each performed in a separate, highly specialized laboratory. Although usually accurate, leukemia classi cation re- mains imperfect and errors do occur. We will use the Golub dataset and create a new classi cation model within the framework of Bayesian networks as an alternative to models already documented in the relevant literature. 
An extensive survey of existing models is provided in Dudoit et al. (2002). Database The Golub study is based on two datasets containing an initially available training set of 38 samples and a second set of 34 samples, which subsequently served as a test set. These datasets contain measurements corresponding to ALL and AML samples taken from bone marrow and peripheral blood of leukemia patients. For details about the experimental method and microarray analysis protocol, readers are referred to the original paper. For all sample cases in the test and training set, we have 7,129 variables, each representing the expression level of an individual gene. The expression levels are recorded as continuous numerical values. The dependent variable is Leukemia Type, which can have one of two categorical states, “AML” and “ALL.” Furthermore, the variable Data Type indicates whether a sample belongs to the Test or the Training set. Finally, Case ID serves a Row Identi er. 2 AML: Acute Myelogenous Leukemia 3 ALL: Acute Lymphoblastic Leukemia 4 An image-based morphology analysis of cell samples was presented in our recent white paper, Breast Cancer Diagnos- tics with Bayesian Networks. www.conradyscience.com | www.bayesia.com 3
  • 6. Microarray Analysis with Bayesian Networks and BayesiaLab With the bene t of having both datasets available simultaneously at the time of our study, we will take advantage of the additional observations in Golub’s test set.5 Otherwise, we use these data sets in their original format without any fur- ther transformation, so our results can be compared to results of earlier studies. Notation To clearly distinguish between natural language, software-speci c functions and study-speci c variable names, the fol- lowing notation is used: • BayesiaLab-speci c functions, keywords, commands, etc., are capitalized and shown in bold type. • Attribute/variable/node names are italicized. Classi cation Model Data Import Our modeling process begins with importing the database from a CSV format into BayesiaLab. The Data Import Wizard guides the analyst through the required steps. It is common practice for gene expression variable names to be recorded as row headers, while cases are identi ed by column headers, which is just the opposite of the typical arrangement of research data. To accommodate this alternative format, we can check the Transpose box and thus achieve compatibility with the Data Import Wizard. For the next step, we need to identify Case ID as Row Identi er. Given their values, BayesiaLab will automatically rec- ognize Leukemia Type as discrete and all gene variables (AFFX-BioB-5_at through Z78285_f_at) as continuous. As we will use the training and test set combined, we declare the variable Data Type as Not Distributed. 5 We provide a direct comparison of model performance, based on 38 samples only, in the appendix. www.conradyscience.com | www.bayesia.com 4
  • 7. Microarray Analysis with Bayesian Networks and BayesiaLab The following step in the Data Import Wizard normally de nes missing values processing, which is not required in our case. So, the following screen can be skipped entirely by clicking Next. The following step, however, is critical in the import process. It de nes the Discretization and Aggregation process, which is necessary, as in BayesiaLab Bayesian networks are only de ned for discrete states. As the exclusive objective of this model is classi cation, we will choose the Decision Tree algorithm, which will discre- tize each variable for an optimum information gain with respect to the target variable Leukemia Type. This is particu- larly helpful, as we do not have any a-priori knowledge about the relevance or the meaning of any of the expression levels. www.conradyscience.com | www.bayesia.com 5
  • 8. Microarray Analysis with Bayesian Networks and BayesiaLab Theoretically, we could de ne the discretization algorithm for each variable individually, but with over 7,000 variables this is obviously not practical. So we will click Select All Continuous and the pick Decision Tree from the drop-down menu. Furthermore, we need to de ne the maximum number of interval levels for the discretization. Given the very small number of observations in the dataset, we recommend no more than 3 discretization levels, although the algorithm may subsequently further reduce the number of intervals to 2 for individual variables. Upon clicking Finish, BayesiaLab will proceed with the import process and report its progress via a status bar. However, we will soon receive a warning that the “Decision Tree 3” discretization was not possible for one of the vari- ables. This means that this variable could not be discretized in a way that would optimize the information gain for the www.conradyscience.com | www.bayesia.com 6
  • 9. Microarray Analysis with Bayesian Networks and BayesiaLab target variable. As a fallback option, we will choose the K-Means Discretization algorithm with 3 intervals. As this situa- tion might apply to other variables as well, we can check Remember My Choice. The analyst can observe the status of the discretization via a progress bar. Upon completion, we have the option of displaying the Import Report, which will display the nal discretization type for every single variable in the database. www.conradyscience.com | www.bayesia.com 7
  • 10. Microarray Analysis with Bayesian Networks and BayesiaLab The following screenshot shows 7 out of 7,130 variables with the associated intervals and the type of discretization ob- tained (red indicates Decision Tree Discretization, blue indicates K-Means Discretization). Upon closing the report, we will see a representation of the newly imported database in the form of a fully unconnected Bayesian network. Each variable is now represented as a blue node in the Graph Panel of BayesiaLab. Given the large number of variables, we will only show a small portion of them in the screenshot below. www.conradyscience.com | www.bayesia.com 8
  • 11. Microarray Analysis with Bayesian Networks and BayesiaLab It is also good practice to “spot check” some of the newly created nodes. We can do that by simply double-clicking on any node in the network. The Node Editor will open up and allow the analyst to review and edit any of the properties, if necessary. For instance, we may have some a-priori expert knowledge that one particular variable can only assume either one of two states, e.g. high/low, and that there is a speci c threshold, which separates those two states. The Node Editor allows us to change the discretization thresholds via point-and-click directly in the probability density chart of the variable. Just for illustration purposes, we have arbitrarily created a discretization threshold at the median value in the following www.conradyscience.com | www.bayesia.com 9
  • 12. Microarray Analysis with Bayesian Networks and BayesiaLab screenshot. We have the ability to return to the Node Editor at any time and re-discretize (or change other attributes) of variables as needed. Supervised Learning: Augmented Markov Blanket As the starting point for the generation of our Bayesian network model, we will de ne Leukemia Type as the Target Variable. This can be done by right-clicking the Leukemia Type and selecting Set As Target Node. Double-clicking the node while pressing the “T” key will do the same. Beyond the ability to predict class membership of future samples based on their expression levels, we are very interested in nding a manageable subset of variables that can be used as predictors, i.e. we want perform a feature selection. The Markov Blanket algorithm is suitable for this kind of application and its speed is particularly helpful when dealing with thousands of variables.6 Furthermore, BayesiaLab offers the Augmented Markov Blanket, which starts with the Markov Blanket structure and then uses an unsupervised search to nd  the probabilistic relations that hold between 6 See appendix for a de nition of the Markov Blanket www.conradyscience.com | www.bayesia.com 10
  • 13. Microarray Analysis with Bayesian Networks and BayesiaLab each variable belonging to the Markov Blanket.7 This unsupervised search requires additional computation time but generally results in an improved predictive performance of the model. The learning process can be started by selecting Learning>Target Node Characterization>Augmented Markov Blanket from the menu. After a few seconds, we will see the result of the machine learning process. Our Target Node Leukemia Type is now connected to all variables in its Markov Blanket. To show these connections, we need to zoom out and at this level this individual nodes are barely visible dots on the screen. However, the connections in the Markov Blanket are very prominent and we can see that only 55 variables out of over 7,000 were selected as predictors. 7 Intuitively, the “augmented” part of the network plays the same role as the interaction terms in a regression. www.conradyscience.com | www.bayesia.com 11
  • 14. Microarray Analysis with Bayesian Networks and BayesiaLab Performance Evaluation In order to see whether this selection proves to be adequate for classi cation purposes, we switch into the Validation Mode by pressing the F5 key and start the performance evaluation. As we do not have a separate test and training set, we will need to use Cross-Validation for evaluation purposes. Cross- validation is a technique for assessing how the predictions of a model will generalize to an independent data set. One round of Cross-Validation involves partitioning a sample of data into complementary subsets, estimating the model on one subset, and then validating the analysis on the other subset. To reduce variability, multiple rounds of Cross- Validation are performed using different partitions, and the validation results are averaged over the rounds. To start the process, we select Tools>Cross Validation>Targeted from the menu. The performance report shows that all of the 47 ALL cases, 45 were correctly classi ed and of the 25 AML, 24 cases were correctly identi ed, too. This yields a total precision of almost 96%. Network Complexity Beyond precision, parsimony is a key objective in most modeling tasks. This need for simplicity is particularly obvious in the presence of thousands of potential predictors. Golub’s initial model used those 50 genes as predictors, which were www.conradyscience.com | www.bayesia.com 12
  • 15. most correlated with the AML-ALL distinction. Our Augmented Markov Blanket model found 55 predictors, and, as it turns out, 14 of the 55 are in common with Golub’s list of the 50 most correlated variables.

    Although the selection of 55 predictors would perhaps be sufficient for a practical application, we can investigate whether fewer predictors can still yield reliable results. BayesiaLab allows us to manage network complexity via the Structural Coefficient (SC). By default, the value of SC is set to 1; however, we can increase its value to force a simpler network structure, which, in the case of our model, would mean fewer predictors. We will arbitrarily set SC to 2 and run the Augmented Markov Blanket learning again. The resulting network now contains only 16 predictors.

    To evaluate the performance of this network, we will repeat the Cross-Validation.
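The Cross-Validation procedure described in the Performance Evaluation section can be sketched in a few lines of Python. This is only an illustration of the partitioning idea and of how the reported "total precision" (the share of correctly classified cases) follows from the per-class counts; it is not BayesiaLab's implementation. The function names, the number of folds (10), and the random seed are assumptions made for this example, since the text does not state how many folds were used.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists for k rounds of cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def overall_precision(correct_by_class, total_by_class):
    """Share of all cases that were classified correctly."""
    return sum(correct_by_class.values()) / sum(total_by_class.values())


# 72 cases in the data set; k = 10 is a hypothetical choice.
splits = list(k_fold_splits(72, 10))

# Per-class counts reported for the 55-predictor model.
precision = overall_precision({"ALL": 45, "AML": 24}, {"ALL": 47, "AML": 25})
print(f"{len(splits)} folds, precision = {precision:.1%}")  # 10 folds, precision = 95.8%
```

Setting k equal to the number of cases turns the same partitioning into leave-one-out cross-validation, in which every case is held out exactly once.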
  • 16. Interestingly, despite the smaller number of predictors, the classification performance has actually improved slightly and now stands at 97%. Given this performance of a simpler model, we may wish to increase the SC further and see whether we still obtain a reasonable model. Outside a case study, in a real-world situation, the analyst would have to determine how far to take this simplification. Our pursuit of an even simpler structure serves demonstration purposes only. We now set SC=3 and once again learn the Augmented Markov Blanket. The number of predictors is now narrowed down to 5.
  • 17. Repeating the Cross-Validation shows that the precision remains at the same level as before, i.e. at 97%.
  • 18. At first glance it might be tempting to choose this very simple model with only five predictors for practical application. However, practical considerations may actually suggest otherwise. Given the delicate nature of microarray analysis, missing values, measurement errors, etc. are to be expected. A model with more predictors, including redundant predictors, will be less sensitive to noise and thus more robust overall.

    Inference

    With the small number of predictors in our most recent network, we can inspect them more closely and review their states in the Monitor Panel. We can display their Monitors by first selecting the nodes in the Graph Panel, then right-clicking and choosing Monitor from the contextual menu. We can now see their states, the thresholds between the states, and their marginal distributions.

    So far, we have only made inferences about the state of Leukemia Type based on the states of the predictors, i.e. P(Leukemia Type | gene).[8] We can use a key property of Bayesian networks here, namely omnidirectional inference. This allows us to compute the posterior probability of the states of the predictors, given the state of Leukemia Type: P(gene | Leukemia Type).

    [8] gene = (e1, e2, e3, …, en), i.e. a vector consisting of n expression levels. Here, n is the number of predictors.
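To make the idea of omnidirectional inference concrete, the following minimal sketch reverses the direction of inference for a single discretized predictor using Bayes' rule. All probabilities, the state names, and the function name are hypothetical values chosen for illustration; they are not taken from the learned network, and a full Bayesian network engine handles many predictors jointly rather than one at a time.

```python
def invert_conditionals(p_class_given_state, p_state):
    """Return P(state | class) from P(class | state) and P(state) via Bayes' rule."""
    classes = list(next(iter(p_class_given_state.values())))
    # Marginal P(class) = sum over states of P(class | state) * P(state).
    p_class = {
        c: sum(p_class_given_state[s][c] * p_state[s] for s in p_state)
        for c in classes
    }
    return {
        c: {s: p_class_given_state[s][c] * p_state[s] / p_class[c] for s in p_state}
        for c in classes
    }


# Hypothetical marginal distribution of one gene discretized into three states.
p_state = {"low": 0.5, "mid": 0.3, "high": 0.2}

# Hypothetical P(Leukemia Type | gene state).
p_class_given_state = {
    "low": {"ALL": 0.95, "AML": 0.05},
    "mid": {"ALL": 0.40, "AML": 0.60},
    "high": {"ALL": 0.10, "AML": 0.90},
}

posterior = invert_conditionals(p_class_given_state, p_state)
for leukemia_type, distribution in posterior.items():
    print(leukemia_type, {s: round(p, 3) for s, p in distribution.items()})
```

Each row of the result is a distribution over the gene's states given one class, which corresponds to what the Monitors display once evidence is set on Leukemia Type.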
  • 19. For Leukemia Type=ALL, we obtain the following posterior distributions:

    Conversely, for Leukemia Type=AML, we obtain these posterior distributions:

    The very manageable number of predictors makes it easy to further examine their specific roles in the classification model. BayesiaLab provides a number of tools that assist with interpreting the variables.

    Target Dynamic Profile

    The Target Dynamic Profile function is typically used to search for the states of the predictor variables that optimize the desired state of the target variable. In our case, however, we are interested in those states of the predictor variables that provide the greatest amount of information for distinguishing between the states AML and ALL of the Leukemia Type variable. The function can be invoked by selecting Analysis>Reports>Target Dynamic Profile from the menu.
  • 20. In the following dialogue we select Probability, Criterion Maximization, Take Into Account the Joint Probability and Hard Evidence. The first result shows ALL vs. AML. This means that observing the lowest state (1/3) of M23197 would change the probability of ALL from 65% (a priori) to 97.5% (a posteriori).

    Conversely, the probability of AML increases from 34% (a priori) to 91.5% (a posteriori), given that the higher state (2/2) of X95735 is observed. A second piece of evidence, the higher state (2/2) of U46499, would further increase the AML probability to 99%.
  • 21. Bayes Factor

    BayesiaLab offers an additional metric for interpreting the impact of observing a specific piece of evidence, namely the Bayes Factor, which it reports for each variable. We define the Bayes Factor here as

    $$K = \log_2\left(\frac{P(H \mid E)}{P(H)}\right),$$

    which, from a statistical perspective, can be seen as the strength of the observed evidence E with respect to the hypothesis H. From an information theory perspective, the Bayes Factor quantifies the change in uncertainty about the state of a variable once we have the evidence E.

    For instance, after observing the evidence of M23197 = 1/3, which is the first of three states (see green bar in the associated Monitor),
  • 22. we can observe the impact on all the other variables by selecting Analysis>Report>Evidence Analysis. The rightmost column shows the Bayes Factor. For the state ALL of variable Leukemia Type it is 0.5832, and for state AML it is -3.7942. This means that this observation provides much stronger evidence against AML than it provides evidence for ALL. As we noted earlier, inference within Bayesian networks is always omnidirectional, so we obtain Bayes Factors for all other variables in the network.

    Target Interpretation Tree

    Observing the specific states of any of these genes should thus update one’s belief about the likely state of Leukemia Type. BayesiaLab can also represent this sequence of “belief updating given evidence” in the form of a tree. We can generate such a tree by selecting Analysis>Target’s Interpretation Tree. To show this Target Interpretation Tree, we have limited the number of pieces of evidence to three and omitted the rest, so the tree can fit on one page and still remain legible.

    By reading the tree from left to right, we can see how each piece of evidence updates the probabilities of AML versus ALL. For instance, starting at the root and observing the middle value (2/3) for M23197 we would follow the center
  • 23. path and see that our new conditional probability for ALL is now 40.6%, i.e. it is now lower than the original marginal probability of 65%. Observing the next piece of evidence, U46499, can change the picture again. Given the lower value (1/2) for U46499, the conditional probability for ALL would increase to 98.5%, and so on.

    Conclusion

    We have demonstrated that, with Bayesian networks as the framework and BayesiaLab as the software tool, we can rapidly generate a reliable classification model on the basis of gene expression data.

    Beyond the good classification performance, the Augmented Markov Blanket model provides an effective means of feature selection among thousands of potential predictors, with a minimum of specification effort. Furthermore, the extreme speed of feature selection allows the researcher to move quickly from data acquisition to interpretation.

    Finally, the user-friendly interface of BayesiaLab makes the required workflow easily accessible to any subject matter expert, without the need to write any program code, which is typically required in this domain.
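As a supplementary check on the Bayes Factor section, the definition K = log2(P(H|E) / P(H)) can be evaluated directly for the first piece of evidence discussed there. The sketch below assumes that the prior probabilities equal the class shares in the 72-case data set (47 ALL, 25 AML) and that the posterior for AML is simply the complement of the rounded 97.5% posterior quoted for ALL; BayesiaLab's internal values may differ slightly, so the results should only approximate the reported figures.

```python
import math


def bayes_factor(posterior, prior):
    """K = log2(P(H|E) / P(H)), following the definition given in the text."""
    return math.log2(posterior / prior)


# Assumption: priors taken as the raw class shares of the 72 cases.
prior_all, prior_aml = 47 / 72, 25 / 72

# Posterior for ALL as quoted in the text (rounded); AML as its complement.
post_all = 0.975
post_aml = 1 - post_all

k_all = bayes_factor(post_all, prior_all)
k_aml = bayes_factor(post_aml, prior_aml)
print(f"K(ALL) = {k_all:.2f}, K(AML) = {k_aml:.2f}")
```

Under these assumptions the two values land close to the reported 0.5832 and -3.7942. The asymmetry follows from the logarithm: raising a 65% probability to near certainty changes the ratio only modestly, whereas pushing a 35% probability down to a few percent shrinks the ratio by more than a factor of ten.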
  • 25. Appendix

    Markov Blanket

    In many cases, the Markov Blanket algorithm is a good starting point for any predictive model, whether used for scoring or classification. This algorithm is extremely fast and can even be applied to databases with thousands of variables and millions of records.

    The Markov Blanket for a node A is the set of nodes composed of A’s parents, its children, and its children’s other parents (=spouses).

    The Markov Blanket of the node A contains all the variables which, if we know their states, will shield the node A from the rest of the network. This means that the Markov Blanket of a node is the only knowledge needed to predict the behavior of that node A. Learning a Markov Blanket selects relevant predictor variables, which is particularly helpful when there is a large number of variables in the database. (In fact, this can also serve as a highly efficient variable selection method in preparation for other types of modeling, outside the Bayesian network framework.)
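The definition above (parents, children, and the children's other parents) translates directly into a small graph routine. The sketch below reads the Markov Blanket off a known directed acyclic graph; it does not learn the structure from data, which is what BayesiaLab's algorithm does. The function name and the toy graph with nodes A to F are hypothetical.

```python
def markov_blanket(parents, node):
    """Return the Markov Blanket of `node` in a DAG.

    `parents` maps every node to the set of its parents.
    """
    children = {n for n, ps in parents.items() if node in ps}
    # Spouses: the other parents of the node's children.
    spouses = {p for child in children for p in parents[child]} - {node}
    return set(parents[node]) | children | spouses


# Hypothetical DAG: B -> A -> C <- D, C -> E -> F
toy_dag = {
    "A": {"B"},
    "B": set(),
    "C": {"A", "D"},
    "D": set(),
    "E": {"C"},
    "F": {"E"},
}

print(sorted(markov_blanket(toy_dag, "A")))  # ['B', 'C', 'D']
```

In this toy graph, E and F lie outside the blanket of A: once B, C, and D are known, they carry no further information about A.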
  • 26. Comparison of Classification Performance with Golub et al. (1999)

    In order to facilitate a direct comparison of our approach with the performance of the Golub model, we now constrain our data set to the initial 38 cases. Based on this smaller data set, we learn the Augmented Markov Blanket, which selects 20 variables as predictors. Subsequently, we follow Golub’s approach of applying a leave-one-out cross-validation for performance evaluation.
  • 27. We obtain a precision of 97%, correctly classifying 37 out of the 38 cases. The Golub model correctly identified 36 cases and declared the remaining two cases as uncertain. Within our model, we cannot detect any “borderline” cases in terms of classification strength, which means that one case was indeed misclassified.
  • 28. References

    Conrady, Stefan, and Lionel Jouffe. “Breast Cancer Diagnostics with Bayesian Networks.” Conrady Applied Science, LLC, March 5, 2011. http://www.conradyscience.com/index.php/wbcd.

    Dudoit, S., J. Fridlyand, and T. P. Speed. “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data.” Journal of the American Statistical Association 97, no. 457 (2002): 77–87.

    Friedman, N., M. Linial, I. Nachman, and D. Pe’er. “Using Bayesian Networks to Analyze Expression Data.” Journal of Computational Biology 7, no. 3-4 (2000): 601–620.

    Gentleman, Robert. “Reproducible Research: A Bioinformatics Case Study.” Statistical Applications in Genetics and Molecular Biology 4, no. 1 (2005). http://www.bepress.com/sagmb/vol4/iss1/art2.

    Golub, T. R., et al. “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.” Science 286, no. 5439 (1999): 531–537.

    Husmeier, Dirk. “Bayesian Networks for Analysing Gene Expression Data.” August 2001. http://www.bioss.ac.uk/~dirk/essays/GeneExpression/bayes_net.html.

    “Markov Blanket.” Wikipedia. http://en.wikipedia.org/wiki/Markov_blanket.

    Slonim, Donna K, Pablo Tamayo, Jill P Mesirov, Todd R Golub, and Eric S Lander. “Class Prediction and Discovery Using Gene Expression Data.” 2000 (2000): 263–272.

    Torgo, Luis. Data Mining with R: Learning with Case Studies. 1st ed. Chapman and Hall/CRC, 2010.
  • 29. Contact Information

    Conrady Applied Science, LLC
    312 Hamlet’s End Way
    Franklin, TN 37067
    USA
    +1 888-386-8383
    info@conradyscience.com
    www.conradyscience.com

    Bayesia SAS
    6, rue Léonard de Vinci
    BP 119
    53001 Laval Cedex
    France
    +33(0)2 43 49 75 69
    info@bayesia.com
    www.bayesia.com

    Copyright © 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.

    Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:
    • You may print or download this document for your personal and noncommercial use only.
    • You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia SAS as the source of the material.
    • You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.