CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
According to the American Cancer Society (2015), Chronic Myeloid Leukaemia
(CML) is a type of cancer that affects the blood cells of living organisms. The body is
made up of trillions of living cells. Normal body cells grow, divide to make new
cells, and die in an orderly way. During the early years of a person's life, normal cells
divide only to replace worn-out, damaged, or dying cells. Cancer begins when cells
in a part of the body start to grow out of control (Cortes et al., 2011). There are many
kinds of cancer, but they all start because of this out-of-control growth of abnormal
cells. Cancer growth is different from normal cell growth. Instead of dying, cancer
cells keep on growing and form new cancer cells. These cancer cells can grow into
(invade) other tissues, something that normal cells cannot do (Cortes et al., 2012).
Being able to grow out of control and invade other tissues is what makes a cell a
cancer cell. In most cases, the cancer cells form a tumor. But some cancers, like
Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone
marrow. When cancer cells get into the bloodstream or lymph vessels, they can travel
to other parts of the body (Kantarjian and Cortes, 2008). There they begin to grow
and form new tumors that replace normal tissues. This process is called metastasis.
Leukaemia is a type of cancer that starts in cells that form new blood cells.
These cells are found in the soft, inner part of the bones called bone marrow. Chronic
myeloid Leukaemia (CML), also known as chronic myelogenous Leukaemia, is a
fairly slow growing cancer that starts in the bone marrow. It is a type of cancer that
affects the myeloid cells – cells that form blood cells, such as red blood cells,
platelets, and many types of white blood cells. In CML, Leukaemia cells tend to build
up in the body over time. In many cases, people do not have any symptoms for at
least a few years. CML can also change into a fast growing, acute Leukaemia that
invades almost any organ in the body. Most cases of CML occur in adults, but it is
also very rarely found in children. As a rule, treatment in children is the same as that
for adults.
According to Durosinmi et al. (2008), chronic myeloid Leukaemia (CML) has
an annual worldwide incidence of 1/100,000 with a male-to-female ratio of 1.5:1. The
median age of the disease incidence is about 60 years (Deninger and Druker, 2003).
In Nigeria and other African countries with similar demographic pattern, the median
age of the occurrence of CML is 38 years (Boma et al., 2006; Okanny et al., 1989).
In the United States of America (USA) however, the incidence of CML in the age
group under 70 years is higher among the African-Americans than among any other
racial/ethnic groups (Groves et al., 1995). It is probable that a combination of
environment and as yet unknown biological factors may account for the differential
age incidence pattern of CML between the Blacks and other races in the USA.
According to Oyekunle et al. (2012a), pediatric CML is rare, accounting for less than
10% of all cases of CML and less than 3% of all pediatric Leukaemias. Incidence
increases with age: exceptionally rare in infancy, it is about 0.7 per million/year
at ages 1–14 years, rising to 1.2 per million/year in adolescents worldwide (Lee
and Chung, 2011).
To date, only allogeneic stem cell transplantation (SCT) remains curative for
chronic myeloid leukaemia (Robin et al., 2005), though its role has waned
significantly in recent times due to the effectiveness of the tyrosine kinase inhibitors
(TKIs) (Oyekunle et al., 2011; Oyekunle et al., 2012b). Although potentially curative,
SCT is associated with significant morbidity and mortality (Gratwohl et al., 1998).
Alpha interferon-based regimens adequately control the chronic phase of the disease,
but result in few long term survivors (Bonifazi et al., 2001). Advances in targeted
therapy resulted in the discovery of Imatinib mesylate, a selective competitive
inhibitor of the BCR-ABL protein tyrosine kinase, which has been demonstrated to induce
both hematologic and cytogenetic remission in a significant proportion of CML
patients (Kantarjian et al., 2002). A number of prognostic scoring systems have been
developed for patients with CML, of which Sokal and Hasford (or Euro) scores are
most popular (Gratwohl et al., 2006). The Sokal score was generated using chronic
phase CML patients treated with busulphan or hydroxyurea (Sokal et al., 1984), while
the Hasford score was derived and validated, using patients treated with Interferon-
alpha (Hasford et al., 1998).
Survival Analysis deals with the application of methods to estimate the
likelihood of an event (death, survival, decay, child-birth etc.) occurring over a
variable time period (Dimitologlou et al., 2012); in short, it is concerned with
studying the time between entry to a study and a subsequent event (such as death).
The traditional statistical methods applied in the area of survival analysis include the
Kaplan-Meier (KM) estimator curve (Kaplan and Meier, 1958) and the Cox
proportional hazards (PH) model (Cox, 1972). These methods estimate survival
parameters for a group of individuals; the Kaplan-Meier estimator is non-parametric,
while the Cox model is semi-parametric, and fully parametric survival models are also
used in traditional statistical practice. The Kaplan-Meier method allows for an
estimation of the proportion of the population of people who survive a given length
of time under some circumstances. The Cox model is a statistical technique for
exploring the relationship between the survival of a patient and several explanatory
variables. Before the advent of Imatinib
as a treatment option for Chronic Myeloid Leukaemia (CML), the median survival
time for CML was 3 to 5 years from the time of diagnosis of the disease (Hosseini
and Ahmadi, 2013). According to Gambacorti-Passerini et al. (2011), a follow-up of
832 patients using Imatinib showed an overall survival rate of 95.2% after 8 years. A
10-year follow-up of 527 patients in Nigeria undergoing Imatinib treatment showed
an overall survival rate of 92% and 78% after 2 and 5 years respectively (Oyekunle et
al., 2013).
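To make the Kaplan-Meier estimation described above concrete, the following is a minimal sketch using the Python lifelines library; the follow-up times and event indicators are invented toy values, not data from this study.

```python
# Minimal Kaplan-Meier sketch (toy values, not study data).
from lifelines import KaplanMeierFitter

durations = [6, 13, 24, 25, 31, 48, 60, 60, 72, 84]  # follow-up time in months
events    = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]           # 1 = died, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events, label="toy CML cohort")

print(kmf.predict(24))         # estimated probability of surviving past 24 months
print(kmf.predict(60))         # estimated probability of surviving past 60 months
print(kmf.survival_function_)  # the full step-function survival estimate
```

The estimator handles the censored patients (event indicator 0) explicitly, which is the main reason it is preferred over a naive proportion of survivors.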
Machine learning (ML) is a branch of artificial intelligence that allows
computers to learn from past examples of data records (Quinlan, 1986; Cruz and
Wishart, 2006). Unlike traditional explanatory statistical modeling techniques,
machine learning does not rely on prior hypothesis (Waijee et al., 2013a). Machine
learning has found great importance in the area of predictive modeling in medical
research, especially in the areas of risk assessment, survival risk and recurrence risk.
Machine learning techniques can be broadly classified into: supervised and
unsupervised techniques; the former involves matching a set of input records to one
out of two or more target classes while the latter is used to create clusters or attribute
relationships from raw, unlabeled or unclassified datasets (Mitchell, 1997).
Supervised machine learning algorithms can be used in the development of
classification or regression models. A classification model is a supervised approach
aimed at allocating a set of input records to a discrete target class, unlike regression,
which allocates a set of records to a real value. This research is focused on using
classification models to classify patients' survival as either survived or not
survived.
Feature selection methods are machine learning techniques used
to identify relevant attributes in a dataset. They are important in identifying irrelevant
and redundant attributes that exist within a dataset, which may increase computational
complexity and time (Yildirim, 2015; Hall, 1999). Feature selection methods are
broadly classified into filter-based, wrapper-based and embedded methods. Filter-based
methods were chosen for this study because of their ability to identify relevant
attributes with respect to the target class (CML patient survival), unlike wrapper-based
methods, which rely on the performance of the machine learning algorithms. Filter-based
feature selection methods were used to identify the most relevant variables that are
predictive for CML patients' survival from the variables monitored during the follow-
up of Imatinib treatment administered to Nigerian CML patients. The relevant
features proposed using feature selection were used to formulate the predictive model
for CML patients' survival classification using supervised machine learning
techniques.
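Purely as an illustration of the filter-based idea (each variable is scored against the target class independently of any classifier), the sketch below ranks hypothetical follow-up variables by mutual information with a survival label using Python's scikit-learn. The variable names and data are invented; the study itself ran its filter-based selectors in WEKA.

```python
# Illustrative filter-based feature ranking (invented data; the study used WEKA).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
feature_names = ["age", "spleen_size", "wbc_count", "platelet_count", "basophils"]
X = rng.normal(size=(100, len(feature_names)))   # hypothetical follow-up variables
y = rng.integers(0, 2, size=100)                 # 1 = survived, 0 = not survived

# Score each feature against the class without involving any learning algorithm.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(feature_names, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```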
1.2 Statement of Research Problem
Chronic Myeloid Leukaemia (CML) is a very serious disease affecting
Nigerians, with just one government referral hospital in Nigeria administering
Imatinib treatment, and with a limited number of experts compared to the number of
cases attended to. In Nigeria, hematologists rely on scoring models proposed using
datasets belonging to Caucasian (white race) and/or non-African CML patients
undergoing treatment before the Imatinib era (e.g. the Sokal score used busulphan or
hydroxyurea; the Hasford score used Interferon-alfa; and the European Treatment and Outcome
Study). These models have been deemed ineffective for Nigerian CML patients who
are undergoing Imatinib treatment and as such there is presently no existing predictive
model in Nigeria specifically for the survival of CML patients undergoing Imatinib
treatment. There is a need for a predictive model which will aid clinical decisions
concerning continual treatment or alternative action affecting the survival of CML
patients receiving Imatinib treatment, hence this study.
1.3 Aim and Objectives of the Study
The aim of this research is to develop a predictive model which identifies the relevant
attributes required for classifying the survival of Chronic Myeloid Leukemia patients
receiving Imatinib treatment in Nigeria using machine learning techniques. The
specific research objectives are to:
i. elicit knowledge on the variables monitored during the follow-up of
Imatinib treatment;
ii. propose the variables predictive for CML survival from (i) and use them to
formulate the predictive model;
iii. simulate the predictive model formulated in (ii); and
iv. validate the model in (iii) using historical data.
1.4 Research Methodology
In order to achieve the listed objectives, the methodological approach for this
study employed the following methods.
A formal interview was conducted with two (2) hematologists to identify the
parameters used to monitor survival, and anonymized, validated information about
patients was collected.
Filter-based feature selection methods were used to identify the most relevant
variables (prognostic factors) predictive for survival from the variables identified,
after which the predictive model was formulated using supervised machine learning
algorithms.
The formulated model was simulated using the Explorer interface of the
Waikato Environment for Knowledge Analysis (WEKA) software – a lightweight
Java-based suite of machine learning tools – using the preprocess, classify and select
attributes packages.
The collected historical data was used to validate the performance of the
model by determining the confusion matrix, recall, precision, accuracy and the area
under the Receiver Operating Characteristics (ROC) curve.
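For concreteness, the sketch below shows how the validation measures just listed could be computed with Python's scikit-learn; the true labels, predicted labels and predicted probabilities are toy values (the study itself obtained these measures from WEKA).

```python
# Computing the named validation measures on toy labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # 1 = survived, 0 = not survived
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.85]  # probabilities

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))  # area under the ROC curve
```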
1.5 Research Justification
The Nigerian Health sector has set ambitious targets for providing essential
health services to all citizens; improving the quality of decisions affecting treatment
options is very essential to reducing disease mortality rates in Nigeria. Predictive
models for Chronic Myeloid Leukaemia (CML) survival classification can help
identify the most relevant variables for patient survival and thus allow physicians to
concentrate on a smaller number of important variables during clinical observations.
1.6 Scope and Limitations of the Study
This study is limited to the classification of 2-year and 5-year survival of
Nigerian Chronic Myeloid Leukaemia (CML) patients receiving follow-up for Imatinib
treatment at Obafemi Awolowo University Teaching Hospital Complex (OAUTHC),
Ile-Ife, Osun State. Also, the datasets used for this study were based on information
collected from a single centre and a relatively limited number of CML patients.
1.7 Organization of Thesis
The first chapter of this thesis has been presented and the organization of the
remaining chapters is discussed in the following paragraphs.
Chapter two contains the Literature Review which consists of an introduction
to chronic myeloid Leukaemia (CML), its etiology, treatment and distribution around
the World, Africa and Nigeria; survival analysis and the existing stochastic methods
(Kaplan-Meier and the Cox proportional hazard models); Machine learning –
supervised, unsupervised and application of machine learning in healthcare; Feature
selection methods; Existing survival models and related works.
Chapter three contains the Research Methodology which consists of the
research framework, data collection methods, data identification and variable
description, feature selection results, model formulation methods – supervised
machine learning algorithms proposed, model simulation and the performance
evaluation metrics to be used.
Chapter four contains the Results and discussions which consists of the
descriptive statistics of the data collected from the referral hospital, feature selection
results and discussions, simulation results and discussions and the performance
evaluation of the machine learning algorithms used.
Chapter five contains the summary, conclusion, recommendations and the
possible future works of the study.
CHAPTER TWO
LITERATURE REVIEW
2.1 Chronic Myeloid Leukaemia (CML)
According to DeAngelo and Ritz (2004), chronic myeloid leukaemia (CML) is
a clonal hematopoietic disorder characterized by the reciprocal translocation
involving chromosomes 9 and 22. As a result of this translocation, a novel fusion
gene, BCR-ABL is created and constitutive activity of this tyrosine kinase plays a
central role in the pathogenesis of the disease process. Cancer begins when cells in a
part of the body start to grow out of control. There are many kinds of cancer, but they
all start because of this out-of-control growth of abnormal cells (NCCN, 2014).
Cancer cell growth is different from normal cell growth. Instead of dying, cancer
cells keep on growing and form new ones. These cancer cells can grow into (invade)
other tissues, something that normal cells cannot do (National Cancer Institute NCI,
2011). Being able to grow out of control and invade other tissues is what makes a cell
a cancer cell. In most cases the cancer cells form a tumor. But some cancers, like
Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone
marrow (see Figure 2.1 for family tree of blood cells).
Leukaemia is a group of cancers that usually begins in the bone marrow and
results in high numbers of abnormal white blood cells (NCI, 2011). These white blood
cells are not fully developed and are called blasts or Leukaemia cells. Symptoms may
include bleeding and bruising problems, feeling very tired, fever and an increased risk
of infections.
Figure 2.1: Family Tree of Blood Cells
(Source: NCCN, 2014)
These symptoms occur due to a lack of normal blood cells. Diagnosis is typically by
blood tests or bone marrow biopsy. Clinically and pathologically, Leukaemia is
subdivided into a variety of large groups.
The first division is between its acute and chronic forms:
Acute Leukaemia is characterized by a rapid increase in the number of
immature blood cells (Locatelli and Niemeyer, 2015). Crowding due to such cells
makes the bone marrow unable to produce healthy blood cells. Immediate treatment is
required in acute Leukaemia due to the rapid progression and accumulation of the
malignant cells, which then spill over into the bloodstream and spread to other organs
of the body (Dohner et al., 2015). Acute forms of Leukaemia are the most common
forms of Leukaemia in children.
Chronic Leukaemia is characterized by the excessive buildup of relatively
mature, but still abnormal, blood cells. Typically taking months or years to progress,
the cells are produced at a much higher rate than normal, resulting in many abnormal
white blood cells (Shen et al., 2007). Whereas acute Leukaemia must be treated
immediately, chronic forms are sometimes monitored for some time before treatment
to ensure maximum effectiveness of therapy (Provan and Gribben, 2010). Chronic
Leukaemia mostly occurs in older people, but can theoretically occur in any age
group.
Additionally, the diseases are subdivided according to which kind of blood
cell is affected (Hira et al., 2014). This split divides Leukaemias into lymphoblastic or
lymphocytic Leukaemias and myeloid or myelogenous Leukaemias (Table 2.1):
Table 2.1: The Four Major Kinds of Leukaemia

Lymphocytic (or lymphoblastic) Leukaemia:
- Acute: Acute lymphoblastic Leukaemia (ALL)
- Chronic: Chronic lymphocytic Leukaemia (CLL)

Myelogenous (myeloid or nonlymphocytic) Leukaemia:
- Acute: Acute myelogenous Leukaemia (AML or myeloblastic)
- Chronic: Chronic myelogenous Leukaemia (CML)

(Source: Hira et al., 2014)
In lymphoblastic or lymphocytic Leukaemias, the cancerous change takes
place in a type of marrow cell that normally goes on to form lymphocytes, which are
infection-fighting immune system cells. Most lymphocytic Leukaemias involve a
specific subtype of lymphocyte, the B cell. In myeloid or myelogenous Leukaemias,
the cancerous change takes place in a type of marrow cell that normally goes on to
form red blood cells, some other types of white cells, and platelets.
Chronic myelogenous (or myeloid or myelocytic) Leukaemia (CML), also
known as chronic granulocytic Leukaemia (CGL), is a cancer of the white blood cells.
It is a form of Leukaemia characterized by increased and unregulated growth of
predominantly myeloid cells in the bone marrow and the accumulation of these cells
in the blood. CML is a clonal bone marrow stem cell disorder in which a proliferation
of mature granulocytes (neutrophils, eosinophils and basophils) and their precursors is
found. It is a type of myeloproliferative disease associated with a characteristic
chromosomal translocation called the Philadelphia chromosome (Figure 2.2).
Chronic myeloid Leukaemia (CML) is defined by the presence of the Philadelphia
chromosome (Ph) which arises from the reciprocal translocation of the ABL1 and
BCR genes on chromosomes 9 and 22 respectively (Oyekunle et al., 2012c).
CML is characterized by the proliferation of a malignant clone containing the
BCR-ABL1 mutant fusion gene resulting in myeloid hyperplasia and peripheral blood
leucocytosis and thrombocytosis. It is believed that pediatric CML is rare, accounting
for less than 10% of all cases of CML and less than 3% of all pediatric Leukaemias
(Lee and Chung, 2011). Incidence increases with age: exceptionally rare in
infancy, it is about 0.7 per million/year at ages 1–14 years, rising to 1.2 per
million/year in adolescents (National Cancer Institute (NCI), 2011).
Figure 2.2: Philadelphia Chromosome and BCR-ABL gene
Generally, children are diagnosed at a median age of 11 – 12 years (range, 1 –
18 years) with approximately 10% presenting in advanced phases (Suttorp and Millot,
2010).
2.1.1 CML diagnosis
According to the National Comprehensive Cancer Network (NCCN)
Guideline for Patients on CML (2014), in order to diagnose Chronic Myeloid
Leukaemia (CML), doctors use a variety of tests to analyze the blood and marrow
cells. This is because no single special test suffices to diagnose CML; the best
starting point for diagnosis is the early report of symptoms. The following are a
number of tests useful in diagnosing CML in patients.
a. Complete Blood Count (CBC)
This test is used to measure the number and types of cells in the blood.
According to Tefferi et al. (2005), people with CML often have: decreased
hemoglobin concentration; increased white blood cell count, often to very high levels;
and a possible increase or decrease in the number of platelets depending on the
severity of the person's CML. Blood cells are stained (dyed) and examined with a light
microscope. These samples show: a specific pattern of white blood cells, a small
proportion of immature cells (leukaemic blast cells and promyelocytes) and a larger
proportion of maturing and fully matured white blood cells (myelocytes and
neutrophils). These blast cells, promyelocytes and myelocytes are normally not
present in the blood of healthy individuals.
b. Bone Marrow Aspiration and Biopsy
These tests are used to examine marrow cells to find abnormalities and are
generally done at the same time (Raanani et al., 2005). The sample is usually taken
from the patient's hip bone after medicine has been given to numb the skin (Figure
2.3). For a bone marrow aspiration, a special needle is inserted through the hip bone
and into the marrow to remove a liquid sample of cells. For a bone marrow biopsy, a
special needle is used to remove a core sample of bone that contains marrow. Both
samples are examined under a microscope to look for chromosomal and other cell
changes (Vardiman et al., 2001).
c. Cytogenetic Analysis
This test measures the number and structure of the chromosomes. Samples
from the bone marrow are examined to confirm the blood test findings and to see if
there are chromosomal changes or abnormalities, such as the Philadelphia (Ph)
chromosome (Cortes et al., 1995). The presence of the Ph chromosome (the shortened
chromosome 22) in the marrow cells, along with a high white blood cell count and
other characteristic blood and marrow test findings, confirms the diagnosis of CML.
The bone marrow cells of some people with CML have a Ph chromosome detectable
by cytogenetic analysis (Aurich et al., 1998). A small percentage of people with
clinical signs of CML do not have cytogenetically detectable Ph chromosome, but
they almost always test positive for the BCR-ABL fusion gene on chromosome 22
with other types of tests.
d. FISH (Fluorescence In Situ Hybridization)
FISH is a more sensitive method for detecting CML than the standard
cytogenetic tests that identify the Ph chromosome (Mark et al., 2006; Tkachuk et al.,
1990). FISH is a quantitative test that can identify the presence of the BCR-ABL gene
(Figure 2.4). Genes are made up of DNA segments. FISH uses color probes that bind
to DNA to locate the BCR and ABL genes in chromosomes. Both BCR and ABL
genes are labeled with chemicals each of which releases a different color of light
(Landstrom and Tefferi, 2006).
Figure 2.3: Bone marrow biopsy
Figure 2.4: Identifying the BCR-ABL Gene Using FISH
The color shows up on the chromosome that contains the gene (normally
chromosome 9 for ABL and chromosome 22 for BCR), so FISH can detect the piece
of chromosome 9 that has moved to chromosome 22. The BCR-ABL fusion gene is
shown by the overlapping colors of the two probes. Since this test can detect BCR-
ABL in cells found in the blood, it can be used to determine if there is a significant
decrease in the number of circulating CML cells as a result of treatment.
e. Polymerase Chain Reaction (PCR)
The BCR-ABL gene is also detectable by molecular analysis. A quantitative
PCR test is the most sensitive molecular testing method available. This test can be
performed with either blood or bone marrow cells (Branford et al., 2008). The PCR
test essentially increases or "amplifies" small amounts of specific pieces of either
RNA or DNA to make them easier to detect and measure. So, the BCR-ABL gene
abnormality can be detected by PCR even when present in a very low number of cells
(Hughes et al., 2003). About one abnormal cell in one million cells can be detected by
PCR testing. Quantitative PCR is used to determine the relative number of cells with
the abnormal BCR-ABL gene in the blood (Hughes et al., 2003). This has become
the most used and relevant type of PCR test because it can measure small amounts of
disease, and the test is performed on blood samples, so there is no need for a bone
marrow biopsy procedure. Blood cell counts, bone marrow examinations, FISH and
PCR may also be used to track a person's response to therapy once treatment has
begun (Ohm, 2013). Throughout treatment, the number of red blood cells, white blood
cells, platelets and CML cells is also measured on a regular basis.
2.1.2 Phases of chronic myeloid Leukaemia
Staging is the process of finding out how far a cancer has spread. Most types
of cancer are staged based on the size of the tumor and how far it has spread from
where it started. This system does not work for Leukaemias because they do not often
form a solid mass or tumor (Vardiman et al., 2001; Millot et al., 2005). Also,
Leukaemia starts in the bone marrow and, in many people, has already spread to
other organs by the time it is found. For someone with chronic myeloid Leukaemia (CML),
the outlook depends on other factors such as features of the cells shown in lab tests,
and the results of imaging studies (Raanani et al., 2005). This information helps
guide treatment decisions.
In Chronic myeloid Leukaemia, there are three phases. As the amount of blast
cells increases in the blood and bone marrow, there is less room for healthy white
blood cells, red blood cells and platelets. This may result in infections, anemia and
easy bleeding as well as bone pain and pain or a feeling of fullness below the ribs on
the left side. The number of blast cells in the blood and bone marrow and the
severity of signs and symptoms determine the phase of the disease (NCI, 2011). The
three (3) phases of CML are:
a. Chronic Phase;
b. Accelerated Phase; and
c. Blast Crisis Phase.
A number of patients progress from chronic phase, which can usually be well-
managed, to accelerated phase or blast crisis phase. This is because there are
additional genetic changes in the leukemic stem cells. Some of these additional
chromosome abnormalities are identifiable by cytogenetic analysis (Cortes et al.,
1995a; Aurich et al., 1998). However, there appear to be other genetic changes (low
levels of drug-resistant mutations that may be present at diagnosis) in the CML stem
cells that cannot be identified by the laboratory tests that are currently available.
a. Chronic Phase
The chronic phase is the first phase of CML. In this phase, the number of white
blood cells is increased, and immature white blood cells (blasts) make up less than
10% of cells in the peripheral blood and/or bone marrow. This means that fewer than
10 out of every 100 cells are blasts. CML in the chronic phase may cause mild
symptoms, but most often it causes no symptoms at all; possible symptoms include
infections, though the changes in the blood cells are not severe. In this phase, the
cancer progresses very slowly. Thus, CML in this phase may progress over several
months or years. In general, people with CML in the chronic phase respond better to
treatment.
b. Accelerated Phase
The accelerated phase is the second phase of CML. In this phase, the number
of blast cells in the peripheral blood and/or bone marrow is usually higher than
normal. Other aspects of accelerated phase can include increased basophils, very low
platelets, or new chromosome changes. The number of white blood cells is also high.
In this phase, the Leukaemia cells grow more quickly and may cause symptoms such
as anemia and an enlarged spleen. A few different criteria groups can be used to
define the accelerated phase. However, the two most commonly used are the World
Health Organization criteria and the criteria from the MD Anderson Cancer Center
(Table 2.2).
c. Blast Crisis Phase
The blast phase is the final phase of CML progression. Also referred to as
"blast crisis", CML in this phase can be life-threatening (NCCN, 2014). There are
two criteria groups that may be used to define the blast phase (Table 2.3).
Table 2.2: Criteria for Accelerated Phase

MD Anderson:
- 15% blasts in peripheral blood
- 30% blasts and promyelocytes in peripheral blood
- 20% basophils in peripheral blood
- Very low platelet count that is unrelated to treatment
- New chromosome changes (mutations)

World Health Organization:
- 10% to 19% blasts in peripheral blood and/or bone marrow
- 20% basophils in peripheral blood
- Very high or very low platelet count that is unrelated to treatment
- Increasing spleen size and white blood cell count despite treatment
- New chromosome changes (mutations)

(Source: Faderi et al., 1999; Swerdlow et al., 2008)
Table 2.3: Criteria for Blast Phase

World Health Organization:
- 20% blasts in peripheral blood or bone marrow
- Blasts found outside of blood or bone marrow
- Large groups of blasts found in bone marrow

International Bone Marrow Transplant Registry:
- 30% blasts in peripheral blood or bone marrow
- Blasts found outside of blood or bone marrow

(Source: Swerdlow et al., 2008; Druker, 2007)
In this phase, the number of blast cells in the peripheral blood and/or bone
marrow is very high. Another defining feature of blast phase is that the blast cells
have spread outside the blood and/or bone marrow into other tissues (Swerdlow et al.,
2008). In the blast phase, the Leukaemia cells may be more similar to Acute myeloid
Leukaemia (AML) or Acute lymphoblastic Leukaemia (ALL). AML causes too many
immature white blood cells called myeloblasts to be made. ALL results in too many
immature white blood cells called lymphoblasts.
2.1.3 Chronic myeloid Leukaemia (CML) Treatment
There is more than one treatment for chronic myeloid Leukaemia, the type of
treatment depending on factors such as age, general health, and the phase of the
cancer. Some people with CML may have more than one treatment (Talpaz et al.,
2006; Kantarjian et al., 2006; Cortes et al., 2007). Primary treatment is the main
treatment used to rid the body of cancer. Tyrosine Kinase Inhibitors (TKIs) are often
used as primary treatment for CML. First-line treatment is the first set of treatments
given to CML patients and if the treatment fails, second-line treatment is the next
treatment or set of treatments given (Lichtman et al., 2006). This is also referred to as
follow-up treatment since it is given after follow-up tests show that the previous
treatment failed or stopped working.
a. Tyrosine kinase inhibitor (TKI) therapy
TKI (tyrosine kinase inhibitor) therapy is a type of targeted therapy used to
treat CML. Targeted therapy is treatment with drugs that target a specific or unique
feature of cancer cells not generally present in normal cells; as a result of targeting
cancer cells, they may be less likely to harm normal cells throughout the body
(Jabbour et al., 2012). TKIs target the abnormal BCR-ABL protein that causes the
overgrowth of abnormal white blood cells (CML cells). The BCR-ABL protein, made
by the BCR-ABL gene, is a type of protein called a tyrosine kinase. Tyrosine kinases
are proteins located on or near the surface of cells and they tell cells when to grow
and divide to make new cells (Jabbour et al., 2007). TKIs block (inhibit) the BCR-
ABL protein from sending the signals that cause too many abnormal white blood cells
to form. However, each TKI works in a different way.
The FDA (Food and Drug Administration) approved the first TKI for the
treatment of CML in 2001. Since then, several TKIs have been developed to treat
CML. The newer drugs are referred to as second-generation TKIs. The TKIs used to
treat CML are listed in Table 2.4; the drugs are made in the form of pills that are
swallowed by the patient. The dose of the drug is measured in mg (milligrams).
Imatinib was the first TKI to be approved to treat CML. Thus, it is called a first-
generation TKI. Imatinib works by binding to the active site on the BCR-ABL
protein to block it from sending signals to make new abnormal white blood cells
(CML cells). Figure 2.5 shows how Imatinib treatment works.
Dasatinib is a second-generation TKI that was approved for the treatment of CML in
2006. Dasatinib is more potent than Imatinib and can bind to the active and inactive
sites on the BCR-ABL protein to block growth signals.
Nilotinib was approved to treat CML in 2007. It is a second-generation TKI that
works in almost the same way as Imatinib. However, Nilotinib is more potent than
Imatinib and it more selectively targets the BCR-ABL protein. Nilotinib also targets
other proteins apart from the BCR-ABL protein.
Bosutinib was approved to treat CML in 2012. However, this second-generation TKI
is only approved to treat patients who experienced intolerance or resistance to prior
TKI therapy. It also targets other proteins in the same way as Nilotinib.
Table 2.4: Tyrosine Kinase Inhibitor (TKI) drugs used to treat CML

Imatinib (sold as Gleevec®) – First-line treatment for:
1. Newly diagnosed adults and children in chronic phase
2. Adults in chronic, accelerated or blast phase after failure of interferon-alfa therapy

Dasatinib (sold as Sprycel®) – First-line treatment for:
1. Newly diagnosed adults in chronic phase
2. Adults resistant or intolerant to prior therapy in chronic, accelerated or blast phase

Nilotinib (sold as Tasigna®) – First-line treatment for:
1. Newly diagnosed adults in the chronic phase
2. Adults resistant or intolerant to prior therapy in chronic or accelerated phase

Bosutinib (sold as Bosulif®) – Second-line treatment for:
1. Adults with chronic, accelerated or blast phase with resistance or intolerance to prior therapy
Figure 2.5: How Imatinib works
Side effects are new or worse unplanned physical or emotional conditions caused by
treatment. Each TKI for CML can cause side effects which depend on: the drug, the
amount taken, the length of treatment, and the person; most side effects can be
managed or even prevented. Supportive care is the treatment of symptoms caused by
CML or side effects caused by CML treatment.
b. Immunotherapy
The immune system is the body's natural defense against infection and
disease. Immunotherapy is treatment with drugs that boost the immune system
response against cancer cells (Sharma et al., 2011). Interferon is a substance naturally
made by the immune system. Interferon can also be made in a laboratory to be used
as immunotherapy for CML. PEG (pegylated) interferon is a long-acting form of the
drug. Interferon is not recommended as a first-line treatment option for patients with
newly diagnosed CML. But, it may be considered for patients unable to tolerate TKIs
(NCCN, 2014). Interferon is often given as a liquid that is injected under the skin or
in a muscle with a needle.
c. Chemotherapy
Chemotherapy is a type of drug commonly used to treat cancer. Many people
refer to this treatment as "chemo". Chemotherapy drugs kill cells that grow rapidly,
including cancer cells and normal cells. Different types of chemotherapy drugs attack
cancer cells in different ways. Therefore, more than one drug is often used (Bluhm,
2011).
Omacetaxine is one of the chemotherapy drugs used for CML treatment,
approved in 2012 by the FDA for patients resistant and/or intolerant to two or
more TKIs. Resistance is when a CML patient does not respond to treatment;
intolerance is when treatment with a drug must be stopped due to severe side effects.
Omacetaxine works in part by blocking cells from making some of the proteins, such
as the BCR-ABL protein, needed for cell growth and division. This may slow or even
stop the growth of new CML cells.
Omacetaxine is administered as a liquid that is injected under the skin with a
needle. Other chemotherapy drugs may be given as a pill that is swallowed (NCCN,
2014). Chemotherapy is given in cycles of treatment days followed by days of rest.
The number of treatment days per cycle and the total number of cycles varies
depending on the chemotherapy drug given.
d. Stem cell transplant and donor lymphocyte infusion
An HSCT (hematopoietic stem cell transplant) is a medical procedure that
kills damaged or diseased blood stem cells in the body and replaces them with healthy
stem cells. HSCT is currently the only treatment for CML that may cure rather than
control the cancer. However, the excellent results with TKIs have challenged the role
of HSCT as first-line of treatment – the first set of treatments given to treat a disease.
For the treatment of CML, healthy blood stem cells are collected from another
person, called a donor. This is called an allogeneic HSCT. An allogeneic HSCT creates
a new immune system for the body. The immune system is the body's natural defense
against infection and disease. For this type of transplant, Human Leukocyte Antigen
(HLA) testing is needed to check whether the patient and donor are a good match.
A Donor Lymphocyte Infusion (DLI) is a procedure in which the patient
receives lymphocytes from the same person who donated blood stem cells for the
HSCT. A lymphocyte is a type of white blood cell that helps the body to fight
infections. The purpose of the DLI is to stimulate an immune response called the
Graft-versus-Tumor (GVT) effect or Graft-versus-Leukaemia (GVL) effect. The GVT
effect is when the transplanted cells (the graft) see the cancer cells (tumor/Leukaemia)
as foreign and attack them. This treatment may be used after HSCT for CML that
didn't respond to the transplant or that came back after an initial response.
2.1.4 Measuring CML treatment response
Measuring the response to treatment with blood and bone marrow testing is a
very important part of treatment for people with CML. In general terms, the greater
the response to drug therapy, the longer the disease will be controlled. Other factors
that affect a person's response to treatment include the stage of the disease and the
features of the individual's CML at the time of diagnosis.
Nearly all people with chronic phase CML have a "complete hematologic
response" with Gleevec, Sprycel or Tasigna therapy; most of these people will
eventually achieve a "complete cytogenetic response." Patients who have a complete
cytogenetic response often continue to have a deeper response and achieve a "major
molecular response." Additionally, a growing number of patients achieve a "complete
molecular response"; Table 2.5 explains each term.
2.1.5 Imatinib treatment for Nigerian CML patients
According to Oyekunle et al. (2012b), Nigerian CML patients are presently
treated using Imatinib as the first line of treatment. Chromosome analysis is done
using cultured bone marrow aspirate samples; Philadelphia chromosomes are
estimated from the metaphase and the proportion of Ph+ cells is noted. Patients in
the chronic phase receive oral Imatinib: 400mg daily while those in the accelerated or
blastic phase receive 600mg daily. Imatinib is continued for as long as there is
evidence of continued benefit from therapy.
Table 2.5: Chronic Myeloid Leukaemia (CML) Treatment Responses

Hematologic response (measured with a Complete Blood Count (CBC) with differential):
- Complete Hematologic Response (CHR): blood counts completely return to normal; no blasts in peripheral blood; no signs/symptoms of disease (the spleen returns to normal size).

Cytogenetic responses (measured with bone marrow cytogenetics):
- Complete Cytogenetic Response (CCyR): no Philadelphia (Ph) chromosomes detected.
- Partial Cytogenetic Response (PCyR): 1% - 35% of cells have the Ph chromosome.
- Major Cytogenetic Response: 0% - 35% of cells have the Ph chromosome.
- Minor Cytogenetic Response: more than 35% of cells have the Ph chromosome.

Molecular responses (measured with Quantitative PCR (QPCR) using the International Scale (IS)):
- Complete Molecular Response (CMR): no BCR-ABL gene detectable.
- Major Molecular Response (MMR): at least a 3-log reduction* in BCR-ABL levels, or BCR-ABL of 0.1% or less.

* A 3-log reduction is a 1/1,000 (i.e., 1,000-fold) reduction from the level at the start of treatment.
Allopurinol (300mg daily) is given until the leucocyte count falls below
20 x 10^9/L. Patients with hyperleucocytosis (leucocyte count > 100 x 10^9/L) who are
on hydroxyurea continue on the latter for another 1 – 3 weeks, with monitoring of the full
blood count before final withdrawal of the drug, when the white cell count falls to less
than 100 x 10^9/L.
In individuals with severe Imatinib-induced myelosuppression, the drug is
withheld until the neutrophils rise to 1.5 x 10^9/L and the platelet count to at least
75 x 10^9/L. Patients with recurrent, therapy-induced myelosuppression can have the
Imatinib dose reduced to 300mg daily until the blood count normalizes (the minimum
dose for therapeutic blood levels in adults). However, if the myelosuppression is related to
blastic transformation, Imatinib is discontinued with appropriate supportive therapy
being given. Women of child-bearing age are advised to use barrier contraception.
Imatinib treatment is withdrawn in patients who develop neutropenia (<1,000/mm^3) or
thrombocytopenia (<75,000/mm^3) while on therapy, until the cytopenias are
corrected, and is then re-commenced at lower doses.
2.2 Predictive Modeling
Predictive research aims at predicting future events or an outcome based on
patterns within a set of variables and has become increasingly popular in medical
research (Agbelusi, 2014; Idowu et al., 2015). Accurate predictive models can inform
patients and physicians about the future course of an illness or the risk of developing
illness and thereby help guide decisions on screening and/or treatment (Waijee et al,
2013a). There are several important differences between traditional explanatory
research and predictive research. Explanatory research typically applies statistical
methods to test causal hypotheses using prior theoretical constructs. In contrast,
predictive research applies statistical methods and/or machine learning techniques,
without preconceived theoretical constructs, to predict future outcomes (e.g.
predicting the risk of hospital readmission) (Breiman, 1984).
Although predictive models may be used to provide insight into the causality of the
pathophysiology of the outcome, causality is neither a primary aim nor a requirement
for variable inclusion (Moons et al., 2009). Non-causal predictive factors may be
surrogates for other drivers of disease, with tumor markers as predictors of cancer
progression or recurrence being the most common example. Unfortunately, a poor
understanding of the differences in methodology between explanatory and predictive
research has led to a wide variation in the methodological quality of prediction
research (Hemingway et al., 2009).
2.2.1 Types of predictive models
Machine learning has been previously used to predict behavior outcomes in
business, such as identifying consumer preferences for products based on prior
purchasing history. A number of different techniques for developing predictive
algorithms exist, using a variety of predictive analytic tools/software, and these have
been described in detail in the literature (Waijee et al., 2010; Siegel et al., 2011). Some
examples include neural networks, support vector machines, decision trees, naïve
Bayes etc. Decision trees, for example, use techniques such as classification and
regression trees, boosting and random trees to predict various outcomes.
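As a flavour of the model families named above, the hedged sketch below instantiates several of them as scikit-learn estimators on stand-in data; the accuracies printed are resubstitution (training-set) scores for illustration only, not a proper evaluation.

```python
# Several predictive-model families as scikit-learn estimators (stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=6, random_state=0)

models = {
    "decision tree":  DecisionTreeClassifier(random_state=0),
    "random forest":  RandomForestClassifier(n_estimators=100, random_state=0),
    "naive Bayes":    GaussianNB(),
    "support vector machine": SVC(random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                    random_state=0),
}
for name, model in models.items():
    # Training-set accuracy only; real studies validate on held-out data.
    print(name, model.fit(X, y).score(X, y))
```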
Machine learning algorithms, such as random-forest approaches, have several
advantages over traditional explanatory statistical modeling, such as the lack of a
predefined hypothesis, making them less likely to overlook unexpected associations (Liaw
and Weiner, 2012). Approaching a predictive problem without a specific causal
hypothesis can be quite effective when many potential predictors are available and
when there are interactions between predictors, which are common in engineering,
biological and social causative processes. Predictive models using machine learning
algorithms may therefore facilitate the recognition of important variables that may
otherwise not be initially identified (Waijee et al., 2010). In fact, many examples of
discovery of unexpected predictor variables exist in the machine learning literature
(Singal et al., 2013).
2.2.2 Developing a predictive model
The first step in developing a predictive model, when using traditional
regression analysis, is selecting relevant candidate predictor variables for possible
inclusion in the model; however, there is no consensus for the best strategy to do so
(Royston et al., 2009). In a backward-elimination approach, which starts with all candidate
variables, hypothesis tests are sequentially applied to determine which variables
should be removed from the final model, whereas a full-model approach includes all
candidate variables to avoid potential over-fitting and selection bias. Previously
reported significant predictor variables should typically be included in the final model
regardless of their statistical significance but the number of variables included is
usually limited by the sample size of the dataset (Greenland, 1989).
Inappropriate selection of variables is an important and common cause of poor
model performance in this situation. Selection of variables is less of an issue using
machine learning techniques given that they are often not solely based on predefined
hypotheses (Ibrahim et al., 2012). There are several other important issues relating to
data management when developing a predictive model, such as dealing with missing
data and variable transformation (Kaambwa et al., 2012; Waijee et al., 2013b).
2.2.3 Validating a predictive model
For a prediction model to be valuable, it must not only have predictive ability
in the derivation cohort but must also perform well in a validation cohort
(Hemingway et al., 2009). A model's performance may differ substantially between
derivation and validation cohorts for several reasons including over-fitting of the
model, missing important predictor variables, and inter-observer variability of
predictors leading to measurement errors (Altman et al., 2009). Therefore model
performance in the derivation dataset may be overly optimistic and is not a guarantee
that the model will perform equally well in a new dataset. Much published
prediction research focuses solely on model derivation; validation studies are very
scarce (Waijee et al., 2013b).
Validation can be performed using internal and external validation. A
common approach to internal validation is to split the data into two portions – a
training set and validation set. If splitting the dataset is not possible given the limited
available data, measures such as cross validation or bootstrapping can be used for
internal validation (Steyerberg et al., 2010). However, internal validation nearly
always yields optimistic results given that the derivation and validation dataset are
very similar (as they are from the same dataset). Although external validation is more
difficult as it requires data collected from similar sources in a different setting or a
different location, it is usually preferred to internal validation (Steyerberg et al.,
2001). When a validation study shows disappointing results, researchers are often
tempted to reject the initial model and develop a new predictive model using the
validation dataset.
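A minimal sketch of the two internal-validation options just described (a hold-out split versus k-fold cross-validation), using scikit-learn on stand-in data:

```python
# Hold-out split vs. 10-fold cross-validation (stand-in data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Option 1: split the data into a training set and a validation set.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("hold-out accuracy:", model.score(X_va, y_va))

# Option 2: cross-validation, useful when data are too few to split.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold cross-validated accuracy:", scores.mean())
```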
2.2.4 Assessing the performance of predictive model
When assessing model performance, it is important to remember that
explanatory models are judged based on the strength of associations, whereas
predictive models are judged solely on their ability to make accurate predictions. The
performance of a predictive model is assessed using several complementary tests,
which assess overall performance, calibration, discrimination, and reclassification
(Steyerberg et al., 2010) (Table 2.6). Performance characteristics should be
determined and reported for both the derivation and validation datasets. The overall
model performance can be measured using R^2, which characterizes the degree of
variation in risk explained by the model (Gerds et al., 2008). The adjusted R^2 has
been proposed as a better measure, as it accounts for the number of predictors and
helps prevent over-fitting. Brier scores are similar measures of performance,
which are used when the outcome of interest is categorical instead of continuous
(Czado et al., 2009).
Calibration is the difference between observed and predicted event rates for
groups within a dataset and is assessed using the Hosmer-Lemeshow test (Hosmer et al.,
1997). Discrimination is the ability of a model to distinguish between records which
do and do not experience an outcome of interest, and it is commonly assessed using
Receiver Operating Characteristics (ROC) curves (Hagerty et al., 2005). However,
ROC analysis alone is relatively insensitive for assessing differences between good
predictive models (Cook, 2007); therefore several relatively novel performance
measures have been proposed. The net reclassification improvement and integrated
discrimination improvement are measures used to assess changes in predicted
outcome classification between two models (Pencina et al., 2012).
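To make the overall-performance and calibration measures above concrete, a small sketch (with invented predicted risks) computes the Brier score and a grouped observed-versus-predicted comparison in the spirit of the Hosmer-Lemeshow test:

```python
# Brier score plus a grouped observed-vs-predicted calibration check (toy values).
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0])  # observed outcomes
p_hat  = np.array([0.1, 0.2, 0.8, 0.3, 0.7, 0.9, 0.2, 0.6, 0.8, 0.7, 0.4, 0.1])

print("Brier score:", brier_score_loss(y_true, p_hat))

# Hosmer-Lemeshow-style grouping: compare mean predicted and observed risk
# within groups ordered by predicted risk (3 groups for this tiny example).
order = np.argsort(p_hat)
for group in np.array_split(order, 3):
    print("predicted:", p_hat[group].mean().round(2),
          "observed:", y_true[group].mean().round(2))
```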
Table 2.6: Performance characteristics for a predictive model (measures of predictive error)

Overall performance:
- R^2 (continuous outcome): average squared difference between predicted and observed outcome.
- Adjusted R^2 (continuous outcome): same as R^2, but penalizes for the number of predictors.
- Brier score (categorical outcome): average squared distance between the predicted and the observed outcomes.

Discrimination:
- ROC curve (c statistic) (continuous or categorical outcome): overall measure of how effectively the model differentiates between events and non-events.
- C-index (Cox model): as the ROC c statistic, for Cox (survival) models.

Calibration:
- Hosmer-Lemeshow test (categorical outcome): agreement between predicted and observed risks.

Reclassification (categorical outcome*):
- Reclassification table: number of records that move from one category to another by improving the prediction model.
- NRI: a quantitative assessment of the improvement in classification by improving the prediction model.
- IDI: similar to NRI but using all possible cutoffs to categorize events and non-events.

IDI, integrated discrimination index; NRI, net reclassification index.
* Can be performed for continuous data as well if a risk cutoff is assigned.

(Source: Waijee et al., 2013b)
2.3 Machine Learning
Machine learning (ML) is a branch of artificial intelligence that allows
computers to learn from past examples using statistical and optimization techniques
(Quinlan, 1986; Cruz and Wishart, 2006). There are several applications for machine
learning, the most significant of which is predictive modeling (Dimitologlou et al.,
2012). Every instance (records/set of fields or attributes) in any dataset used by
machine learning algorithms is represented using the same set of features
(attributes/independent variables). The features may be continuous, categorical or
binary. If the instances are given with known labels (the corresponding target
outputs), then the learning is called supervised, in contrast to unsupervised learning,
where instances are unlabeled (Ashraf et al., 2013).
Supervised classification is one of the tasks most frequently carried out by
intelligent systems. Thus, a large number of techniques have been developed
based on Artificial Intelligence (logic-based techniques, perceptron-based
techniques) and Statistics (Bayesian networks, instance-based learning). The goal of
supervised learning is to build a concise model of the distribution of class labels in
terms of predictor features. The resulting classifier is then used to assign class labels
to testing instances where the values of the predictor features are known, but the
value of the class label is unknown (Gauda and Chahar, 2013). There are two variations
of supervised classification:
a. Regression (or Prediction/Forecasting) – the class label is represented
by a continuous variable (e.g. a real number); and
b. Classification – the class label is represented by discrete values (e.g.
categorical or nominal).
Unsupervised machine learning algorithms perform learning tasks used for
inferring a function to describe hidden structure from unlabeled data – data without a
target class (Sebastiani, 2002). The goal of unsupervised machine learning is to
identify examples that belong to the same groups/clusters based on underlying
characteristics that are common among attributes of members of the same cluster or
group (Zamir et al., 1997; Jain et al., 1999; Zhao and Karypis, 2002). The only
things that unsupervised learning methods have to work with are the observed input
patterns x_i, which are often assumed to be independent samples from an underlying
unknown probability distribution P(x), and some explicit or implicit a priori
information as to what is important. Examples of unsupervised machine learning
algorithms include:
a. Clustering (sketched below);
b. Maximum likelihood estimation;
c. Feature selection;
d. Association rule learning, etc. (Becker and Plumbley, 1996)
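As a minimal illustration of the unsupervised setting, the sketch below clusters unlabeled two-dimensional points with k-means in scikit-learn; the data are synthetic.

```python
# k-means clustering of unlabeled points (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose groups of 2-D points, with no target class attached.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the discovered group centres
print(kmeans.labels_[:10])      # cluster assignments for the first ten points
```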
2.3.1 Supervised machine learning algorithms
Supervised learning entails a mapping between a set of input variables
(features/attributes) labeled X_ij and an output variable Y_j (where j indexes the
records/CML cases) and applying this mapping to predict the outputs for unseen data
(data containing values for X but not Y). Supervised machine learning is the most
commonly used machine learning technique in engineering and medicine.
In the supervised machine learning paradigm, the goal is to infer a function f:

f : X \to Y  (2.1)

This function f is the model inferred by the supervised ML algorithm from
sample data or a training set composed of pairs of inputs (x_i) and outputs (y_i) such
that x_i \in X and y_i \in Y:

S = \{(x_i, y_i) : x_i \in X, y_i \in Y, i = 1, \ldots, n\}  (2.2)

Typically, for regression problems, X \subseteq R^d (where d is the dimension, or
number of features, of the vector x) and Y \subseteq R; for classification problems the
values of Y are discrete, while for binary classification Y = \{-1, +1\}.
In the statistical learning framework, the first fundamental hypothesis is that
the training data are independently and identically generated from an unknown but
fixed joint probability distribution function P(X, Y). The goal of the learning
algorithm is to find a function f attempting to model the dependency encoded in
P(X, Y) between the input X and the output Y. F will denote the set of functions
in which the solution f is sought, such that f \in F, where F is the set of all possible
functions f.
The second fundamental concept is the notion of error or loss to measure the
agreement between the prediction f(X) and the desired output Y. A loss (or cost)
function L is introduced to evaluate this error (see equation 2.3):

L : Y \times Y \to R^+  (2.3)

The choice of the loss function L(f(X), Y) depends on the learning problem
being solved. Loss functions are classified according to their regularity or singularity
properties and according to their ability to produce convex or non-convex criteria for
optimization.
In the case of pattern recognition, where Y = \{-1, +1\}, a common choice for
L is the misclassification error, which is measured as follows:

L(f(X), Y) = \frac{|f(X) - Y|}{2}  (2.4)

This cost is singular and symmetric. Practical algorithmic considerations may
bias the choice of L. For instance, singular functions may be selected for their ability
to provide sparse solutions.
For unsupervised learning the problem may be expressed in a similar way
using a loss function that involves no reference output, as defined in equations (2.5)
and (2.6):

L : X \to R^+  (2.5)

L = L(f(X))  (2.6)

The loss function L leads to the definition of the risk for a function f, also
called the generalization error:

R(f) = \int L(f(x), y) \, dP(x, y)  (2.7)

In classification, the objective could be to find the function f in F that
minimizes R(f). Unfortunately, this is not possible because the joint probability P(x, y)
is unknown. From a probabilistic point of view, using the input and output random
variable notations X and Y, the risk can be expressed as in equation (2.8), which can be
rewritten as two nested expectations in equation (2.9):

R(f) = E_{X,Y}[L(f(X), Y)]  (2.8)

R(f) = E_X[ E_{Y|X}[L(f(X), Y) | X] ]  (2.9)

The expression in equation (2.9) offers the opportunity to separately minimize
E_{Y|X}[L(f(x), Y) | X = x] with respect to the scalar value of f(x). The resulting
function is the Bayes estimator associated with the risk R.
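In practice, since P(X, Y) is unknown, algorithms work with the empirical counterpart of the risk in equation (2.7), averaging the loss over the training set. A tiny sketch of the empirical misclassification risk under the loss of equation (2.4), with invented labels:

```python
# Empirical risk under the misclassification loss of equation (2.4) (toy labels).
import numpy as np

y      = np.array([+1, -1, +1, +1, -1, +1])  # true labels in {-1, +1}
f_of_x = np.array([+1, -1, -1, +1, -1, -1])  # a classifier's predictions

loss = np.abs(f_of_x - y) / 2          # 1 for each misclassified example, else 0
print("empirical risk:", loss.mean())  # fraction of misclassified examples
```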
The learning problem is expressed as a minimization of R for any classifier f.
As the joint probability is unknown, the solution is inferred from the available training
set (x_i, y_i), i = 1, ..., n. There are two ways to address the problem.
The first approach, called generative-based, tries to approximate the joint probability
P(X, Y), or P(Y|X)P(X), and then compute the Bayes estimator with the obtained
probability. The second approach, called discriminative-based, attacks the estimation
of the risk R(f) head on (Liaw and Weiner, 2012). Following is a description of some
of the most popular and effective supervised machine learning algorithms.
a. Decision Trees (DT)
Decision tree learning uses a decision tree as a predictive model which maps
observations about the relevant indicators for CML survival classification X_ij in order
to conclude about the target value – the patient's survival class (survived or not
survived). Decision trees can be either classification or regression trees; for this
study, classification trees were adopted, which can be used as input for decision
making, describing the data as a top-down tree (Quinlan, 1986; Breiman et
al., 1984). Each interior node (starting from the root/parent node) of the tree
represents an attribute (a feature relevant for CML survival), with edges that
correspond to the values/labels of each attribute leading to a child node below;
the process continues for each subsequent value until a leaf is reached. The leaf
is the terminal node, representing the target class (class of CML survival)
alongside the probability distribution over the class (Friedman, 1999).
Such decision tree algorithms include: ID3 (Iterative Dichotomiser 3), C4.5
(an extension of ID3), CART (Classification and Regression Trees), CHAID (Chi-
squared Automatic Interaction Detector), MARS etc. In this study, the C4.5 decision
tree algorithm was considered.
The tree is learned by splitting the training dataset into subsets based on an attribute-value test for each input variable; the process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target class, or when splitting no longer adds value to the predictions. This is also called top-down induction of trees (Rokach and Maimon, 2008), an example of a greedy (divide-and-conquer) algorithm. When constructing the tree, different decision tree algorithms use different metrics for selecting the best attribute to split on; these generally measure the homogeneity of the target class (survival of CML) within the subsets produced by the candidate attributes (relevant indicators for CML survival) (Rokach and Maimon, 2005). Some such metrics include: Gini impurity, information gain, variance reduction, etc.; the C4.5 decision tree algorithm uses the information gain metric. The information gain criterion is defined by equation (2.10).
If $S$ is the training dataset containing the set of attributes $a_i$ predictive of CML survival in patients, and $\mathrm{Values}(a)$ denotes the set of values of an attribute $a$ needed for partitioning $S$, then:

$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{2.10}$$

Where $S_v$ is the subset of $S$ for which attribute $a$ has value $v$, and the entropy is computed over the class proportions:

$$\mathrm{Entropy}(S) = -\sum_{j} \frac{|S_j|}{|S|}\,\log_2 \frac{|S_j|}{|S|} \tag{2.11}$$

with $S_j$ the subset of instances of $S$ belonging to class $j$.
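As an illustration, the short sketch below computes equations (2.10)–(2.11) for a single attribute; the function names and the toy attribute/class lists are illustrative choices, not part of the study's dataset:

```python
from collections import Counter
from math import log2

def entropy(classes):
    # Entropy(S) = -sum_j (|S_j|/|S|) * log2(|S_j|/|S|), equation (2.11).
    n = len(classes)
    return -sum((c / n) * log2(c / n) for c in Counter(classes).values())

def information_gain(values, classes):
    # Gain(S, a) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v),
    # where S is split by the values of attribute a, equation (2.10).
    n = len(classes)
    total = entropy(classes)
    for v in set(values):
        subset = [c for val, c in zip(values, classes) if val == v]
        total -= (len(subset) / n) * entropy(subset)
    return total

# A binary attribute that separates the survival classes perfectly has gain 1.0.
attr = ["high", "high", "low", "low"]
cls = ["survived", "survived", "not", "not"]
print(information_gain(attr, cls))
```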
b. Support Vector Machines (SVM)
Support vector machines (SVMs) also called support vector networks are
supervised learning models with associated learning algorithms that analyze data and
recognize patterns (Cortes and Vapnik, 1995). Consider a training dataset in which the CML survival indicators represent the input vector and the CML survival class of each patient represents the target, taking one of two categories; SVM attempts to build a model that assigns new examples into one category (or the other), making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
In formal terms, SVM constructs a hyper-plane or set of hyper-planes in a
high-dimensional space, which can be applied for classification, regression or any
other task. A good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the support vectors), since in general the larger the margin, the lower the generalization error of the classifier. Although SVMs are supervised learners, they can be used to develop both classification and regression models. In order to calculate the margin between data belonging to the two different classes, two parallel hyper-planes (the blue lines in figure 2.6) are constructed, one on either side of the separating hyper-plane (the solid black line), and pushed up against the two datasets (the corresponding survived and not-survived datasets) until they touch the nearest points of each class.
The parameters of the maximum-margin hyperplane are derived by solving a large quadratic programming (QP) optimization problem. There exist several specialized algorithms for quickly solving the QP problems that arise from SVMs, mostly relying on heuristics for breaking the problem down into smaller chunks. This study implements John Platt's sequential minimal optimization (SMO) algorithm (Platt, 1998) for training the support vector classifier. SMO works by breaking the large QP problem into a series of smallest-possible sub-problems, each involving only two Lagrange multipliers. This study implements the SMO using the algorithm available in the Weka public domain software. This implementation globally replaces all missing values, transforms nominal attributes into binary values, and by default normalizes all data.
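For illustration, an analogous linear SVM can be trained with scikit-learn's SVC, which solves the same underlying QP problem by decomposition; the feature matrix, labels, and normalization step below are placeholders mirroring the preprocessing described above, not the study's actual data or tooling:

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder training data: rows are patients, columns are survival indicators.
X = [[38, 1.2, 0], [60, 3.4, 1], [45, 0.9, 0], [55, 2.8, 1]]
y = [1, -1, 1, -1]  # +1 = survived, -1 = not survived

# Normalizing the inputs mirrors the default preprocessing described above.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X, y)
print(model.predict([[50, 1.5, 0]]))
```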
Considering the use of a linear support vector classifier as shown in figure 2.6, it is assumed that both classes are linearly separable. The training data containing the information about each CML patient in terms of the relevant features (risk indicators for CML survival) is expressed as $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$, while the target class is represented by $y_i \in \{-1, +1\}$. The hyperplane can be defined by $\langle w, x \rangle + b = 0$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$. Since the classes are linearly separable, the following function can be determined:

$$y_i\left(\langle w, x_i \rangle + b\right) \geq 1, \qquad i = 1, \dots, n$$

The decision function may be expressed as $f(x) = \operatorname{sign}(g(x))$ with:

$$g(x) = \langle w, x \rangle + b$$

The SVM classification method aims at finding the optimal hyper-plane based on the maximization of the margin between the training data of both classes. Since the distance between a point $x$ and the hyperplane is $\frac{|\langle w, x \rangle + b|}{\|w\|}$, it is easy to show that the optimization problem takes the following form:

$$\min_{w,\, b}\; \frac{1}{2}\,\|w\|^{2} \quad \text{subject to} \quad y_i\left(\langle w, x_i \rangle + b\right) \geq 1, \; i = 1, \dots, n$$
Figure 2.6: Description of the linear SVM classifier
c. Artificial Neural Network (ANN) - Multi-layer Perceptron (MLP)
An artificial neural network (ANN) is an interconnected group of nodes, akin
to the vast network of neurons in a human brain. In machine learning and cognitive
science, ANNs are a family of statistical learning models inspired by biological neural
networks and are used to estimate or approximate functions that depend on a large
number of inputs and are generally unknown (McCulloch and Pitts, 1943). ANNs are generally presented as systems of interconnected neurons which send messages to each other, such that each connection has a numeric weight that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning.
The word network refers to the inter-connections between the neurons in the
different layers of each system. The first layer has input neurons which send data via
synapses to the middle layer of neurons, and then via more synapses to the third layer
of output neurons. The synapses store parameters called weights that are used to manipulate the data in the calculations. An ANN is typically defined by three (3) types of
parameters, namely:
i. Interconnection pattern between the different layers of neurons;
ii. Learning process for updating the weights of the interconnections; and
iii. Activation function that converts a neuron‘s weighted input to its output
activation.
The simplest kind of neural network is a single-layer perceptron network,
which consists of a single layer of output nodes; the inputs are fed directly to the
outputs via a series of weights. In this way it can be considered the simplest kind of
feed-forward network. The sum of the products of the weights and the inputs is
calculated in each node, and if the value is above some threshold (typically 0) the
neuron fires and takes the activated value (typically 1); otherwise it takes the
deactivated value (typically -1).
A perceptron can be created using any values for the activated and deactivated
states as long as the threshold value lies between the two. Perceptrons can be trained
by a simple learning algorithm that is usually called the delta rule. It calculates the
errors between calculated output and sample output data, and uses this to create an
adjustment to the weights, thus implementing a form of gradient descent.
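A minimal sketch of this training procedure follows (the function name, learning rate, and toy data are our own illustrative choices):

```python
def train_perceptron(samples, targets, lr=0.1, epochs=20):
    # Single-layer perceptron trained with the delta rule:
    # w <- w + lr * (target - output) * input.
    n = len(samples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            out = 1 if s > 0 else -1  # threshold activation
            err = t - out             # error between sample and calculated output
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Example: learn a linearly separable AND-like mapping.
w, b = train_perceptron([[0, 0], [0, 1], [1, 0], [1, 1]], [-1, -1, -1, 1])
```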
Multi-layer networks use a variety of learning techniques, the most popular
being back-propagation. Here, the output values are compared with the correct answer
to compute the value of some predefined error-function. By various techniques, the
error is then fed back through the network. Using this information, the algorithm
adjusts the weights of each connection in order to reduce the value of the error
function by some small amount. After repeating this process for a sufficiently large
number of training cycles, the network will usually converge to some state where the
error of the calculations is small. In this case, one would say that the network has
learned a certain target function.
To adjust weights properly, one applies a general method for non-linear
optimization that is called gradient descent. For this, the derivative of the error
function with respect to the network weights is calculated, and the weights are then
changed such that the error decreases (thus going downhill on the surface of the error
function). For this reason, back-propagation can only be applied on networks with
differentiable activation functions.
Back-propagation, an abbreviation for backward propagation of errors, is a
common method of training artificial neural networks used in conjunction with an
optimization method such as gradient descent. The method calculates the gradient of a
loss function with respect to all the weights in the network. The gradient is fed to the
optimization method which in turn uses it to update the weights, in an attempt to
minimize the loss function. It is a generalization of the delta rule to multi-layered
feed-forward networks, made possible by using the chain rule to iteratively compute
gradients for each layer. Back-propagation requires that the activation function used
by the artificial neurons be differentiable.
The back-propagation learning algorithm can be divided into two phases:
propagation and weight update.
a. Phase 1 – Propagation: each propagation involves the following steps:
i. Forward propagation of training pattern‘s input through the neural
network in order to generate the propagation‘s output activations; and
ii. Backward propagation of the propagation‘s output activations through
the neural network using the training pattern target in order to generate
deltas of all output and hidden neurons.
b. Phase 2 – Weight update: for each weight-synapse, perform the following:
i. Multiply its output delta and input activation to get the gradient of the
weight; and
ii. Subtract a ratio (percentage) of the gradient from the weight.
Assume the input neurons are represented by variables $X_i = \{X_1, X_2, X_3, \dots, X_i\}$, where $i$ is the number of variables (input neurons). The effect of the synaptic weights, $W_i$, on each input neuron at layer $j$ is represented by the expression:

$$net_j = \sum_{i} W_{ij}\, X_i \tag{2.14}$$

The result of equation (2.14) is passed to the activation function (the sigmoid/logistic function), which is applied in order to limit the output to a bounded range, thus:

$$O_j = \varphi(net_j) = \frac{1}{1 + e^{-net_j}} \tag{2.15}$$

The measure of discrepancy between the expected output ($p$) and the actual output ($y$) is made using the squared error measure ($E$):

$$E = (p - y)^{2} \tag{2.16}$$
Recall, however, that the output ($y$) of a neuron depends on the weighted sum of all its inputs, as indicated in equation (2.14); this implies that the error ($E$) also depends on the incoming weights of the neuron, which need to be changed in the network to enable learning. The back-propagation algorithm aims to find the set of weights that minimizes the error. In this study, the gradient descent algorithm is applied in order to minimize the error and hence find the optimal weights that satisfy the problem. Since back-propagation uses the gradient descent method, there is a need to calculate the derivative of the squared error function with respect to the weights of the network.

Hence, the squared error function is now redefined as equation (2.17) (the ½ is required to cancel the exponent of 2 when differentiating):

$$E = \tfrac{1}{2}\,(y - p)^{2} \tag{2.17}$$
For each neuron $j$, its output $O_j$ is defined as:

$$O_j = \varphi(net_j) = \varphi\!\left(\sum_{k=1}^{n} w_{kj}\, O_k\right) \tag{2.18}$$

The input $net_j$ to a neuron is the weighted sum of the outputs $O_k$ of the previous neurons. The number of input neurons is $n$, and the variable $w_{ij}$ denotes the weight between neurons $i$ and $j$. The activation function $\varphi$ is in general non-linear and differentiable; thus, the derivative of equation (2.15) is:

$$\varphi'(z) = \varphi(z)\left(1 - \varphi(z)\right) \tag{2.19}$$

The partial derivative of the error ($E$) with respect to a weight $w_{ij}$ is obtained by applying the chain rule twice, as follows:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial O_j}\,\frac{\partial O_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ij}} \tag{2.20}$$

The last factor can be calculated from equation (2.18), thus:

$$\frac{\partial net_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\!\left(\sum_{k=1}^{n} w_{kj}\, O_k\right) = O_i$$

The derivative of the output of neuron $j$ with respect to its input is the partial derivative of the activation function (the logistic function) in equation (2.19):

$$\frac{\partial O_j}{\partial net_j} = \varphi(net_j)\left(1 - \varphi(net_j)\right) = O_j\,(1 - O_j) \tag{2.21}$$
The first term is evaluated by differentiating the error function in equation (2.17) with respect to the output $y$; if $j$ is in the output layer, so that $y = O_j$, then:

$$\frac{\partial E}{\partial O_j} = \frac{\partial E}{\partial y} = y - p$$

However, if $j$ is in an arbitrary inner layer of the network, finding the derivative of $E$ with respect to $O_j$ is less obvious. Considering $E$ as a function of the inputs of all neurons $l$ receiving input from neuron $j$, and taking the total derivative with respect to $O_j$, a recursive expression for the derivative is obtained:

$$\frac{\partial E}{\partial O_j} = \sum_{l}\left(\frac{\partial E}{\partial net_l}\,\frac{\partial net_l}{\partial O_j}\right) = \sum_{l}\left(\frac{\partial E}{\partial O_l}\,\frac{\partial O_l}{\partial net_l}\, w_{jl}\right) \tag{2.22}$$

Thus, the derivative with respect to $O_j$ can be calculated if all the derivatives with respect to the outputs of the next layer – the one closer to the output neurons – are known. Putting them all together:
$$\frac{\partial E}{\partial w_{ij}} = \delta_j\, O_i \tag{2.23}$$

With:

$$\delta_j = \frac{\partial E}{\partial O_j}\,\frac{\partial O_j}{\partial net_j} =
\begin{cases}
(O_j - p)\; O_j\,(1 - O_j) & \text{if } j \text{ is an output neuron} \\[4pt]
\left(\sum_{l} \delta_l\, w_{jl}\right) O_j\,(1 - O_j) & \text{if } j \text{ is an inner neuron}
\end{cases} \tag{2.24}$$

Therefore, in order to update the weight using gradient descent, one must choose a learning rate, $\eta$. The change in weight, which is added to the old weight, is equal to the product of the learning rate and the gradient, multiplied by $-1$:

$$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}} = -\eta\, \delta_j\, O_i \tag{2.25}$$

Equation (2.25) is used by the back-propagation algorithm to adjust the values of the synaptic weights attached to the inputs of each neuron in equation (2.14), with respect to the inner layers of the multi-layer perceptron classifier.
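To make the two phases and the delta rules of equations (2.23)–(2.25) concrete, here is a minimal back-propagation sketch for a one-hidden-layer network; all names, the learning rate, the network size, and the XOR toy task are illustrative choices, not the study's configuration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(samples, targets, n_hidden=3, lr=0.5, epochs=5000, seed=1):
    # One-hidden-layer MLP trained with back-propagation:
    # delta_out = (O - p) * O * (1 - O); delta_hid_j = w_j * delta_out * O_j * (1 - O_j);
    # each weight changes by -lr * delta * input, per equations (2.24)-(2.25).
    rng = random.Random(seed)
    n_in = len(samples[0])
    # A constant 1.0 appended to each layer's inputs acts as a bias term.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, p in zip(samples, targets):
            xb = list(x) + [1.0]
            # Phase 1: forward propagation through hidden and output layers.
            hid = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_h]
            hb = hid + [1.0]
            out = sigmoid(sum(w * h for w, h in zip(w_o, hb)))
            # Phase 1 (cont.): backward propagation of the deltas.
            d_out = (out - p) * out * (1 - out)
            d_hid = [w_o[j] * d_out * hid[j] * (1 - hid[j]) for j in range(n_hidden)]
            # Phase 2: subtract the learning rate times the gradient.
            w_o = [w - lr * d_out * h for w, h in zip(w_o, hb)]
            for j in range(n_hidden):
                w_h[j] = [w - lr * d_hid[j] * v for w, v in zip(w_h[j], xb)]
    return w_h, w_o

# Example: XOR, a task a single-layer perceptron cannot represent.
w_h, w_o = train_mlp([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0])
```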
2.3.2 General issues of supervised machine learning algorithms
The first step is collecting the dataset required for developing the predictive
model. If a requisite expert is available, he/she suggests which fields (attributes/features) are the most informative. If not, then the simplest method is that
of brute-force, which measures everything available in the hope that the right
(informative or relevant but not redundant) features can be isolated. However, a
dataset collected by the brute-force method is not directly suitable for induction. It
contains in most cases noise and missing feature values, and therefore requires
significant pre-processing (Zhang et al., 2002). For this reason, methods suitable for
removing noise and missing values are important before deciding on the use of the
identified variables needed for developing predictive models using supervised
machine learning algorithms.
a. Data preparation and data pre-processing
There is a hierarchy of problems that are often encountered in data preparation
and preprocessing which includes:
i. Impossible input values;
ii. Unlikely input values;
iii. Missing input values; and
iv. Presence of irrelevant input features in the data.
Impossible values should be detected by the data handling software, ideally at the point of input, so that they can be re-entered. These errors are generally straightforward, such as coming across negative values when positive values are expected. If correct values cannot be entered, the problem is converted into the missing-value category by removing the offending data. Variable-by-variable data cleansing is a filter approach for unlikely values – values that are suspicious given the variable's probability distribution (for example, a value of 10 for a variable whose distribution has a mean of 5 and a standard deviation of 3). Table 2.7 shows examples of how such metadata can help in detecting a number of possible data quality problems.
The process of selecting the instances makes it possible to cope with the
infeasibility of learning from very large dataset. Selection of instances from the
original dataset is an optimization problem that maintains the mining quality while
minimizing the sample size (Liu and Motoda, 2001). It reduces data and enables a
machine learning algorithm to function and work effectively with very large datasets.
Table 2.7: Examples for the use of variable-by-variable data cleansing

Problems         Metadata               Examples/Heuristics
Illegal values   Cardinality            e.g., cardinality (gender) > 2 indicates a problem.
                 Max, Min               Max, min should not be outside the permissible range.
                 Variance, Deviation    Variance, deviation of statistical values should not
                                        be higher than a threshold.
Misspellings     Feature values         Sorting on values often brings misspelled values
                                        next to correct values.

(Source: Kotsiantis et al., 2006)
There are a variety of procedures for sampling instances from large dataset.
The most well-known are:
i. Random sampling, which selects a subset of instances randomly.
ii. Stratified sampling, which is applicable when the class values are not
uniformly distributed in the training sets. Instances of the minority class(es)
are selected with greater frequency in order to even out the distribution.
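A minimal sketch of both procedures, using only the standard library (the function names and the per-class quota are illustrative choices):

```python
import random
from collections import defaultdict

def random_sample(instances, k, seed=0):
    # Random sampling: select k instances uniformly without replacement.
    return random.Random(seed).sample(instances, k)

def stratified_sample(instances, labels, per_class, seed=0):
    # Stratified sampling: draw a fixed number of instances from each class
    # so minority classes are not under-represented in the sample.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst, lab in zip(instances, labels):
        by_class[lab].append(inst)
    sample = []
    for lab, group in by_class.items():
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample
```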
Incomplete data is an unavoidable problem in dealing with most real world
data sources. Generally, there are some important factors to be taken into account
when processing unknown feature values. One of the most important ones is the
source of unknown-ness:
i. A value is missing because it was forgotten or lost;
ii. A certain feature is not applicable for a given instance (e.g., it does not exist
for a given instance).
iii. For a given observation, the designer of a training set does not care about the
value of a certain feature (so-called don’t care values).
Depending on the circumstances, there are a number of methods to choose
from to handle missing data (Batista and Monard, 2003):
i. Method of ignoring instances with unknown feature values: This method is
simplest; it involves ignoring any instances (records) which have at least one
unknown feature value.
ii. Most common feature value: The value of the feature that occurs most often is
selected to be the value for all the unknown values of the feature.
iii. Most common feature value in class: In this case, the value of the feature
which occurs most commonly within the same class is selected to be the value
for all the unknown values of the feature.
iv. Mean substitution: The mean value (computed from available cases) is used to
fill in missing data values on the remaining cases. A more sophisticated
solution than using the general feature mean is to use the feature mean for all
samples belonging to the same class to fill in the missing value.
v. Regression or classification methods: a regression or classification model
based on the complete case data for a given feature is developed. This model
treats the feature as the outcome and uses the other features as predictors.
vi. Hot deck imputation: The case most similar to the case with a missing value is identified, and the missing value is filled in with the corresponding value from that similar case.
vii. Method of treating missing feature values as special values: unknown itself is
treated as a new value for the feature that contains missing values.
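A minimal sketch of methods (ii) and (iv) above – most common feature value and mean substitution – applied to a single feature column (the function name and toy columns are illustrative):

```python
from collections import Counter

def impute(column, strategy="mean"):
    # Fill None entries in a feature column with either the mean
    # (numeric features) or the most common value (nominal features).
    known = [v for v in column if v is not None]
    if strategy == "mean":
        fill = sum(known) / len(known)
    else:  # most common feature value
        fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

print(impute([2.0, None, 4.0]))                      # [2.0, 3.0, 4.0]
print(impute(["a", None, "a", "b"], "most_common"))  # ['a', 'a', 'a', 'b']
```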
b. Feature selection
This is the process of identifying and removing as many irrelevant and
redundant features as possible (Yu and Liu, 2004). This reduces the dimensionality of
the data and enables data mining algorithms to operate faster and more effectively.
Generally, features are characterized as:
i. Relevant: are features that have an influence on the target class (output). Their
role cannot be assumed by the rest.
ii. Irrelevant: are features that do not have any influence on the target class.
Their values could be generated at random and not influence the output.
iii. Redundant: are features that can take the role of another (perhaps the simplest
way to incur model redundancy).
Feature selection algorithms in general have two (2) components:
i. A selection algorithm that generates proposed subsets of features and attempts
to find an optimal subset and
ii. An evaluation algorithm that determines how good a proposed feature subset
is.
However, without a suitable stopping criterion, the feature selection process may run
repeatedly through the space of subsets, taking up valuable computational time. The
stopping criteria might be whether:
i. addition (or deletion) of any feature does not produce a better subset; and
ii. an optimal subset according to some evaluation function is obtained.
The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by
constructing new features from the basic feature set (Markovitch and Rosenstein,
2002). This technique is called feature construction/transformation. These newly
generated features may lead to the creation of more concise and accurate classifiers.
In addition, the discovery of meaningful features contributes to better
comprehensibility of the produced classifier, and a better understanding of the learned
concept.
c. Algorithm selection
The choice of which specific learning algorithm should be used is a critical
step. The classifier's evaluation is most often based on prediction accuracy (the number of correct predictions divided by the total number of predictions). There
are at least three techniques which are used to calculate a classifier‘s accuracy
(Waijee et al., 2013b):
i. One technique is to split the training set by using two-thirds (about 67% of total cases) for training and the remaining one-third for estimating performance (testing).
ii. In another technique, known as cross-validation, the training set is divided
into mutually exclusive and equal-sized subsets and for each subset the
classifier is trained on the union of all other subsets. The average of the error
of each subset is therefore an estimate of the error rate of the classifier.
iii. Leave-one-out validation is a special case of cross-validation in which each test subset consists of a single instance. This type of validation is, of course, more
expensive computationally, but useful when the most accurate estimate of a
classifier‘s error is required.
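A minimal sketch of technique (ii), k-fold cross-validation (leave-one-out is the special case k = n); `train_and_score` is a hypothetical callback that trains on the given training indices and returns the error on the held-out test indices:

```python
def k_fold_indices(n, k=10):
    # Split n instance indices into k mutually exclusive,
    # (nearly) equal-sized test folds.
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(train_and_score, n, k=10):
    # For each fold, train on the union of all other folds and score on
    # the held-out fold; the average error estimates the classifier's error.
    errors = []
    for test in k_fold_indices(n, k):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        errors.append(train_and_score(train, test))
    return sum(errors) / k
```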
If the error rate is unsatisfactory, a variety of factors must be examined:
i. Perhaps relevant features of the problem are not being used;
ii. A larger training set is needed;
iii. The dimensionality of the problem is too high; and/or
iv. The selected algorithm is inappropriate or parameter tuning is needed.
A common method for comparing supervised ML algorithms is to perform statistical comparisons of the accuracies of trained classifiers on a specific dataset.
Several heuristic versions of the t-test have been developed to handle this issue
(Dietterich, 1998; Nadeau and Bengio, 2003).
2.3.3 Machine learning for cancer prediction and prognosis
According to a literature survey on the application of machine learning in
healthcare data by Cruz and Wishart (2006), machine learning is not new to cancer
research. In fact, artificial neural networks (ANNs) and decision trees (DTs) have
been used in cancer detection and diagnosis for nearly 30 years (Circchetti, 1992; Simes, 1985; Machin et al., 1991), from the detection and classification of tumors via X-rays and CRT images (Pertricoin, 2004; Bocchi et al., 2004) to the classification of malignancies from proteomic and genomic (microarray) assays (Zhon et al., 2004; Wang et al., 2005).
The fundamental goals of cancer prediction and prognosis are distinct from the
goals of cancer detection and diagnosis. In cancer prediction/prognosis one is
concerned with three predictive foci, namely:
a. The prediction of cancer susceptibility (i.e. risk assessment): involves an
attempt to predict the likelihood of developing a type of cancer prior to the
occurrence of the disease;
b. The prediction of cancer recurrence: involves the prediction of the likelihood
of redeveloping cancer after the apparent resolution of the disease; and
c. The prediction of cancer survivability: involves the prediction of the outcome
(life expectancy, survivability, progression, tumor-drug sensitivity) after the
diagnosis of the disease.
In the latter two situations, the success of the prognostic prediction is
obviously dependent, in part, on the success or quality of the diagnosis performed.
However, a disease prognosis can only come after a medical diagnosis, and a
prognostic prediction must take into account more than just a simple diagnosis
(Hagerty et al., 2005).
Indeed, a cancer prognosis typically involves multiple physicians from
different specialties using different subsets of biomarkers and multiple clinical
factors, including the age and general health of the patient, the location and type of
cancer, as well as the grade and size of the tumor (Fielding et al., 1992; Cochran et
al., 1997; Burke et al., 2005). Histological (cell-based), clinical (patient-based) and
demographic (population-based) information must be carefully integrated by the
attending physician to come up with a reasonable prognosis. Even for the most skilled clinician, this is not an easy job. Similar challenges exist for physicians and patients alike when it comes to the issues of cancer prevention and cancer susceptibility prediction. Family history, age, diet, Body Mass Index (BMI),
high risk habits (like smoking and drinking) and exposure to environmental
carcinogens (UV radiation, radon and asbestos) all play an important role in
predicting an individual‘s risk for developing cancer (Bach et al., 2003; Gascon et al.,
2004; Domchek et al., 2003).
In the past, the dependency of clinicians and physicians alike on macro-scale
information (tumor, patient, population and environmental data) generally kept the
number of variables small enough so that standard statistical methods or even the
physician‘s own intuition could be used to predict cancer risks and outcomes.
However, with today‘s high-throughput diagnostic and imaging technologies,
physicians are now faced with dozens or even hundreds of molecular, cellular and
clinical parameters. In these situations, human intuition and standard statistics do not
generally work efficiently; rather there is a reliance on non-traditional and intensively
computational approaches such as machine learning (ML). The use of computers
(and machine learning) in disease prediction and prognosis is part of a growing trend
towards personalized, predictive medicine (Weston and Hood, 2004).
Machine learning, like statistics, is used to analyze and interpret data. Unlike statistics, though, machine learning methods can employ Boolean logic (AND, OR,
NOT), absolute conditionality (IF, THEN, ELSE), conditional probabilities (the
probability of X given Y) and unconventional optimization strategies to model data or
classify patterns. These latter methods actually resemble the approaches humans
typically use to learn and classify. Although machine learning draws heavily from statistics and probability, it is still fundamentally more powerful because it allows
inferences or decisions to be made that could not otherwise be made using
conventional statistical methodologies (Mitchell, 1997; Duda et al., 2001).
Many statistical methods employ multivariate regression or correlation
analysis and these approaches assume that the variables are independent and that data
can be modeled using linear combinations of these variables. When the relationships are non-linear and the variables are interdependent (or conditionally dependent), conventional statistics usually flounders. It is in these situations that machine
learning tends to shine. Many biological systems are fundamentally non-linear and
their parameters conditionally dependent. Many simple physical systems are linear
and their parameters are essentially independent.
Knowing which machine learning method is best for a given problem is not inherently obvious. This is why it is critically important to try more than one machine learning method on any given training set. Another common misunderstanding about ML is that the patterns an ML tool finds or the trends it detects are non-obvious or not
intrinsically detectable. On the contrary, many patterns or trends could be detected by
a human expert – if they looked hard enough at the dataset. Machine learning
basically saves the time and effort needed to discover the pattern or to develop the
classification scheme required.
2.4 Feature Selection for the Identification of Relevant Attributes
Feature Selection (FS) is important in machine learning tasks because it can
significantly improve the performance by eliminating redundant and irrelevant
features while at the same time speeding up the learning task (Yildirim, 2015). Given N features, the FS problem is to find the optimal subset among the $2^N$ possible choices. This problem usually becomes intractable as N increases. Feature subset selection is
the process of identifying and removing as much irrelevant and redundant information
as possible (Ashraf, 2013). This reduces the dimensionality of the data and may allow
learning algorithms to operate faster and more effectively (Novakovic, 2011).
In some cases, accuracy on future classification can be improved; in others,
the result is a more compact, easily interpreted representation of the target concept.
Therefore, the correct use of feature selection algorithms for selecting features improves inductive learning, either in terms of generalization capacity, learning speed, or reduction of the complexity of the induced model (Kumar and Minz, 2014).
There are two major approaches to FS. The first is Individual Evaluation, and the
second is Subset Evaluation. In the former, ranking of the features uses a weight to measure the degree of relevance of each feature, while in the latter, candidate subsets of features are constructed using a search strategy.
A feature selection algorithm (FSA) is a computational solution that is
motivated by a certain definition of relevance. However, the relevance of a feature
(or a subset of features) – as seen from inductive learning perspectives – may have
several definitions depending on the objective that is sought by the FS technique.
2.4.1 The relevance of a feature
The purpose of a FSA is to identify relevant features according to a definition
of relevance. However, the notion of relevance in ML has not yet been rigorously
defined by common agreement (Bell and Wang, 2000). Let $E_i$, with $1 \le i \le n$, be the domains of features $X = \{x_1, x_2, x_3, \dots, x_n\}$, and let the instance space be defined as $E = E_1 \times E_2 \times \dots \times E_n$, where an instance is a point in this space. Consider $p$ a probability distribution on $E$ and $T$ a space of target labels. The motive is to model or identify an objective function $c: E \rightarrow T$ according to its relevant features. A dataset $S$ composed of $|S|$ instances can be seen as the result of sampling the instance space $E$ under the distribution $p$ a total of $|S|$ times and labeling its elements using the objective function $c$.
The notion of relevance according to a number of researchers is defined as a
relative relationship between the attributes and the objective function, the probability
distribution, sample, entropy or incremental usefulness (Novakovic et al., 2011; Novakovic, 2009). Following are a number of definitions of the relevance of a feature or set of attributes.
a. Definition I (relevance with respect to an objective function, c): A feature $x_i \in X$ is relevant to an objective $c$ if there exist two examples $A$ and $B$ in the instance space $E$ such that $A$ and $B$ differ only in their assignment to $x_i$ and $c(A) \neq c(B)$.

In other words, there exist two instances that can only be distinguished by $x_i$. This definition has the inconvenience that the learning algorithm cannot necessarily determine whether a feature $x_i$ is relevant or not using only a sample $S$ of $E$ (Wang et al., 1998).
b. Definition II (strong relevance with respect to the sample, S): A feature $x_i \in X$ is strongly relevant to the sample $S$ if there exist two examples $A, B \in S$ that differ only in their assignment to $x_i$ and $c(A) \neq c(B)$.

The definition is the same as I, but now $A, B \in S$ and the definition is with respect to $S$ (Blum and Langley, 1997).
c. Definition III (strong relevance with respect to the distribution, p): A feature $x_i \in X$ is strongly relevant to an objective $c$ in the distribution $p$ if there exist two examples $A$ and $B$ with $p(A) \neq 0$ and $p(B) \neq 0$ that differ only in their assignment to $x_i$ and $c(A) \neq c(B)$.

This definition is the natural extension of II and, contrary to it, the distribution $p$ is assumed to be known.
d. Definition IV (weak relevance with respect to the sample, S): A feature $x_i \in X$ is weakly relevant to the sample $S$ if there exists at least a proper subset of features $X' \subset X$ (with $x_i \in X'$) for which $x_i$ is strongly relevant with respect to $S$.

A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.
e. Definition V (weak relevance with respect to a distribution, p): A feature $x_i \in X$ is weakly relevant to the objective $c$ in the distribution $p$ if there exists at least a proper subset $X' \subset X$ (with $x_i \in X'$) for which $x_i$ is strongly relevant with respect to $p$.
Instead of focusing on which features are relevant, it is possible to use
relevance as a complexity measure with respect to the objective c. In this case, it will
depend on the type of inducer used.
f. Definition VI (relevance as a complexity measure) (Blum and Langley, 1997): Given a data sample S and an objective c, define r(S, c) as the smallest number of features relevant to c (in the sense of Definition I), using only S, such that the error in S is the least possible for the inducer.

It refers to the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.
g. Definition VII (relevance as an incremental usefulness) (Caruana and Freitag, 1994): Given a data sample S, a learning algorithm L, and a subset of features X', the feature xi is incrementally useful to L with respect to X' if the accuracy of the hypothesis that L produces using the group of features $X' \cup \{x_i\}$ is better than the accuracy reached using only the subset of features X'.
This definition is especially natural in feature selection algorithms (FSAs) that
search in the feature space in an incremental way, adding or removing features to a
current solution. It is also related to a traditional understanding of relevance in the
philosophy literature.
h. Definition VIII (relevance as an entropic measure) (Wang et al., 1998): Denoting the (Shannon) entropy by H(x) and the mutual information by I(x; y) = H(x) – H(x|y) (the reduction of the entropy of x produced by knowledge of y), the entropic relevance of x to y is defined as r(x; y) = I(x; y)/H(y). Let X be the original set of features and let C be the objective, seen as a feature; a set $X' \subseteq X$ is sufficient if I(X'; C) = I(X; C) (i.e. if it preserves the learning information). For a sufficient set X', it turns out that r(X'; C) = r(X; C). The most favourable set is the sufficient set for which H(X') is smallest, which implies that r(C; X') is greatest. In short, the aim is to have r(C; X') and r(X'; C) jointly maximized.
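A minimal sketch of this entropic relevance for discrete variables (the names and toy data are illustrative; the mutual information I(x; y) is computed via the identity H(x) + H(y) − H(x, y)):

```python
from collections import Counter
from math import log2

def H(values):
    # Shannon entropy of a discrete variable.
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def entropic_relevance(x, y):
    # r(x; y) = I(x; y) / H(y), with I(x; y) = H(x) - H(x|y),
    # computed here as H(x) + H(y) - H(x, y).
    joint = H(list(zip(x, y)))
    mutual = H(x) + H(y) - joint
    return mutual / H(y)

# A feature that mirrors the class exactly has relevance 1.
cls = ["s", "s", "n", "n"]
print(entropic_relevance(["a", "a", "b", "b"], cls))  # 1.0
```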
2.4.2 Characteristics of feature selection algorithms
Feature selection algorithms (with a few notable exceptions) perform a search
through the space of feature subsets, and, as a consequence, must address four (4)
basic issues affecting the nature of the search (Langley and Sage, 1994; Patil and
Sane, 2014):
a. Starting point
Selecting a point in the feature subset from which to begin the search can
affect the direction of the search. One option is to begin with no features and
successively add attributes. In this case, the search is said to proceed forward through
the search space. Conversely, the search can begin with all features and successively
remove them. In this case, the search proceeds backwards through the search space.
Another alternative is to begin somewhere in between (in the middle) and move
outwards from this point.
b. Search organization
An exhaustive search of the feature subspace is prohibitive for all but a small
initial number of features. With N initial features, there exist $2^N$ possible subsets of features. Heuristic search strategies are more feasible than exhaustive search methods
and can also give good results, although they do not guarantee finding the optimal
subset (Hall et al., 2009). A number of search methods are highlighted as follows:
 BestFirst: It searches the space of attribute subsets by a greedy hill-climbing method in combination with backtracking. Backtracking is triggered when some number of consecutive nodes is found that do not improve the performance. It may apply a forward approach, where it starts from an empty set of attributes and goes on adding attributes one by one. It may also take a backward approach, where it starts from the set of all
attributes and removes them one by one. It may also adopt a midway between both approaches, where the search is done in both directions (by considering all possible single-attribute additions and deletions at a given point), which is also called the hybrid approach (Maji and Garai, 2013).
 GreedyStepwise: Performs a greedy forward or backward search through the
space of attribute subsets. May start with no/all attributes or from an arbitrary
point in the space. Stops when the addition/deletion of any remaining
attributes results in a decrease in evaluation. Can also produce a ranked list of
attributes by traversing the space from one side to the other and recording the
order that attributes are selected.
 Ranker: Individual evaluations of the attributes are done and they are ranked
accordingly (Hua-Liang and Billings, 2007). It is normally used in conjunction
with attribute evaluators (Relief, GainRatio, Entropy etc.).
 Genetic Search: Genetic Algorithms (GAs) (Goldberg, 1989) are
optimization techniques that use a population of candidate solutions. They
explore the search space by evolving the population through four steps: parent
selection, crossover, mutation, and replacement. GAs have been seen as search
procedures that can locate high performance regions of vast and complex
search spaces, but they are not well suited for fine-tuning solutions (Holland,
1992). However, the components of the GAs may be specifically designed and
their parameters tuned, in order to provide effective local search behaviour.
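As an illustration of the forward variants of these searches, the sketch below implements a plain greedy forward selection; `evaluate` is a hypothetical subset-evaluation callback (e.g., a filter merit score or a wrapper's cross-validated accuracy), not part of any of the cited tools:

```python
def greedy_forward_selection(features, evaluate):
    # Greedy forward search through the space of feature subsets:
    # start empty, repeatedly add the single feature that most improves
    # the evaluation, and stop when no addition improves it.
    selected, best = [], evaluate([])
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        score, f = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best:
            break  # stopping criterion: no single addition helps
        selected.append(f)
        best = score
    return selected
```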
c. Evaluation strategy
How feature subsets are evaluated is the single biggest differentiating factor
among most feature selection algorithms for machine learning. One paradigm,
dubbed Filter (using distance, information, consistency, dependency and similar metrics) (Kohavi, 1995; Kohavi and John, 1996), operates independently of any machine learning algorithm – undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. Another school of thought argues that the bias of a
particular induction algorithm should be taken into account when selecting features.
This method, called the wrapper (using predictive accuracy or cluster goodness) uses
an induction algorithm along with a statistical re-sampling technique such as cross-
validation to estimate the final accuracy of feature subsets.
d. Stopping criterion
A feature selector must decide when to stop searching through the space of
feature subsets. Depending on the evaluation strategy, a feature selector might stop
adding or removing features when none of the alternatives improves upon the merit of
a current feature subset. Alternatively, the algorithm might continue to revise the
feature subset as long as the merit does not degrade.
2.4.3 Filter-based feature selection methods
Among the evaluation strategies used by feature selection methods, filter-based feature selection (FS) methods were considered in this study to determine the relevant features among the features present in the data collected from the study location (Maji and Garai, 2013). This is because filter-based FS algorithms define relevance by identifying the attributes that are more correlated with the target class, and because filter-based FS algorithms are less computationally expensive than wrapper-based FS algorithms, which require repeated invocations of the supervised machine learning algorithm.
Three (3) classes of filter-based feature selection methods considered are as
follows:
 Consistency-based
Consistency measures attempt to find a minimum number of features that distinguish between the classes as consistently as the full set of features does. An inconsistency arises when multiple training samples have the same feature values but different class labels. Dash and Liu (1997) presented an inconsistency-based FS technique called Set Cover. An inconsistency count is calculated as the difference between the number of all matching patterns (ignoring the class label) and the largest number of those patterns sharing a single class label, for a chosen subset. If there are n matching patterns in the training sample space, of which c1 patterns belong to class 1 and c2 patterns belong to class 2, and if the larger number is c2, the inconsistency count will be n – c2.
Hence, given a training sample $S$, the inconsistency count of an instance $A$ with respect to a feature subset $X'$ is defined as (Liu and Motoda, 1998):

$$IC_{X'}(A) = n_{X'}(A) - \max_{k}\, n^{k}_{X'}(A) \tag{2.26}$$

Where $n_{X'}(A)$ is the number of instances in $S$ equal to $A$ using only the features in $X'$, and $n^{k}_{X'}(A)$ is the number of instances in $S$ of class $k$ equal to $A$ using only the features in $X'$.

By summing all the inconsistency counts and averaging over the training sample size, a measure called the inconsistency rate for a given subset is defined. The inconsistency rate of a feature subset $X'$ in a sample $S$ is then:

$$IR(X') = \frac{\sum_{A \in S} IC_{X'}(A)}{|S|} \tag{2.27}$$
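A minimal sketch of equations (2.26)–(2.27) for discrete features (the names and toy data are illustrative):

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    # IR(X') = (sum of inconsistency counts) / |S|, where the count for each
    # matching pattern is (number of matches) - (largest single-class count).
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        pattern = tuple(row[i] for i in subset)  # project onto the subset X'
        groups[pattern].append(lab)
    total = sum(len(labs) - max(Counter(labs).values())
                for labs in groups.values())
    return total / len(rows)

rows = [[1, 0], [1, 1], [1, 0]]
print(inconsistency_rate(rows, ["s", "n", "n"], subset=[0]))
# Projected on feature 0, all three rows match but the classes split 1/2,
# so the rate is (3 - 2) / 3.
```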
 Correlation-based (CFS)
Correlation measures are also called similarity measures or dependency measures. Gennari et al. (1989) stated that features are relevant if their values vary systematically with category membership; thus, a feature is useful if it is correlated with (predictive of) the class.
Survival Analysis of Determinants of Breast Cancer Patients at Hossana Queen ...
 
Ijsrp p10758
Ijsrp p10758Ijsrp p10758
Ijsrp p10758
 
From 3D Cell Culture System to Personalized Medicine in Osteosarcoma_Crimson ...
From 3D Cell Culture System to Personalized Medicine in Osteosarcoma_Crimson ...From 3D Cell Culture System to Personalized Medicine in Osteosarcoma_Crimson ...
From 3D Cell Culture System to Personalized Medicine in Osteosarcoma_Crimson ...
 
A017140105
A017140105A017140105
A017140105
 
Effective Cancer Detection Using Soft Computing Technique
Effective Cancer Detection Using Soft Computing TechniqueEffective Cancer Detection Using Soft Computing Technique
Effective Cancer Detection Using Soft Computing Technique
 
Epidemiology, Etiopathogenesis, Pathology, Staging of Plasma Cell Dyscrasias....
Epidemiology, Etiopathogenesis, Pathology, Staging of Plasma Cell Dyscrasias....Epidemiology, Etiopathogenesis, Pathology, Staging of Plasma Cell Dyscrasias....
Epidemiology, Etiopathogenesis, Pathology, Staging of Plasma Cell Dyscrasias....
 
Crimson Publishers-Immunological System Cellular: CD8 Lymphocytes in Children...
Crimson Publishers-Immunological System Cellular: CD8 Lymphocytes in Children...Crimson Publishers-Immunological System Cellular: CD8 Lymphocytes in Children...
Crimson Publishers-Immunological System Cellular: CD8 Lymphocytes in Children...
 
Modeling the Effect of Variation of Recruitment Rate on the Transmission Dyna...
Modeling the Effect of Variation of Recruitment Rate on the Transmission Dyna...Modeling the Effect of Variation of Recruitment Rate on the Transmission Dyna...
Modeling the Effect of Variation of Recruitment Rate on the Transmission Dyna...
 
Real-World Treatment Patterns in Relapsed/Refractory Multiple Myeloma Patient...
Real-World Treatment Patterns in Relapsed/Refractory Multiple Myeloma Patient...Real-World Treatment Patterns in Relapsed/Refractory Multiple Myeloma Patient...
Real-World Treatment Patterns in Relapsed/Refractory Multiple Myeloma Patient...
 
Advances and Problems in Preclinical Models for Childhood Cancer
Advances and Problems in Preclinical Models for Childhood CancerAdvances and Problems in Preclinical Models for Childhood Cancer
Advances and Problems in Preclinical Models for Childhood Cancer
 
Immunotherapy in endometrial
Immunotherapy in endometrialImmunotherapy in endometrial
Immunotherapy in endometrial
 
Lesson 8 Cancer.pptx
Lesson 8 Cancer.pptxLesson 8 Cancer.pptx
Lesson 8 Cancer.pptx
 
REVIEW OF MACHINE LEARNING APPLICATIONS AND DATASETS IN CLASSIFICATION OF ACU...
REVIEW OF MACHINE LEARNING APPLICATIONS AND DATASETS IN CLASSIFICATION OF ACU...REVIEW OF MACHINE LEARNING APPLICATIONS AND DATASETS IN CLASSIFICATION OF ACU...
REVIEW OF MACHINE LEARNING APPLICATIONS AND DATASETS IN CLASSIFICATION OF ACU...
 
Immuno oncology wp
Immuno oncology wpImmuno oncology wp
Immuno oncology wp
 
HIPERPLASIA+ENDOMETRIAL+ART%C3%8DCULO+1.pdf
HIPERPLASIA+ENDOMETRIAL+ART%C3%8DCULO+1.pdfHIPERPLASIA+ENDOMETRIAL+ART%C3%8DCULO+1.pdf
HIPERPLASIA+ENDOMETRIAL+ART%C3%8DCULO+1.pdf
 
Running head INEFFECTIVE CANCER TREATMENTS LEADING TO DEATHS1.docx
Running head INEFFECTIVE CANCER TREATMENTS LEADING TO DEATHS1.docxRunning head INEFFECTIVE CANCER TREATMENTS LEADING TO DEATHS1.docx
Running head INEFFECTIVE CANCER TREATMENTS LEADING TO DEATHS1.docx
 
ANNOTATION ON THE BASIC CONCEPT OF NEOPLASM..PPTX
ANNOTATION ON THE BASIC CONCEPT OF NEOPLASM..PPTXANNOTATION ON THE BASIC CONCEPT OF NEOPLASM..PPTX
ANNOTATION ON THE BASIC CONCEPT OF NEOPLASM..PPTX
 
Introduction to Cancer " part one "
Introduction to Cancer " part one "Introduction to Cancer " part one "
Introduction to Cancer " part one "
 

MSc Thesis - CML Survival Prediction-Final Correction

  • 1. CHAPTER ONE INTRODUCTION 1.1 Background to the Study According to American Cancer Society (2015) Chronic Myeloid Leukaemia (CML) is a type of cancer that affects the blood cells of living organisms. The body is made up of trillions of living cells. Normal body cells grow, divide to make new cells, and die in an orderly way. During the early years of a person‘s life, normal cells divide only to replace worn-out, damaged, or drying cells. Cancer begins when cells in a part of a body start to grow out of control (Cortes et al., 2011). There are many kinds of cancer, but they all start because of this out-of-control growth of abnormal cells. Cancer growth is different from normal cell growth. Instead of dying, cancer cells keep on growing and form new cancer cells. These cancer cells can grow into (invade) other tissues, something that normal cells cannot do (Cortes et al., 2012). Being able to grow out of control and invade other tissues is what makes a cell a cancer cell. In most cases, the cancer cells form a tumor. But some cancers, like Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone marrow. When cancer cells get into the bloodstream or lymph vessels, they can travel to other parts of the body (Kantarjian and Cortes, 2008). There they begin to grow and form new tumors that replace normal tissues. This process is called metastasis. Leukaemia is a type of cancer that starts in cells that form new blood cells. These cells are found in the soft, inner part of the bones called bone marrow. Chronic myeloid Leukaemia (CML), also known as chronic myelogenous Leukaemia, is a fairly slow growing cancer that starts in the bone marrow. It is a type of cancer that affects the myeloid cells – cells that form blood cells, such as red blood cells,
  • 2. 2 platelets, and many type of white blood cells. In CML, Leukaemia cells tend to build up in the body over time. In many cases, people do not have any symptoms for at least a few years. CML can also change into a fast growing, acute Leukaemia that invades almost any organ in the body. Most cases of CML occur in adults, but it is also very rarely found in children. As a rule, their treatment is the same as that for adults. According to Durosinmi et al. (2008), chronic myeloid Leukaemia (CML) has an annual worldwide incidence of 1/100,000 with a male - female ratio of 1.5:1. The median age of the disease incidence is about 60 years (Deninger and Druker, 2003). In Nigeria and other African countries with similar demographic pattern, the median age of the occurrence of CML is 38 years (Boma et al., 2006; Okanny et al., 1989). In the United States of America (USA) however, the incidence of CML in the age group under 70 years is higher among the African-Americans than among any other racial/ethnic groups (Groves et al., 1995). It is probable that a combination of environment and as yet unknown biological factors may account for the differential age incidence pattern of CML between the Blacks and other races in the USA. According to Oyekunle et al. (2012a), pediatric CML is rare, accounting for less than 10% of all cases of CML and less than 3% of all pediatric Leukaemia. Incidence increases with age being exceptionally rare in infancy, it is about 0.7 per million/year at ages 1 -14 years and rising to 1.2 per million/year in adolescents worldwide (Lee and Chung, 2011). To date, only allogenic stem cell transplantation (SCT) remains curative for chronic myeloid leukaemia (Robin et al., 2005), though its role has waned significantly in recent times due to the effectiveness of the tyrosine kinase inhibitors (TKIs) (Oyekunle et al. 2011; Oyekunle et al., 2012b). Although potentially curative,
  • 3. 3 SCT is associated with significant morbidity and mortality (Gratwohl et al., 1998). Alpha interferon-based regimens adequately control the chronic phase of the disease, but result in few long-term survivors (Bonifazi et al., 2001). Advances in targeted therapy resulted in the discovery of Imatinib mesylate, a selective competitive inhibitor of the BCR-ABL protein tyrosine kinase, which has been demonstrated to induce both hematologic and cytogenetic remission in a significant proportion of CML patients (Kantarjian et al., 2002). A number of prognostic scoring systems have been developed for patients with CML, of which the Sokal and Hasford (or Euro) scores are the most popular (Gratwohl et al., 2006). The Sokal score was generated using chronic phase CML patients treated with busulphan or hydroxyurea (Sokal et al., 1984), while the Hasford score was derived and validated using patients treated with Interferon-alpha (Hasford et al., 1998). Survival analysis deals with the application of methods to estimate the likelihood of an event (death, decay, child-birth, etc.) occurring over a variable time period (Dimitologlou et al., 2012); in short, it is concerned with studying the time between entry to a study and a subsequent event (such as death). The traditional statistical methods applied in the area of survival analysis include the Kaplan-Meier (KM) estimator curve (Kaplan and Meier, 1958) and the Cox proportional hazards (PH) model (Cox, 1972). These methods estimate survival parameters for a group of individuals; the Kaplan-Meier estimator is non-parametric, while the Cox model is semi-parametric. The Kaplan-Meier method allows for an estimation of the proportion of a population of people who survive a given length of time under some circumstances. The Cox model is a statistical technique for exploring the relationship between the survival of a patient and several explanatory variables. Before the advent of Imatinib
  • 4. 4 as a treatment option for Chronic Myeloid Leukaemia (CML), the median survival time for CML was 3 – 5 years from the time of diagnosis of the disease (Hosseini and Ahmadi, 2013). According to Gambacorti-Passerimi et al. (2011), a follow-up of 832 patients using Imatinib showed an overall survival rate of 95.2% after 8 years. A 10-year follow-up of 527 patients in Nigeria undergoing Imatinib treatment showed overall survival rates of 92% and 78% after 2 and 5 years respectively (Oyekunle et al., 2013). Machine learning (ML) is a branch of artificial intelligence that allows computers to learn from past examples of data records (Quinlan, 1986; Cruz and Wishart, 2006). Unlike traditional explanatory statistical modeling techniques, machine learning does not rely on prior hypotheses (Waijee et al., 2013a). Machine learning has found great importance in predictive modeling in medical research, especially in the areas of risk assessment, survival and recurrence. Machine learning techniques can be broadly classified into supervised and unsupervised techniques; the former involves matching a set of input records to one out of two or more target classes, while the latter is used to create clusters or attribute relationships from raw, unlabeled or unclassified datasets (Mitchell, 1997). Supervised machine learning algorithms can be used in the development of classification or regression models. A classification model is a supervised approach aimed at allocating a set of input records to a discrete target class, unlike regression, which allocates a set of records to a real value. This research is focused on using classification models to classify patients' survival as either survived or not survived. Feature selection methods are unsupervised machine learning techniques used to identify relevant attributes in a dataset. They are important in identifying irrelevant and
  • 5. 5 redundant attributes that exist within a dataset and may increase computational complexity and time (Yildirim, 2015; Hall, 1999). Feature selection methods are broadly classified as filter-based, wrapper-based and embedded methods; filter-based methods were chosen for this study due to their ability to identify relevant attributes with respect to the target class (CML patient survival), unlike wrapper-based methods, which rely on the performance of the machine learning algorithms. Filter-based feature selection methods were used to identify the most relevant variables that are predictive for CML patient survival from the variables monitored during the follow-up of Imatinib treatment administered to Nigerian CML patients. The relevant features proposed using feature selection were used to formulate the predictive model for CML patients' survival classification using supervised machine learning techniques. 1.2 Statement of Research Problem Chronic Myeloid Leukaemia (CML) is a very serious disease affecting Nigerians; there is just one government referral hospital in Nigeria which administers Imatinib treatment, and it has a limited number of experts compared to the number of cases attended to. In Nigeria, hematologists rely on scoring models proposed using datasets belonging to Caucasian (white race) and/or non-African CML patients undergoing treatment before the Imatinib era (e.g. Sokal used busulphan or hydroxyurea, Hasford used Interferon-alfa, and the European Treatment and Outcome Study). These models have been deemed ineffective for Nigerian CML patients who are undergoing Imatinib treatment, and as such there is presently no existing predictive model in Nigeria specifically for the survival of CML patients undergoing Imatinib treatment. There is a need for a predictive model which will aid clinical decisions
  • 6. 6 concerning continued treatment or alternative action affecting the survival of CML patients receiving Imatinib treatment, hence this study. 1.3 Aim and Objectives of the Study The aim of this research is to develop a predictive model which identifies the relevant attributes required for classifying the survival of Chronic Myeloid Leukaemia patients receiving Imatinib treatment in Nigeria using machine learning techniques. The specific research objectives are to: i. elicit knowledge on the variables monitored during the follow-up of Imatinib treatment; ii. propose the variables predictive for CML survival from (i) and use them to formulate the predictive model; iii. simulate the predictive model formulated in (ii); and iv. validate the model in (iii) using historical data. 1.4 Research Methodology In order to achieve the objectives listed above, the methodological approach for this study used the following methods. Formal interviews were conducted with two (2) hematologists to identify the parameters used to monitor survival, and anonymized, validated information about patients was collected. Filter-based feature selection methods were used to identify the most relevant variables (prognostic factors) predictive for survival from the variables identified, after which the predictive model was formulated using supervised machine learning algorithms.
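The two-step procedure just described (filter-based attribute selection followed by supervised classification) was carried out in this study with WEKA. Purely as an illustration of the same workflow, a minimal Python/scikit-learn sketch follows, using synthetic stand-in data and hypothetical variable names rather than the actual follow-up dataset.

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for the follow-up records; in the study these would
    # be the variables elicited from the hematologists (names hypothetical).
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.random((200, 15)),
                     columns=[f"var{i}" for i in range(15)])
    y = rng.integers(0, 2, size=200)  # target: survived (1) / not survived (0)

    # Step 1 - filter-based feature selection: score each variable against the
    # target class, independently of any learning algorithm's performance.
    selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
    relevant = list(X.columns[selector.get_support()])
    print("Proposed prognostic variables:", relevant)

    # Step 2 - formulate the predictive model on the selected variables only.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[relevant], y)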
  • 7. 7 The formulated model was simulated using the explorer interface of the Waikato Environment for Knowledge Analysis (WEKA) software – a lightweight Java-based suite of machine learning tools – using the preprocess, classify and select attribute packages. The collected historical data was used to validate the performance of the model by determining the confusion matrix, recall, precision, accuracy and the area under the Receiver Operating Characteristics (ROC) curve. 1.5 Research Justification The Nigerian health sector has set ambitious targets for providing essential health services to all citizens; improving the quality of decisions affecting treatment options is essential to reducing disease mortality rates in Nigeria. Predictive models for Chronic Myeloid Leukaemia (CML) survival classification can help identify the most relevant variables for patient survival and thus allow physicians to concentrate on a smaller number of important variables during clinical observations. 1.6 Scope and Limitations of the Study This study is limited to the classification of 2-year and 5-year survival of Nigerian Chronic Myeloid Leukaemia (CML) patients receiving follow-up for Imatinib treatment at Obafemi Awolowo University Teaching Hospital Complex (OAUTHC), Ile-Ife, Osun State. Also, the dataset used for this study was based on information collected from a single centre and a relatively limited number of CML patients. 1.7 Organization of Thesis The first chapter of this thesis has been presented; the organization of the remaining chapters is discussed in the following paragraphs.
  • 8. 8 Chapter two contains the Literature Review which consists of an introduction to chronic myeloid Leukaemia (CML), its etiology, treatment and distribution around the World, Africa and Nigeria; survival analysis and the existing stochastic methods (Kaplan-Meier and the Cox proportional hazard models); Machine learning – supervised, unsupervised and application of machine learning in healthcare; Feature selection methods; Existing survival models and related works. Chapter three contains the Research Methodology which consists of the research framework, data collection methods, data identification and variable description, feature selection results, model formulation methods – supervised machine learning algorithms proposed, model simulation and the performance evaluation metrics to be used. Chapter four contains the Results and discussions which consists of the descriptive statistics of the data collected from the referral hospital, feature selection results and discussions, simulation results and discussions and the performance evaluation of the machine learning algorithms used. Chapter five contains the summary, conclusion, recommendations and the possible future works of the study.
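As a concrete reference for the Kaplan-Meier and Cox proportional-hazards methods introduced in section 1.1 and reviewed in chapter two, a minimal sketch using the Python lifelines library is given below. This is an illustration only (the study itself used WEKA), and the data and column names are synthetic stand-ins.

    import numpy as np
    import pandas as pd
    from lifelines import KaplanMeierFitter, CoxPHFitter

    # Synthetic follow-up data: time on study in months, an event indicator
    # (1 = died, 0 = censored) and one explanatory variable (hypothetical).
    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "months": rng.exponential(40.0, size=150),
        "died": rng.integers(0, 2, size=150),
        "age": rng.normal(38.0, 10.0, size=150),
    })

    # Kaplan-Meier: non-parametric estimate of the survival function S(t).
    km = KaplanMeierFitter().fit(df["months"], event_observed=df["died"])
    print(km.median_survival_time_)

    # Cox proportional hazards: relates survival to explanatory variables.
    cox = CoxPHFitter().fit(df, duration_col="months", event_col="died")
    cox.print_summary()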
  • 9. 9 CHAPTER TWO LITERATURE REVIEW 2.1 Chronic Myeloid Leukaemia (CML) According to DeAngelo and Ritz (2004), chronic myeloid leukaemia (CML) is a clonal hematopoietic disorder characterized by the reciprocal translocation involving chromosomes 9 and 22. As a result of this translocation, a novel fusion gene, BCR-ABL, is created, and the constitutive activity of this tyrosine kinase plays a central role in the pathogenesis of the disease process. Cancer begins when cells in a part of the body start to grow out of control. There are many kinds of cancer, but they all start because of this out-of-control growth of abnormal cells (NCCN, 2014). Cancer cell growth is different from normal cell growth. Instead of dying, cancer cells keep on growing and form new ones. These cancer cells can grow into (invade) other tissues, something that normal cells cannot do (National Cancer Institute (NCI), 2011). Being able to grow out of control and invade other tissues is what makes a cell a cancer cell. In most cases the cancer cells form a tumor. But some cancers, like Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone marrow (see Figure 2.1 for the family tree of blood cells). Leukaemia is a group of cancers that usually begins in the bone marrow and results in a high number of abnormal white blood cells (NCI, 2011). These white blood cells are not fully developed and are called blasts or Leukaemia cells. Symptoms may include bleeding and bruising problems, feeling very tired, fever and an increased risk of infections.
  • 10. 10 Figure 2.1: Family Tree of Blood Cells (Source: NCCN, 2014)
  • 11. 11 These symptoms occur due to a lack of normal blood cells. Diagnosis is typically by blood tests or bone marrow biopsy. Clinically and pathologically, Leukaemia is subdivided into a variety of large groups. The first division is between its acute and chronic forms: Acute Leukaemia is characterized by a rapid increase in the number of immature blood cells (Locatelli and Niemeyer, 2015). Crowding due to such cells makes the bone marrow unable to produce healthy blood cells. Immediate treatment is required in acute Leukaemia due to the rapid progression and accumulation of the malignant cells, which then spill over into the bloodstream and spread to other organs of the body (Dohner et al., 2015). Acute forms of Leukaemia are the most common forms of Leukaemia in children. Chronic Leukaemia is characterized by the excessive buildup of relatively mature, but still abnormal, blood cells. Typically taking months or years to progress, the cells are produced at a much higher rate than normal, resulting in many abnormal white blood cells (Shen et al., 2007). Whereas acute Leukaemia must be treated immediately, chronic forms are sometimes monitored for some time before treatment to ensure maximum effectiveness of therapy (Provan and Gribben, 2010). Chronic Leukaemia mostly occurs in older people, but can theoretically occur in any age group. Additionally, the diseases are subdivided according to which kind of blood cell is affected (Hira et al., 2014). This split divides Leukaemias into lymphoblastic or lymphocytic Leukaemias and myeloid or myelogenous Leukaemias (Table 2.1):
  • 12. 12 Table 2.1: The Four Major Kinds of Leukaemia
Cell Type | Acute | Chronic
Lymphocytic Leukaemia (or lymphoblastic) | Acute lymphoblastic Leukaemia (ALL) | Chronic lymphocytic Leukaemia (CLL)
Myelogenous Leukaemia (myeloid or nonlymphocytic) | Acute myelogenous Leukaemia (AML or myeloblastic) | Chronic myelogenous Leukaemia (CML)
(Source: Hira et al., 2014)
  • 13. 13 In lymphoblastic or lymphocytic Leukaemias, the cancerous change takes place in a type of marrow cell that normally goes on to form lymphocytes, which are infection-fighting immune system cells. Most lymphocytic Leukaemias involve a specific subtype of lymphocyte, the B cell. In myeloid or myelogenous Leukaemias, the cancerous change takes place in a type of marrow cell that normally goes on to form red blood cells, some other types of white cells, and platelets. Chronic myelogenous (or myeloid or myelocytic) Leukaemia (CML), also known as chronic granulocytic Leukaemia (CGL), is a cancer of the white blood cells. It is a form of Leukaemia characterized by increased and unregulated growth of predominantly myeloid cells in the bone marrow and the accumulation of these cells in the blood. CML is a clonal bone marrow stem cell disorder in which a proliferation of mature granulocytes (neutrophils, eosinophils and basophils) and their precursors is found. It is a type of myeloproliferative disease associated with a characteristic chromosomal translocation called the Philadelphia chromosome (Figure 2.2). Chronic myeloid Leukaemia (CML) is defined by the presence of the Philadelphia chromosome (Ph), which arises from the reciprocal translocation of the ABL1 and BCR genes on chromosomes 9 and 22 respectively (Oyekunle et al., 2012c). CML is characterized by the proliferation of a malignant clone containing the BCR-ABL1 mutant fusion gene, resulting in myeloid hyperplasia and peripheral blood leucocytosis and thrombocytosis. It is believed that pediatric CML is rare, accounting for less than 10% of all cases of CML and less than 3% of all pediatric Leukaemias (Lee and Chung, 2011). Incidence increases with age: it is exceptionally rare in infancy, about 0.7 per million/year at ages 1 – 14 years, and rises to 1.2 per million/year in adolescents (National Cancer Institute (NCI), 2011).
  • 14. 14 Figure 2.2: Philadelphia Chromosome and BCR-ABL gene
  • 15. 15 Generally, children are diagnosed at a median age of 11 – 12 years (range, 1 – 18 years), with approximately 10% presenting in advanced phases (Suttorp and Millot, 2010). 2.1.1 CML diagnosis According to the National Comprehensive Cancer Network (NCCN) Guideline for Patients on CML (2014), in order to diagnose Chronic Myeloid Leukaemia (CML), doctors use a variety of tests to analyze the blood and marrow cells, because no single specialized test diagnoses CML on its own. The best form of diagnosis is the early report of symptoms. The following are a number of tests useful in diagnosing CML in patients. a. Complete Blood Count (CBC) This test is used to measure the number and types of cells in the blood. According to Tefferi et al. (2005), people with CML often have: a decreased hemoglobin concentration; an increased white blood cell count, often to very high levels; and a possible increase or decrease in the number of platelets, depending on the severity of the person's CML. Blood cells are stained (dyed) and examined with a light microscope. These samples show: a specific pattern of white blood cells; a small proportion of immature cells (leukemic blast cells and promyelocytes); and a larger proportion of maturing and fully matured white blood cells (myelocytes and neutrophils). These blast cells, promyelocytes and myelocytes are normally not present in the blood of healthy individuals. b. Bone Marrow Aspiration and Biopsy These tests are used to examine marrow cells to find abnormalities and are generally done at the same time (Raanani et al., 2005). The sample is usually taken from the patient's hip bone after medicine has been given to numb the skin (Figure
  • 16. 16 2.3). For a bone marrow aspiration, a special needle is inserted through the hip bone and into the marrow to remove a liquid sample of cells. For a bone marrow biopsy, a special needle is used to remove a core sample of bone that contains marrow. Both samples are examined under a microscope to look for chromosomal and other cell changes (Vardiman et al., 2001). c. Cytogenetic Analysis This test measures the number and structure of the chromosomes. Samples from the bone marrow are examined to confirm the blood test findings and to see if there are chromosomal changes or abnormalities, such as the Philadelphia (Ph) chromosome (Cortes et al., 1995). The presence of the Ph chromosome (the shortened chromosome 22) in the marrow cells, along with a high white blood cell count and other characteristic blood and marrow test findings, confirms the diagnosis of CML. The bone marrow cells of most people with CML have a Ph chromosome detectable by cytogenetic analysis (Aurich et al., 1998). A small percentage of people with clinical signs of CML do not have a cytogenetically detectable Ph chromosome, but they almost always test positive for the BCR-ABL fusion gene on chromosome 22 with other types of tests. d. FISH (Fluorescence In Situ Hybridization) FISH is a more sensitive method for detecting CML than the standard cytogenetic tests that identify the Ph chromosome (Mark et al., 2006; Tkachuk et al., 1990). FISH is a quantitative test that can identify the presence of the BCR-ABL gene (Figure 2.4). Genes are made up of DNA segments. FISH uses color probes that bind to DNA to locate the BCR and ABL genes in chromosomes. Both BCR and ABL genes are labeled with chemicals, each of which releases a different color of light (Landstrom and Tefferi, 2006).
  • 17. 17 Figure 2.3: Bone marrow biopsy
  • 18. 18 Figure 2.4: Identifying the BCR-ABL Gene Using FISH
  • 19. 19 The color shows up on the chromosome that contains the gene (normally chromosome 9 for ABL and chromosome 22 for BCR), so FISH can detect the piece of chromosome 9 that has moved to chromosome 22. The BCR-ABL fusion gene is shown by the overlapping colors of the two probes. Since this test can detect BCR-ABL in cells found in the blood, it can be used to determine if there is a significant decrease in the number of circulating CML cells as a result of treatment. e. Polymerase Chain Reaction (PCR) The BCR-ABL gene is also detectable by molecular analysis. A quantitative PCR test is the most sensitive molecular testing method available. This test can be performed with either blood or bone marrow cells (Branford et al., 2008). The PCR test essentially increases or "amplifies" small amounts of specific pieces of either RNA or DNA to make them easier to detect and measure. So, the BCR-ABL gene abnormality can be detected by PCR even when present in a very low number of cells (Hughes et al., 2003). About one abnormal cell in one million cells can be detected by PCR testing. Quantitative PCR is used to determine the relative number of cells with the abnormal BCR-ABL gene in the blood (Hughes et al., 2003). This has become the most used and relevant type of PCR test because it can measure small amounts of disease, and the test is performed on blood samples, so there is no need for a bone marrow biopsy procedure. Blood cell counts, bone marrow examinations, FISH and PCR may also be used to track a person's response to therapy once treatment has begun (Ohm, 2013). Throughout treatment, the number of red blood cells, white blood cells, platelets and CML cells is also measured on a regular basis. 2.1.2 Phases of chronic myeloid Leukaemia Staging is the process of finding out how far a cancer has spread. Most types of cancer are staged based on the size of the tumor and how far it has spread from
  • 20. 20 where it started. This system does not work for Leukaemias because they do not often form a solid mass or tumor (Vardiman et al., 2001; Millot et al., 2005). Also, Leukaemia starts in the bone marrow and, in many people, it has already spread to other organs when it is found. For someone with chronic myeloid Leukaemia (CML), the outlook depends on other factors, such as the features of the cells shown in lab tests and the results of imaging studies (Raanani et al., 2005). This information helps guide treatment decisions. In chronic myeloid Leukaemia, there are three phases. As the number of blast cells increases in the blood and bone marrow, there is less room for healthy white blood cells, red blood cells and platelets. This may result in infections, anemia and easy bleeding, as well as bone pain and pain or a feeling of fullness below the ribs on the left side. The number of blast cells in the blood and bone marrow and the severity of signs and symptoms determine the phase of the disease (NCI, 2011). The three (3) phases of CML are: a. Chronic Phase; b. Accelerated Phase; and c. Blast Crisis Phase. A number of patients progress from the chronic phase, which can usually be well-managed, to the accelerated phase or blast crisis phase. This is because there are additional genetic changes in the leukemic stem cells. Some of these additional chromosome abnormalities are identifiable by cytogenetic analysis (Cortes et al., 1995a; Aurich et al., 1998). However, there appear to be other genetic changes (low levels of drug-resistant mutations that may be present at diagnosis) in the CML stem cells that cannot be identified by the laboratory tests that are currently available.
  • 21. 21 a. Chronic Phase The chronic phase is the first phase of CML; the number of white blood cells is increased, and immature white blood cells (blasts) make up less than 10% of cells in the peripheral blood and/or bone marrow. This means that fewer than 10 out of every 100 cells are blasts. CML in the chronic phase may cause mild symptoms, but most often it causes no symptoms at all, since the changes in the blood cells are not yet severe. In this phase, the cancer progresses very slowly. Thus, CML in this phase may progress over several months or years. In general, people with CML in the chronic phase respond better to treatment. b. Accelerated Phase The accelerated phase is the second phase of CML. In this phase, the number of blast cells in the peripheral blood and/or bone marrow is usually higher than normal. Other aspects of the accelerated phase can include increased basophils, very low platelets, or new chromosome changes. The number of white blood cells is also high. In this phase, the Leukaemia cells grow more quickly and may cause symptoms such as anemia and an enlarged spleen. A few different criteria groups can be used to define the accelerated phase; the two most commonly used are the World Health Organization criteria and the criteria from the MD Anderson Cancer Centre (Table 2.2). c. Blast Crisis Phase The blast phase is the final phase of CML progression. Also referred to as 'blast crisis', CML in this phase can be life-threatening (NCCN, 2014). There are two criteria groups that may be used to define the blast phase (Table 2.3).
  • 22. 22 Table 2.2: Criteria for Accelerated Phase
MD Anderson | World Health Organization
15% blasts in peripheral blood | 10% to 19% blasts in peripheral blood and/or bone marrow
30% blasts and promyelocytes in peripheral blood | 20% basophils in peripheral blood
20% basophils in peripheral blood | Very high or very low platelet count that is unrelated to treatment
Very low platelet count that is unrelated to treatment | Increasing spleen size and white blood cell count despite treatment
New chromosome changes (mutations) | New chromosome changes (mutations)
(Source: Faderi et al., 1999; Swerdlow et al., 2008)
  • 23. 23 Table 2.3: Criteria for Blast Phase
World Health Organization | International Bone Marrow Transplant Registry
20% blasts in peripheral blood or bone marrow | 30% blasts in peripheral blood or bone marrow
Blasts found outside of blood or bone marrow | Blasts found outside of blood or bone marrow
Large groups of blasts found in bone marrow | (no equivalent criterion)
(Source: Swerdlow et al., 2008; Druker, 2007)
  • 24. 24 In this phase, the number of blast cells in the peripheral blood and/or bone marrow is very high. Another defining feature of the blast phase is that the blast cells have spread outside the blood and/or bone marrow into other tissues (Swerdlow et al., 2008). In the blast phase, the Leukaemia cells may be more similar to Acute myeloid Leukaemia (AML) or Acute lymphoblastic Leukaemia (ALL). AML causes too many immature white blood cells called myeloblasts to be made. ALL results in too many immature white blood cells called lymphoblasts. 2.1.3 Chronic myeloid Leukaemia (CML) Treatment There is more than one treatment for chronic myeloid Leukaemia; the type of treatment depends on factors such as age, general health, and the phase of the cancer. Some people with CML may have more than one treatment (Talpaz et al., 2006; Kantarjian et al., 2006; Cortes et al., 2007). Primary treatment is the main treatment used to rid the body of cancer. Tyrosine Kinase Inhibitors (TKIs) are often used as primary treatment for CML. First-line treatment is the first set of treatments given to CML patients; if the treatment fails, second-line treatment is the next treatment or set of treatments given (Lichtman et al., 2006). This is also referred to as follow-up treatment, since it is given after follow-up tests show that the previous treatment failed or stopped working. a. Tyrosine kinase inhibitor (TKI) therapy TKI (tyrosine kinase inhibitor) therapy is a type of targeted therapy used to treat CML. Targeted therapy is treatment with drugs that target a specific or unique feature of cancer cells not generally present in normal cells; as a result of targeting cancer cells, they may be less likely to harm normal cells throughout the body (Jabbour et al., 2012). TKIs target the abnormal BCR-ABL protein that causes the overgrowth of abnormal white blood cells (CML cells). The BCR-ABL protein, made
  • 25. 25 by the BCR-ABL gene, is a type of protein called a tyrosine kinase. Tyrosine kinases are proteins located on or near the surface of cells, and they tell cells when to grow and divide to make new cells (Jabbour et al., 2007). TKIs block (inhibit) the BCR-ABL protein from sending the signals that cause too many abnormal white blood cells to form. However, each TKI works in a different way. The FDA (Food and Drug Administration) approved the first TKI for the treatment of CML in 2001. Since then, several TKIs have been developed to treat CML. The newer drugs are referred to as second-generation TKIs. The TKIs used to treat CML are listed in Table 2.4; the drugs are made in the form of pills that are swallowed by the patient. The dose of the drug is measured in mg (milligrams). Imatinib was the first TKI to be approved to treat CML. Thus, it is called a first-generation TKI. Imatinib works by binding to the active site on the BCR-ABL protein to block it from sending signals to make new abnormal white blood cells (CML cells). Figure 2.5 shows how Imatinib treatment works. Dasatinib is a second-generation TKI that was approved for the treatment of CML in 2006. Dasatinib is more potent than Imatinib and can bind to the active and inactive sites on the BCR-ABL protein to block growth signals. Nilotinib was approved to treat CML in 2007. It is a second-generation TKI that works in almost the same way as Imatinib. However, Nilotinib is more potent than Imatinib, and it more selectively targets the BCR-ABL protein. Nilotinib also targets other proteins apart from the BCR-ABL protein. Bosutinib was approved to treat CML in 2012. However, this second-generation TKI is only approved to treat patients who experienced intolerance or resistance to prior TKI therapy. It also targets other proteins in the same way as Nilotinib.
  • 26. 26 Table 2.4: Tyrosine Kinase Inhibitor (TKI) drugs used to treat CML
Generic name | Brand name (sold as) | Approved for
Imatinib | Gleevec® | First-line treatment for: 1. Newly diagnosed adults and children in chronic phase; 2. Adults in chronic, accelerated or blast phase after failure of interferon-alfa therapy
Dasatinib | Sprycel® | First-line treatment for: 1. Newly diagnosed adults in chronic phase; 2. Adults resistant or intolerant to prior therapy in chronic, accelerated or blast phase
Nilotinib | Tasigna® | First-line treatment for: 1. Newly diagnosed adults in the chronic phase; 2. Adults resistant or intolerant to prior therapy in chronic or accelerated phase
Bosutinib | Bosulif® | Second-line treatment for: Adults with chronic, accelerated or blast phase with resistance or intolerance to prior therapy
  • 27. 27 Figure 2.5: How Imatinib works
  • 28. 28 Side effects are new or worsened, unplanned physical or emotional conditions caused by treatment. Each TKI for CML can cause side effects, which depend on the drug, the amount taken, the length of treatment, and the person; most side effects can be managed or even prevented. Supportive care is the treatment of symptoms caused by CML or of side effects caused by CML treatment. b. Immunotherapy The immune system is the body's natural defense against infection and disease. Immunotherapy is treatment with drugs that boost the immune system response against cancer cells (Sharma et al., 2011). Interferon is a substance naturally made by the immune system. Interferon can also be made in a laboratory to be used as immunotherapy for CML. PEG (pegylated) interferon is a long-acting form of the drug. Interferon is not recommended as a first-line treatment option for patients with newly diagnosed CML, but it may be considered for patients unable to tolerate TKIs (NCCN, 2014). Interferon is often given as a liquid that is injected under the skin or into a muscle with a needle. c. Chemotherapy Chemotherapy is a type of drug commonly used to treat cancer. Many people refer to this treatment as "chemo". Chemotherapy drugs kill cells that grow rapidly, including cancer cells and normal cells. Different types of chemotherapy drugs attack cancer cells in different ways. Therefore, more than one drug is often used (Bluhm, 2011). Omacetaxine is one of the chemotherapy drugs used for CML treatment; it was approved in 2012 by the FDA for patients with resistance and/or intolerance to two or more TKIs. Resistance is when a CML patient does not respond to treatment; intolerance is when treatment with a drug must be stopped due to severe side effects.
  • 29. 29 Omacetaxine works in part by blocking cells from making some of the proteins, such as the BCR-ABL protein, needed for cell growth and division. This may slow or even stop the growth of new CML cells. Omacetaxine is administered as a liquid that is injected under the skin with a needle. Other chemotherapy drugs may be given as a pill that is swallowed (NCCN, 2014). Chemotherapy is given in cycles of treatment days followed by days of rest. The number of treatment days per cycle and the total number of cycles vary depending on the chemotherapy drug given. d. Stem cell transplant and donor lymphocyte infusion An HSCT (hematopoietic stem cell transplant) is a medical procedure that kills damaged or diseased blood stem cells in the body and replaces them with healthy stem cells. HSCT is currently the only treatment for CML that may cure rather than control the cancer. However, the excellent results with TKIs have challenged the role of HSCT as the first line of treatment – the first set of treatments given to treat a disease. For the treatment of CML, healthy blood stem cells are collected from another person, called a donor. This is called an allogenic HSCT. An allogenic HSCT creates a new immune system for the body. The immune system is the body's natural defense against infection and disease. For this type of transplant, Human Leukocyte Antigen (HLA) testing is needed to check whether the patient and donor are a good match. A Donor Lymphocyte Infusion (DLI) is a procedure in which the patient receives lymphocytes from the same person who donated blood stem cells for the HSCT. A lymphocyte is a type of white blood cell that helps the body to fight infections. The purpose of the DLI is to stimulate an immune response called the Graft-versus-tumor (GVT) effect, or Graft-versus-Leukaemia (GVL) effect. The GVT
  • 30. 30 effect is when the transplanted cells (the graft) see the cancer cells (tumor/Leukaemia) as foreign and attack them. This treatment may be used after HSCT for CML that did not respond to the transplant or that came back after an initial response. 2.1.4 Measuring CML treatment response Measuring the response to treatment with blood and bone marrow testing is a very important part of treatment for people with CML. In general terms, the greater the response to drug therapy, the longer the disease will be controlled. Other factors that affect a person's response to treatment include the stage of the disease and the features of the individual's CML at the time of diagnosis. Nearly all people with chronic phase CML have a "complete hematologic response" with Gleevec, Sprycel or Tasigna therapy; most of these people will eventually achieve a "complete cytogenetic response." Patients who have a complete cytogenetic response often continue to have a deeper response and achieve a "major molecular response." Additionally, a growing number of patients achieve a "complete molecular response"; Table 2.5 explains each term. 2.1.5 Imatinib treatment for Nigerian CML patients According to Oyekunle et al. (2012b), Nigerian CML patients are presently treated using Imatinib as the first line of treatment. Chromosome analysis is done using cultured bone marrow aspirate samples; Philadelphia chromosomes are estimated from the metaphases, and the proportion of Ph+ cells is noted. Patients in the chronic phase receive oral Imatinib, 400 mg daily, while those in the accelerated or blastic phase receive 600 mg daily. Imatinib is continued for as long as there is evidence of continued benefit from therapy.
  • 31. 31 Table 2.5: Chronic Myeloid Leukaemia (CML) Treatment Responses
Type of Response | Features | Test Used to Measure Response
Hematologic: Complete Hematologic Response (CHR) | Blood counts completely return to normal; no blasts in peripheral blood; no signs/symptoms of disease (spleen returns to normal size) | Complete Blood Count (CBC) with differential
Cytogenetic: Complete Cytogenetic Response (CCyR) | No Philadelphia (Ph) chromosomes detected | Bone marrow cytogenetics
Cytogenetic: Partial Cytogenetic Response (PCyR) | 1% - 35% of cells have the Ph chromosome | Bone marrow cytogenetics
Cytogenetic: Major Cytogenetic Response | 0% - 35% of cells have the Ph chromosome | Bone marrow cytogenetics
Cytogenetic: Minor Cytogenetic Response | More than 35% of cells have the Ph chromosome | Bone marrow cytogenetics
Molecular: Complete Molecular Response (CMR) | No BCR-ABL gene detectable | Quantitative PCR (QPCR) using the International Scale (IS)
Molecular: Major Molecular Response (MMR) | At least a 3-log reduction(1) in BCR-ABL levels, or BCR-ABL <= 0.1% | Quantitative PCR (QPCR) using the International Scale (IS)
(1) A 3-log reduction is a 1/1,000 (or 1,000-fold) reduction from the level at the start of treatment.
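The molecular-response rows of Table 2.5 are simple thresholds and can be encoded directly. The sketch below is illustrative only, assuming a BCR-ABL transcript level already expressed on the International Scale in percent; in practice, calling a complete molecular response also depends on assay sensitivity.

    def molecular_response(bcr_abl_is_percent: float) -> str:
        """Classify molecular response from a BCR-ABL level (International Scale, %).

        A 3-log (1,000-fold) reduction from the standardized 100% baseline
        corresponds to BCR-ABL <= 0.1%, i.e. a major molecular response.
        """
        if bcr_abl_is_percent == 0.0:
            return "Complete Molecular Response (CMR)"  # no BCR-ABL detected
        if bcr_abl_is_percent <= 0.1:
            return "Major Molecular Response (MMR)"     # >= 3-log reduction
        return "No major molecular response"

    print(molecular_response(0.05))  # -> Major Molecular Response (MMR)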
  • 32. 32 Allopurinol (300 mg daily) is given until the leucocyte count falls below 20 x 10⁹/L. Patients with hyperleucocytosis (leucocyte count > 100 x 10⁹/L) who are on hydroxyurea continue on the latter for another 1 – 3 weeks, with monitoring of the full blood count, before final withdrawal of the drug when the white cell count falls to less than 100 x 10⁹/L. In individuals with severe Imatinib-induced myelosuppression, the drug is withheld until the neutrophils rise to 1.5 x 10⁹/L and the platelet count to at least 75 x 10⁹/L. Patients with recurrent, therapy-induced myelosuppression can have the Imatinib dose reduced to 300 mg daily (the minimum dose for therapeutic blood levels in adults) until the blood count normalizes. However, if the myelosuppression is related to blastic transformation, Imatinib is discontinued, with appropriate supportive therapy being given. Women of child-bearing age are advised to use barrier contraception. Imatinib treatment is withdrawn in patients who develop neutropenia (<1,000/mm³) or thrombocytopenia (<75,000/mm³) while on therapy, until the cytopenias are corrected, and is re-commenced at lower doses. 2.2 Predictive Modeling Predictive research aims at predicting future events or outcomes based on patterns within a set of variables, and it has become increasingly popular in medical research (Agbelusi, 2014; Idowu et al., 2015). Accurate predictive models can inform patients and physicians about the future course of an illness or the risk of developing illness and thereby help guide decisions on screening and/or treatment (Waijee et al., 2013a). There are several important differences between traditional explanatory research and predictive research. Explanatory research typically applies statistical methods to test causal hypotheses using prior theoretical constructs. In contrast, predictive research applies statistical methods and/or machine learning techniques,
  • 33. 33 without preconceived theoretical constructs, to predict future outcomes (e.g. predicting the risk of hospital readmission) (Breiman, 1984). Although predictive models may be used to provide insight into the causality or pathophysiology of the outcome, causality is neither a primary aim nor a requirement for variable inclusion (Moons et al., 2009). Non-causal predictive factors may be surrogates for other drivers of disease, with tumor markers as predictors of cancer progression or recurrence being the most common example. Unfortunately, a poor understanding of the differences in methodology between explanatory and predictive research has led to a wide variation in the methodological quality of prediction research (Hemingway et al., 2009). 2.2.1 Types of predictive models Machine learning has previously been used to predict behavioral outcomes in business, such as identifying consumer preferences for products based on prior purchasing history. A number of different techniques to develop predictive algorithms exist, using a variety of predictive analytic tools/software, and have been described in detail in the literature (Waijee et al., 2010; Siegel et al., 2011). Some examples include neural networks, support vector machines, decision trees, naïve Bayes etc. Decision trees, for example, use techniques such as classification and regression trees, boosting and random trees to predict various outcomes. Machine learning algorithms, such as random-forest approaches, have several advantages over traditional explanatory statistical modeling, such as the lack of a predefined hypothesis, making them less likely to overlook unexpected associations (Liaw and Weiner, 2012). Approaching a predictive problem without a specific causal hypothesis can be quite effective when many potential predictors are available and
  • 34. 34 when there are interactions between predictors, which are common in engineering, biological and social causative processes. Predictive models using machine learning algorithms may therefore facilitate the recognition of important variables that might otherwise not be identified (Waijee et al., 2010). In fact, many examples of the discovery of unexpected predictor variables exist in the machine learning literature (Singal et al., 2013). 2.2.2 Developing a predictive model The first step in developing a predictive model, when using traditional regression analysis, is selecting relevant candidate predictor variables for possible inclusion in the model; however, there is no consensus on the best strategy for doing so (Royston et al., 2009). A backward-elimination approach starts with all candidate variables, and hypothesis tests are sequentially applied to determine which variables should be removed from the final model, whereas a full-model approach includes all candidate variables to avoid potential over-fitting and selection bias. Previously reported significant predictor variables should typically be included in the final model regardless of their statistical significance, but the number of variables included is usually limited by the sample size of the dataset (Greenland, 1989). Inappropriate selection of variables is an important and common cause of poor model performance in this situation. Selection of variables is less of an issue when using machine learning techniques, given that they are often not solely based on predefined hypotheses (Ibrahim et al., 2012). There are several other important issues relating to data management when developing a predictive model, such as dealing with missing data and variable transformation (Kaambwa et al., 2012; Waijee et al., 2013b).
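Missing-data handling and variable transformation are often chained with the learning algorithm itself, so that exactly the same preprocessing is applied to derivation and validation data. A minimal scikit-learn sketch, with illustrative settings:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Impute missing values, standardize the variables, then fit a classifier;
    # bundling the steps prevents information leaking from validation data.
    model = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("classify", LogisticRegression(max_iter=1000)),
    ])
    # model.fit(X_train, y_train) would then apply all three steps in order.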
  • 35. 35 2.2.3 Validating a predictive model For a prediction model to be valuable, it must not only have predictive ability in the derivation cohort but must also perform well in a validation cohort (Hemingway et al., 2009). A model's performance may differ substantially between derivation and validation cohorts for several reasons, including over-fitting of the model, missing important predictor variables, and inter-observer variability of predictors leading to measurement errors (Altman et al., 2009). Therefore, model performance in the derivation dataset may be overly optimistic and is not a guarantee that the model will perform equally well in a new dataset. Much published prediction research focuses solely on model derivation, and validation studies are very scarce (Waijee et al., 2013b). Validation can be performed using internal and external validation. A common approach to internal validation is to split the data into two portions – a training set and a validation set. If splitting the dataset is not possible given the limited available data, measures such as cross-validation or bootstrapping can be used for internal validation (Steyerberg et al., 2010). However, internal validation nearly always yields optimistic results, given that the derivation and validation datasets are very similar (as they are from the same dataset). Although external validation is more difficult, as it requires data collected from similar sources in a different setting or a different location, it is usually preferred to internal validation (Steyerberg et al., 2001). When a validation study shows disappointing results, researchers are often tempted to reject the initial model and develop a new predictive model using the validation dataset.
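The split-sample and cross-validation strategies described above can be sketched in a few lines of scikit-learn; X and y stand for the hypothetical feature matrix and outcome labels from the earlier sketch.

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    # Split-sample internal validation: derive the model on one portion of
    # the data and assess it on the held-out portion.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("Held-out accuracy:", model.score(X_test, y_test))

    # With limited data, k-fold cross-validation reuses every record for
    # both derivation and validation instead of a single split.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
    print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))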
  • 36. 36 2.2.4 Assessing the performance of a predictive model When assessing model performance, it is important to remember that explanatory models are judged based on the strength of associations, whereas predictive models are judged solely on their ability to make accurate predictions. The performance of a predictive model is assessed using several complementary tests, which assess overall performance, calibration, discrimination, and reclassification (Steyerberg et al., 2010) (Table 2.6). Performance characteristics should be determined and reported for both the derivation and validation datasets. The overall model performance can be measured using R², which characterizes the degree of variation in risk explained by the model (Gerds et al., 2008). The adjusted R² has been proposed as a better measure, as it accounts for the number of predictors and helps prevent over-fitting. Brier scores are similar measures of performance, which are used when the outcome of interest is categorical instead of continuous (Czado et al., 2009). Calibration is the difference between observed and predicted event rates for groups of data and is assessed using the Hosmer-Lemeshow test (Hosmer et al., 1997). Discrimination is the ability of a model to distinguish between records which do and do not experience an outcome of interest, and it is commonly assessed using Receiver Operating Characteristics (ROC) curves (Hagerty et al., 2005). However, ROC analysis alone is relatively insensitive for assessing differences between good predictive models (Cook, 2007); therefore, several relatively novel performance measures have been proposed. The net reclassification improvement and integrated discrimination improvement are measures used to assess changes in predicted outcome classification between two models (Pencina et al., 2012).
  • 37. 37 Table 2.6: Performance characteristics for a predictive model (measures of predictive error)
Aspect | Measure | Outcome | Measure Description
Overall Performance | R² | Continuous | Average squared difference between predicted and observed outcome
Overall Performance | Adjusted R² | Continuous | Same as R², but penalizes for the number of predictors
Overall Performance | Brier Score | Categorical | Average squared distance between the predicted and the observed outcomes
Discrimination | ROC curve (c statistic); C-index | Continuous or categorical; Cox model | Overall measure of how effectively the model differentiates between events and non-events
Calibration | Hosmer-Lemeshow test | Categorical | Agreement between predicted and observed risks
Reclassification | Reclassification table | Categorical(a) | Number of records that move from one category to another by improving the prediction model
Reclassification | NRI | Categorical(a) | A quantitative assessment of the improvement in classification by improving the prediction model
Reclassification | IDI | Categorical(a) | Similar to NRI, but using all possible cutoffs to categorize events and non-events
IDI, integrated discrimination index; NRI, net reclassification index.
(a) Can be performed for continuous data as well if a risk cutoff is assigned.
(Source: Waijee et al., 2013b)
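The discrimination measures in Table 2.6, together with the validation metrics listed in section 1.4 (confusion matrix, recall, precision, accuracy and ROC AUC), can be computed directly; a sketch assuming the fitted classifier and held-out split from the previous examples:

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score, roc_auc_score)

    y_pred = model.predict(X_test)              # hard class labels
    y_prob = model.predict_proba(X_test)[:, 1]  # predicted probabilities

    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("ROC AUC:", roc_auc_score(y_test, y_prob))  # discrimination (c statistic)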
2.3 Machine Learning

Machine learning (ML) is a branch of artificial intelligence that allows computers to learn from past examples using statistical and optimization techniques (Quinlan, 1986; Cruz and Wishart, 2006). There are several applications for machine learning, the most significant of which is predictive modeling (Dimitologlou et al., 2012). Every instance (record, or set of fields/attributes) in any dataset used by machine learning algorithms is represented using the same set of features (attributes/independent variables). The features may be continuous, categorical or binary. If the instances are given with known labels (the corresponding target outputs), then the learning is called supervised, in contrast to unsupervised learning, where instances are unlabeled (Ashraf et al., 2013).

Supervised classification is one of the tasks most frequently carried out by so-called intelligent systems. Thus, a large number of techniques have been developed based on Artificial Intelligence (logic-based techniques, perceptron-based techniques) and Statistics (Bayesian networks, instance-based techniques). The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to testing instances where the values of the predictor features are known but the value of the class label is unknown (Gauda and Chahar, 2013). There are two variations of supervised learning, illustrated in the sketch below:
a. Regression (or prediction/forecasting) – the class label is represented by a continuous variable (e.g. a real number); and
b. Classification – the class label is represented by discrete values (e.g. categorical or nominal).
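The distinction can be made concrete with a short sketch: the same synthetic feature matrix is fitted once with a continuous label (regression) and once with a discrete label (classification); all data and names here are illustrative:

```python
# Illustrative sketch: the same feature matrix served to a regressor
# (continuous class label) and a classifier (discrete class label).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y_continuous = 2.0 * X[:, 0] + rng.normal(size=100)   # regression target
y_discrete = (y_continuous > 0).astype(int)           # classification target

reg = DecisionTreeRegressor(max_depth=3).fit(X, y_continuous)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_discrete)
print(reg.predict(X[:2]), clf.predict(X[:2]))
```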
Unsupervised machine learning algorithms perform learning tasks used to infer a function that describes hidden structure in unlabeled data – data without a target class (Sebastiani, 2002). The goal of unsupervised machine learning is to identify examples that belong to the same groups/clusters based on underlying characteristics common to the attributes of members of the same cluster or group (Zamir et al., 1997; Jain et al., 1999; Zhao and Karypis, 2002). The only things that unsupervised learning methods have to work with are the observed input patterns $x_i$, which are often assumed to be independent samples from an underlying unknown probability distribution $P(x)$, and some explicit or implicit a priori information as to what is important. Examples of unsupervised machine learning algorithms include:
a. Clustering;
b. Maximum likelihood estimation;
c. Feature selection; and
d. Association rule learning (Becker and Plumbley, 1996).

2.3.1 Supervised machine learning algorithms

Supervised learning entails a mapping between a set of input variables $X_j$ (features/attributes) and an output variable $Y_j$ (where $j$ indexes the records/CML cases), and the application of this mapping to predict the outputs for unseen data (data containing values for $X_j$ but no $Y_j$). Supervised machine learning is the most commonly used machine learning technique in engineering and medicine. In the supervised machine learning paradigm, the goal is to infer a function $f$:

$$ f: X \rightarrow Y \qquad (2.1) $$
This function $f$ is the model inferred by the supervised ML algorithm from sample data, or a training set, composed of pairs of inputs ($x_i$) and outputs ($y_i$) such that $x_i \in X$ and $y_i \in Y$:

$$ D = \{(x_i, y_i)\}_{i=1}^{n} \qquad (2.2) $$

Typically, for regression problems, $X \subseteq \mathbb{R}^d$ (where $d$ is the dimension, or number of features, of the vector $x$) and $Y \subseteq \mathbb{R}$; for classification problems the outputs $y_i$ are discrete, while for binary classification $Y = \{-1, +1\}$. In the statistical learning framework, the first fundamental hypothesis is that the training data are independently and identically generated from an unknown but fixed joint probability distribution function $P(X, Y)$. The goal of the learning algorithm is to find a function $f$ attempting to model the dependency encoded in $P(X, Y)$ between the input $X$ and the output $Y$. $\mathcal{F}$ will denote the set of functions in which the solution $f$ is sought, with $\mathcal{F}$ a subset of the set of all possible functions.

The second fundamental concept is the notion of error or loss, used to measure the agreement between the prediction $f(X)$ and the desired output $Y$. A loss (or cost) function $L$ is introduced to evaluate this error (see equation 2.3):

$$ L: Y \times Y \rightarrow \mathbb{R}^{+} \qquad (2.3) $$

The choice of the loss function $L(f(X), Y)$ depends on the learning problem being solved. Loss functions are classified according to their regularity or singularity properties and according to their ability to produce convex or non-convex criteria for optimization. In the case of pattern recognition, where $Y = \{-1, +1\}$, a common choice for $L$ is the misclassification error, which is measured as follows:
$$ L(f(X), Y) = \frac{1}{2}\,|f(X) - Y| \qquad (2.4) $$

This cost is singular and symmetric. Practical algorithmic considerations may bias the choice of $L$; for instance, singular functions may be selected for their ability to provide sparse solutions. For unsupervised learning the problem may be expressed in a similar way, using loss functions that do not involve a target output, of the kinds defined in equations (2.5) and (2.6):

$$ L(f(X)) = -\log f(X) \quad \text{(density estimation)} \qquad (2.5) $$

$$ L(f(X), X) = \|X - f(X)\|^2 \quad \text{(reconstruction error)} \qquad (2.6) $$

The loss function $L$ leads to the definition of the risk for a function $f$, also called the generalization error:

$$ R(f) = \int L(f(x), y)\, dP(x, y) \qquad (2.7) $$

In classification, the objective could be to find the function $f$ in $\mathcal{F}$ that minimizes $R(f)$. Unfortunately, this is not possible because the joint probability $P(x, y)$ is unknown. From a probabilistic point of view, using the input and output random variable notations $X$ and $Y$, the risk can be expressed as in equation (2.8), which can be rewritten as two nested expectations:

$$ R(f) = \mathbb{E}\left[ L(f(X), Y) \right] \qquad (2.8) $$

$$ R(f) = \mathbb{E}_X \left[ \mathbb{E}_{Y|X} \left[ L(f(X), Y) \mid X \right] \right] \qquad (2.9) $$

The expression in equation (2.9) offers the opportunity to separately minimize $\mathbb{E}_{Y|X}[L(f(X), Y) \mid X]$ with respect to the scalar value $f(x)$. The resulting function is the Bayes estimator associated with the risk $R$. The learning problem is expressed as a minimization of $R$ for any classifier $f$. As the joint probability is unknown, the solution is inferred from the available training set $D = \{(x_i, y_i)\}_{i=1}^{n}$.
There are two ways to address the problem. The first approach, called generative-based, tries to approximate the joint probability $P(X, Y)$, or $P(Y|X)P(X)$, and then compute the Bayes estimator with the obtained probability. The second approach, called discriminative-based, attacks the estimation of the risk $R(f)$ head on (Liaw and Weiner, 2012). Following is a description of some of the most popular and effective supervised machine learning algorithms.

a. Decision Trees (DT)

Decision tree learning uses a decision tree as a predictive model which maps observations about the relevant indicators for CML survival ($X_{ij}$) in order to reach a conclusion about the target value – the patient's survival class (survived or not survived). Decision trees can be either classification or regression trees; for this study, classification trees were adopted, which can be used as input for decision making by describing the data with a top-down tree (Quinlan, 1986; Breiman et al., 1984). Each interior node (starting from the root/parent node) of the tree represents an attribute (a feature relevant to CML survival), with edges that correspond to the values/labels of the attribute leading to a child node below; the process continues for each subsequent value until a leaf is reached – the terminal node representing the target class (the class of CML survival) alongside the probability distribution over the classes (Friedman, 1999). Such decision tree algorithms include ID3 (Iterative Dichotomiser 3), C4.5 (an extension of ID3), CART (Classification and Regression Trees), CHAID (Chi-squared Automatic Interaction Detector), MARS, etc. In this study, the C4.5 decision tree algorithm was considered.
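As a hedged illustration, the sketch below induces a classification tree with an entropy (information-gain) split criterion in scikit-learn; note that scikit-learn grows CART-style binary trees, so this only approximates C4.5, and the survival indicators are synthetic placeholders:

```python
# Sketch of decision-tree induction with an information-gain (entropy)
# split criterion. scikit-learn implements CART-style binary trees, so this
# only approximates C4.5; the survival data here are invented placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(150, 4))            # 4 categorical-coded indicators
y = (X[:, 0] + X[:, 1] > 2).astype(int)          # 1 = survived, 0 = not survived

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=[f"indicator_{k}" for k in range(4)]))
```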
The tree is learned by splitting the training dataset into subsets based on an attribute value test for each input variable; the process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is complete when the subset at a node has all the same values of the target class, or when splitting no longer adds value to the predictions. This is also called top-down induction of trees (Rokach and Maimon, 2008), an example of a greedy (divide-and-conquer) algorithm. When constructing the tree, different decision tree algorithms use different metrics for choosing the best attribute on which to split, and these metrics generally measure the homogeneity of the target class (survival of CML) within the subsets induced by the selected attributes (relevant indicators for CML survival) (Rokach and Maimon, 2005). Some such metrics include Gini impurity, information gain, and variance reduction; the C4.5 decision tree algorithm uses the information gain evaluation metric. The information gain criterion is defined by equation (2.10). If S is the training dataset and $a_i$ is an attribute predictive of CML survival in patients, with values $v \in Values(a_i)$ used for partitioning S, then:

$$ Gain(S, a_i) = Entropy(S) - \sum_{v \in Values(a_i)} \frac{|S_v|}{|S|}\, Entropy(S_v) \qquad (2.10) $$

Where:

$$ Entropy(S) = - \sum_{k} \frac{|S_k|}{|S|} \log_2 \frac{|S_k|}{|S|} $$

Here $S_v$ is the subset of S for which attribute $a_i$ takes the value $v$, and $S_k$ is the subset of S belonging to class $k$.

b. Support Vector Machines (SVM)

Support vector machines (SVMs), also called support vector networks, are supervised learning models with associated learning algorithms that analyze data and recognize patterns (Cortes and Vapnik, 1995).
Consider a training dataset consisting of CML survival indicators representing the input vectors $x_i$, and the CML survival class of each patient representing the targets $y_i$, each taking one of two categories; SVM attempts to build a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In formal terms, SVM constructs a hyperplane, or set of hyperplanes, in a high-dimensional space, which can be applied to classification, regression or other tasks. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class – the support vectors – since in general the larger the margin, the lower the generalization error of the classifier. SVMs are used to develop both classification and regression models, and extensions such as one-class SVMs address unsupervised problems.

In order to calculate the margin between data belonging to the two different classes, two parallel hyperplanes (the blue lines in Figure 2.6) are constructed, one on either side of the separating hyperplane (the solid black line), which are pushed up against the two datasets (the corresponding survived and not-survived datasets). A good separation is achieved by the hyperplane that has the largest distance to the neighbouring data points of both classes, since in general the larger the margin, the lower the generalization error of the SVM classifier. The parameters of the maximum-margin hyperplane are derived by solving large quadratic programming (QP) optimization problems. There exist several specialized algorithms for quickly solving the QP problems that arise from SVMs, mostly relying on heuristics for breaking the problem down into smaller chunks.
This study implements John Platt's (Platt, 1998) sequential minimal optimization (SMO) algorithm for training the support vector classifier. SMO works by breaking the large QP problem into a series of smaller two-dimensional sub-problems. This study implements SMO using the algorithm available in the Weka public-domain software. This implementation globally replaces all missing values, transforms nominal attributes into binary values, and by default normalizes all data.

Considering the use of a linear support vector classifier as shown in Figure 2.6, it is assumed that both classes are linearly separable. The training data containing the information about each CML patient, described by the relevant features (risk indicators for CML survival), are expressed as $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$, while the target class is represented by $y_i \in \{-1, +1\}$. The hyperplane can be defined by $w \cdot x + b = 0$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$. Since the classes are linearly separable, a hyperplane satisfying the following can be determined:

$$ y_i\,(w \cdot x_i + b) \geq 1, \qquad i = 1, \ldots, n \qquad (2.11) $$

The decision function may be expressed as $f(x) = \operatorname{sign}(w \cdot x + b)$, with:

$$ f(x) = \begin{cases} +1 & \text{if } w \cdot x + b \geq 0 \\ -1 & \text{otherwise} \end{cases} \qquad (2.12) $$

The SVM classification method aims at finding the optimal hyperplane based on the maximization of the margin between the training data of both classes. The distance between a point $x$ and the hyperplane is $|w \cdot x + b| / \|w\|$, and it is straightforward to show that the optimization problem is the following:

$$ \min_{w,\, b} \; \frac{1}{2}\,\|w\|^2 \qquad \text{subject to } y_i\,(w \cdot x_i + b) \geq 1, \; i = 1, \ldots, n \qquad (2.13) $$
Figure 2.6: Description of the linear SVM classifier
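For illustration, the sketch below trains a linear support vector classifier on synthetic two-class data; scikit-learn's SVC is backed by LIBSVM, which uses an SMO-type decomposition method broadly comparable to the Weka SMO implementation adopted in this study. The data and class names are placeholders:

```python
# Sketch of a linear support-vector classifier on synthetic two-class data.
# LIBSVM (behind sklearn.svm.SVC) trains with an SMO-type decomposition.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 0.5, size=(50, 2)),   # class "not survived"
               rng.normal(+1, 0.5, size=(50, 2))])  # class "survived"
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", svm.n_support_)  # points defining the margin
print("w =", svm.coef_[0], "b =", svm.intercept_[0]) # the separating hyperplane
```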
c. Artificial Neural Network (ANN) – Multi-layer Perceptron (MLP)

An artificial neural network (ANN) is an interconnected group of nodes, akin to the vast network of neurons in a human brain. In machine learning and cognitive science, ANNs are a family of statistical learning models inspired by biological neural networks and are used to estimate or approximate functions that depend on a large number of inputs and are generally unknown (McCulloch and Pitts, 1943). ANNs are generally presented as systems of interconnected neurons which send messages to each other, such that each connection has a numeric weight that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning. The word network refers to the inter-connections between the neurons in the different layers of each system. The first layer has input neurons which send data via synapses to the middle layer of neurons, and then via more synapses to the third layer of output neurons. The synapses store parameters called weights that manipulate the data in the calculations. An ANN is typically defined by three (3) types of parameters, namely:
i. The interconnection pattern between the different layers of neurons;
ii. The learning process for updating the weights of the interconnections; and
iii. The activation function that converts a neuron's weighted input to its output activation.

The simplest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. In this way it can be considered the simplest kind of feed-forward network. The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically -1).
A perceptron can be created using any values for the activated and deactivated states, as long as the threshold value lies between the two. Perceptrons can be trained by a simple learning algorithm usually called the delta rule. It calculates the errors between the calculated output and the sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent.

Multi-layer networks use a variety of learning techniques, the most popular being back-propagation. Here, the output values are compared with the correct answer to compute the value of some predefined error function. By various techniques, the error is then fed back through the network. Using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount. After repeating this process for a sufficiently large number of training cycles, the network will usually converge to some state where the error of the calculations is small; in this case, one would say that the network has learned a certain target function. To adjust weights properly, one applies a general method for non-linear optimization called gradient descent: the derivative of the error function with respect to the network weights is calculated, and the weights are then changed such that the error decreases (thus going downhill on the surface of the error function). For this reason, back-propagation can only be applied to networks with differentiable activation functions.

Back-propagation, an abbreviation for backward propagation of errors, is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network.
The gradient is fed to the optimization method, which in turn uses it to update the weights in an attempt to minimize the loss function. Back-propagation is a generalization of the delta rule to multi-layered feed-forward networks, made possible by using the chain rule to iteratively compute gradients for each layer; it requires that the activation function used by the artificial neurons be differentiable. The back-propagation learning algorithm can be divided into two phases: propagation and weight update.
a. Phase 1 – Propagation: each propagation involves the following steps:
i. Forward propagation of a training pattern's input through the neural network, in order to generate the propagation's output activations; and
ii. Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas of all output and hidden neurons.
b. Phase 2 – Weight update: for each weight-synapse:
i. Multiply its output delta and input activation to get the gradient of the weight; and
ii. Subtract a ratio (percentage) of the gradient from the weight.

Assume the input neurons are represented by variables $X_1, X_2, \ldots, X_n$, where $n$ is the number of variables (input neurons). The effect of the synaptic weights $w_{ij}$ on each neuron $j$ of the next layer is represented by the weighted sum:

$$ net_j = \sum_{i=1}^{n} w_{ij} X_i \qquad (2.14) $$

Equation (2.14) is passed to the activation function (the sigmoid/logistic function), which is applied in order to limit the output to a bounded range; thus:

$$ O_j = \varphi(net_j) = \frac{1}{1 + e^{-net_j}} \qquad (2.15) $$
The measure of discrepancy between the expected output ($p$) and the actual output ($y$) is made using the squared error measure ($E$):

$$ E = (p - y)^2 $$

Recall, however, that the output of a neuron depends on the weighted sum of all its inputs, as indicated in equation (2.14); this implies that the error $E$ also depends on the incoming weights of the neuron, which need to be changed in the network to enable learning. The back-propagation algorithm aims to find the set of weights that minimizes the error. In this study, the gradient descent algorithm is applied in order to minimize the error and hence find the optimal weights that satisfy the problem. Since back-propagation uses the gradient descent method, there is a need to calculate the derivative of the squared error function with respect to the weights of the network. Hence, the squared error function is now redefined as (the ½ is required to cancel the exponent of 2 when differentiating):

$$ E = \frac{1}{2}\,(p - y)^2 \qquad (2.19) $$

For each neuron $j$, its output $O_j$ is defined as:

$$ O_j = \varphi(net_j) = \varphi\left( \sum_{i=1}^{n} w_{ij} O_i \right) \qquad (2.20) $$

The input $net_j$ to a neuron is the weighted sum of the outputs $O_i$ of the previous neurons. The number of input neurons is $n$, and the variable $w_{ij}$ denotes the weight between neurons $i$ and $j$. The activation function $\varphi$ is in general non-linear and differentiable; thus, the derivative of equation (2.15) is:

$$ \frac{d\varphi}{dz}(z) = \varphi(z)\,\big(1 - \varphi(z)\big) \qquad (2.21) $$
The partial derivative of the error ($E$) with respect to a weight $w_{ij}$ is obtained by applying the chain rule twice, as follows:

$$ \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial O_j}\,\frac{\partial O_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ij}} \qquad (2.22) $$

The last factor on the right-hand side can be calculated from equation (2.20); thus:

$$ \frac{\partial net_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left( \sum_{k=1}^{n} w_{kj} O_k \right) = O_i \qquad (2.23) $$

The derivative of the output of neuron $j$ with respect to its input is the partial derivative of the activation function (the logistic function) shown in equation (2.21):

$$ \frac{\partial O_j}{\partial net_j} = \varphi(net_j)\,\big(1 - \varphi(net_j)\big) = O_j\,(1 - O_j) \qquad (2.24) $$

The first factor is evaluated by differentiating the error function in equation (2.19) with respect to $y$; so if $y$ is in the outer layer, such that $y = O_j$, then:

$$ \frac{\partial E}{\partial O_j} = \frac{\partial E}{\partial y} = y - p \qquad (2.25) $$

However, if $j$ is in an arbitrary inner layer of the network, finding the derivative of $E$ with respect to $O_j$ is less obvious. Considering $E$ as a function of the inputs of all neurons $l$ receiving input from neuron $j$, and taking the total derivative with respect to $O_j$, a recursive expression for the derivative is obtained:

$$ \frac{\partial E}{\partial O_j} = \sum_{l} \left( \frac{\partial E}{\partial net_l}\,\frac{\partial net_l}{\partial O_j} \right) = \sum_{l} \left( \frac{\partial E}{\partial O_l}\,\frac{\partial O_l}{\partial net_l}\, w_{jl} \right) \qquad (2.26) $$

Thus, the derivative with respect to $O_j$ can be calculated if all the derivatives with respect to the outputs of the next layer – the one closer to the output neuron – are known. Putting them all together:

$$ \frac{\partial E}{\partial w_{ij}} = \delta_j\, O_i \qquad (2.27) $$
With:

$$ \delta_j = \frac{\partial E}{\partial O_j}\,\frac{\partial O_j}{\partial net_j} = \begin{cases} (O_j - p)\, O_j\,(1 - O_j) & \text{if } j \text{ is an output neuron} \\ \left( \sum_{l} \delta_l\, w_{jl} \right) O_j\,(1 - O_j) & \text{if } j \text{ is an inner neuron} \end{cases} $$

Therefore, in order to update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta$. The change in weight, which is added to the old weight, is equal to the product of the learning rate and the gradient, multiplied by $-1$:

$$ \Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} = -\eta\, \delta_j\, O_i \qquad (2.28) $$

Equation (2.28) is used by the back-propagation algorithm to adjust the values of the synaptic weights attached to the inputs of each neuron in equation (2.14), including those in the inner layers of the multi-layer perceptron classifier.

2.3.2 General issues of supervised machine learning algorithms

The first step is collecting the dataset required for developing the predictive model. If a requisite expert is available, he/she suggests which fields (attributes/features) are the most informative. If not, then the simplest method is brute force, which measures everything available in the hope that the right (informative, relevant but not redundant) features can be isolated. However, a dataset collected by the brute-force method is not directly suitable for induction; in most cases it contains noise and missing feature values, and therefore requires significant pre-processing (Zhang et al., 2002). For this reason, methods suitable for removing noise and handling missing values are important before deciding on the use of the identified variables for developing predictive models using supervised machine learning algorithms.
a. Data preparation and data pre-processing

There is a hierarchy of problems that are often encountered in data preparation and pre-processing, which includes:
i. Impossible input values;
ii. Unlikely input values;
iii. Missing input values; and
iv. Presence of irrelevant input features in the data.

Impossible values should be detected by the data-handling software, ideally at the point of input so that they can be re-entered. These errors are generally straightforward, such as coming across negative values when positive values are expected. If correct values cannot be entered, the problem is converted into the missing-value category by removing the data. Variable-by-variable data cleansing is a filter approach for unlikely values (values that are suspicious given a specific probability distribution: for a distribution with a mean of 5 and a standard deviation of 3, say, a value of 10 is suspicious). Table 2.7 shows examples of how such metadata can help in detecting a number of possible data quality problems.

The process of selecting instances makes it possible to cope with the infeasibility of learning from very large datasets. Selection of instances from the original dataset is an optimization problem that maintains the mining quality while minimizing the sample size (Liu and Motoda, 2001). It reduces data and enables a machine learning algorithm to function and work effectively with very large datasets.
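A minimal pandas sketch of the variable-by-variable cleansing idea, under assumed column names and permissible ranges (both illustrative), converts impossible values to missing and flags unlikely ones:

```python
# Sketch of variable-by-variable cleansing with pandas: impossible values
# are converted to missing (NaN) so they fall into the missing-value
# category described above. Column names and ranges are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [38.0, -4.0, 61.0, 250.0],
                   "gender": ["M", "F", "F", "X"]})

df.loc[~df["age"].between(0, 120), "age"] = np.nan        # impossible ages
df.loc[~df["gender"].isin(["M", "F"]), "gender"] = np.nan  # illegal codes

# Unlikely-value screen: flag points far from the column mean
z = (df["age"] - df["age"].mean()) / df["age"].std()
print(df.assign(age_suspicious=z.abs() > 3))
```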
Table 2.7: Examples for the use of variable-by-variable data cleansing

Problems | Metadata | Examples/Heuristics
Illegal values | Cardinality; max, min; variance, deviation | e.g., cardinality(gender) > 2 indicates a problem. Max and min should not be outside the permissible range. Variance and deviation of statistical values should not be higher than a threshold.
Misspellings | Feature values | Sorting on values often brings misspelled values next to correct values.

(Source: Kotsiantis et al., 2006)
There are a variety of procedures for sampling instances from a large dataset. The most well-known are:
i. Random sampling, which selects a subset of instances randomly.
ii. Stratified sampling, which is applicable when the class values are not uniformly distributed in the training sets. Instances of the minority class(es) are selected with greater frequency in order to even out the distribution.

Incomplete data is an unavoidable problem in dealing with most real-world data sources. Generally, there are some important factors to be taken into account when processing unknown feature values. One of the most important is the source of unknown-ness:
i. A value is missing because it was forgotten or lost;
ii. A certain feature is not applicable for a given instance (e.g., it does not exist for the instance); and
iii. For a given observation, the designer of the training set does not care about the value of a certain feature (so-called don't-care values).

Depending on the circumstances, there are a number of methods to choose from to handle missing data (Batista and Monard, 2003), two of which are illustrated in the sketch following this list:
i. Method of ignoring instances with unknown feature values: this, the simplest method, involves ignoring any instances (records) which have at least one unknown feature value.
ii. Most common feature value: the value of the feature that occurs most often is selected for all the unknown values of the feature.
iii. Most common feature value in class: in this case, the value of the feature which occurs most commonly within the same class is selected for all the unknown values of the feature.
iv. Mean substitution: the mean value (computed from available cases) is used to fill in missing data values in the remaining cases. A more sophisticated solution than using the general feature mean is to use the feature mean of all samples belonging to the same class to fill in the missing value.
v. Regression or classification methods: a regression or classification model based on the complete-case data for a given feature is developed. This model treats the feature as the outcome and uses the other features as predictors.
vi. Hot-deck imputing: the most similar case to the case with a missing value is identified, and that similar case's value is substituted for the missing value.
vii. Method of treating missing feature values as special values: "unknown" itself is treated as a new value for the feature that contains missing values.
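A short sketch of two of the methods above (most common value, and class-conditional mean substitution), using pandas on an invented toy table; the column names and values are illustrative only:

```python
# Sketch of two common imputation strategies from the list above: most
# common value (mode) for a categorical feature, and class-conditional
# mean substitution for a numeric one. Data are invented placeholders.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "wbc":   [12.0, np.nan, 9.5, 30.0, np.nan, 11.2],
    "stage": ["chronic", "chronic", None, "blast", "chronic", "blast"],
    "class": [1, 1, 1, 0, 0, 0],   # 1 = survived, 0 = not survived
})

df["stage"] = df["stage"].fillna(df["stage"].mode()[0])   # most common value
df["wbc"] = df.groupby("class")["wbc"].transform(         # mean within class
    lambda s: s.fillna(s.mean()))
print(df)
```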
b. Feature selection

This is the process of identifying and removing as many irrelevant and redundant features as possible (Yu and Liu, 2004). It reduces the dimensionality of the data and enables data mining algorithms to operate faster and more effectively. Generally, features are characterized as:
i. Relevant: features that have an influence on the target class (output); their role cannot be assumed by the rest.
ii. Irrelevant: features that do not have any influence on the target class; their values could be generated at random without influencing the output.
iii. Redundant: features that can take the role of another (perhaps the simplest way to incur model redundancy).

Feature selection algorithms in general have two (2) components:
i. A selection algorithm that generates proposed subsets of features and attempts to find an optimal subset; and
ii. An evaluation algorithm that determines how good a proposed feature subset is.

However, without a suitable stopping criterion, the feature selection process may run repeatedly through the space of subsets, taking up valuable computational time. The stopping criterion might be whether:
i. the addition (or deletion) of any feature no longer produces a better subset; or
ii. an optimal subset according to some evaluation function has been obtained.

The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by constructing new features from the basic feature set (Markovitch and Rosenstein, 2002), a technique called feature construction/transformation. These newly generated features may lead to the creation of more concise and accurate classifiers. In addition, the discovery of meaningful features contributes to better comprehensibility of the produced classifier, and a better understanding of the learned concept.
c. Algorithm selection

The choice of which specific learning algorithm to use is a critical step. A classifier's evaluation is most often based on prediction accuracy (the percentage of correct predictions divided by the total number of predictions). There are at least three techniques used to calculate a classifier's accuracy (Waijee et al., 2013b):
i. One technique is to split the training set, using two-thirds (about 67% of the total cases) for training and one-third for estimating performance (testing).
ii. In another technique, known as cross-validation, the training set is divided into mutually exclusive and equal-sized subsets, and for each subset the classifier is trained on the union of all the other subsets. The average of the error rates over the subsets is therefore an estimate of the error rate of the classifier.
iii. Leave-one-out validation is a special case of cross-validation, in which every test subset consists of a single instance. This type of validation is, of course, more expensive computationally, but useful when the most accurate estimate of a classifier's error is required.

If the error rate is unsatisfactory, a variety of factors must be examined:
i. perhaps relevant features of the problem are not being used;
ii. a larger training set is needed;
iii. the dimensionality of the problem is too high; and/or
iv. the selected algorithm is inappropriate, or parameter tuning is needed.

A common method for comparing supervised ML algorithms is to perform statistical comparisons of the accuracies of trained classifiers on a specific dataset; several heuristic versions of the t-test have been developed to handle this issue (Dietterich, 1998; Nadeau and Bengio, 2003), as sketched below.
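By way of illustration, the sketch below compares the fold-by-fold cross-validated accuracies of two classifiers with a paired t-test; it uses scipy's standard paired test rather than the corrected variants proposed by Dietterich (1998) and Nadeau and Bengio (2003), so it is a simplified stand-in, and the data are synthetic:

```python
# Sketch of comparing two classifiers' fold-by-fold accuracies with a
# paired t-test (illustrative; the corrected tests from the literature
# adjust for the dependence between folds, which this does not).
import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

acc_tree = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
acc_svm = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
t, p = ttest_rel(acc_tree, acc_svm)
print(f"tree {acc_tree.mean():.3f} vs svm {acc_svm.mean():.3f}, p = {p:.3f}")
```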
2.3.3 Machine learning for cancer prediction and prognosis

According to a literature survey on the application of machine learning to healthcare data by Cruz and Wishart (2006), machine learning is not new to cancer research. In fact, artificial neural networks (ANNs) and decision trees (DTs) have been used in cancer detection and diagnosis for nearly 30 years (Circchetti, 1992; Simes, 1985; Machin et al., 1991), from the detection and classification of tumors via X-rays and CRT images (Pertricoin, 2004; Bocchi et al., 2004) to the classification of malignancies from proteomic and genomic (microarray) assays (Zhon et al., 2004; Wang et al., 2005). The fundamental goals of cancer prediction and prognosis are distinct from the goals of cancer detection and diagnosis. In cancer prediction/prognosis one is concerned with three predictive foci, namely:
a. The prediction of cancer susceptibility (i.e. risk assessment): an attempt to predict the likelihood of developing a type of cancer prior to the occurrence of the disease;
b. The prediction of cancer recurrence: the prediction of the likelihood of redeveloping cancer after the apparent resolution of the disease; and
c. The prediction of cancer survivability: the prediction of an outcome (life expectancy, survivability, progression, tumor-drug sensitivity) after the diagnosis of the disease.

In the latter two situations, the success of the prognostic prediction obviously depends, in part, on the success or quality of the diagnosis performed. However, a disease prognosis can only come after a medical diagnosis, and a prognostic prediction must take into account more than just a simple diagnosis (Hagerty et al., 2005).
Indeed, a cancer prognosis typically involves multiple physicians from different specialties using different subsets of biomarkers and multiple clinical factors, including the age and general health of the patient, the location and type of cancer, as well as the grade and size of the tumor (Fielding et al., 1992; Cochran et al., 1997; Burke et al., 2005). Histological (cell-based), clinical (patient-based) and demographic (population-based) information must be carefully integrated by the attending physician to come up with a reasonable prognosis. Even for the most skilled clinician, this is not an easy job to do. Similar challenges exist for both physicians and patients alike when it comes to the issues of cancer prevention and cancer susceptibility prediction. Family history, age, diet, Body Mass Index (BMI), high-risk habits (like smoking and drinking) and exposure to environmental carcinogens (UV radiation, radon and asbestos) all play an important role in predicting an individual's risk of developing cancer (Bach et al., 2003; Gascon et al., 2004; Domchek et al., 2003).

In the past, the dependence of clinicians and physicians on macro-scale information (tumor, patient, population and environmental data) generally kept the number of variables small enough that standard statistical methods, or even a physician's own intuition, could be used to predict cancer risks and outcomes. However, with today's high-throughput diagnostic and imaging technologies, physicians are now faced with dozens or even hundreds of molecular, cellular and clinical parameters. In these situations, human intuition and standard statistics do not generally work efficiently; rather, there is a reliance on non-traditional and intensively computational approaches such as machine learning (ML). The use of computers (and machine learning) in disease prediction and prognosis is part of a growing trend towards personalized, predictive medicine (Weston and Hood, 2004).

Machine learning, like statistics, is used to analyze and interpret data.
Unlike statistics, though, machine learning methods can employ Boolean logic (AND, OR, NOT), absolute conditionality (IF, THEN, ELSE), conditional probabilities (the probability of X given Y) and unconventional optimization strategies to model data or classify patterns. These latter methods actually resemble the approaches humans typically use to learn and classify. Although machine learning draws heavily from statistics and probability, it is still fundamentally more powerful because it allows inferences or decisions to be made that could not otherwise be made using conventional statistical methodologies (Mitchell, 1997; Duda et al., 2001). Many statistical methods employ multivariate regression or correlation analysis, and these approaches assume that the variables are independent and that the data can be modeled using linear combinations of these variables. When the relationships are non-linear and the variables are interdependent (or conditionally dependent), conventional statistics usually flounders; it is in these situations that machine learning tends to shine. Many biological systems are fundamentally non-linear and their parameters conditionally dependent, whereas many simple physical systems are linear and their parameters essentially independent. Knowing which machine learning method is best for a given problem is not inherently obvious, which is why it is critically important to try more than one machine learning method on any given training set. Another common misunderstanding about ML is that the patterns an ML tool finds or the trends it detects are non-obvious or not intrinsically detectable. On the contrary, many patterns or trends could be detected by a human expert – if they looked hard enough at the dataset. Machine learning basically saves the time and effort needed to discover the pattern or to develop the classification scheme required.
2.4 Feature Selection for the Identification of Relevant Attributes

Feature selection (FS) is important in machine learning tasks because it can significantly improve performance by eliminating redundant and irrelevant features while at the same time speeding up the learning task (Yildirim, 2015). Given $N$ features, the FS problem is to find the optimal subset among the $2^N$ possible choices; this problem usually becomes intractable as $N$ increases. Feature subset selection is the process of identifying and removing as much irrelevant and redundant information as possible (Ashraf, 2013). This reduces the dimensionality of the data and may allow learning algorithms to operate faster and more effectively (Novakovic, 2011). In some cases, accuracy on future classification can be improved; in others, the result is a more compact, easily interpreted representation of the target concept. Therefore, the correct use of feature selection algorithms improves inductive learning, whether in terms of generalization capacity, learning speed, or reducing the complexity of the induced model (Kumar and Minz, 2014).

There are two major approaches to FS: the first is individual evaluation and the second is subset evaluation. In the former, each feature is ranked using a weight that measures its degree of relevance; in the latter, candidate subsets of features are constructed using a search strategy. A feature selection algorithm (FSA) is a computational solution motivated by a certain definition of relevance. However, the relevance of a feature (or a subset of features) – as seen from inductive learning perspectives – may have several definitions depending on the objective that is sought by the FS technique.
2.4.1 The relevance of a feature

The purpose of an FSA is to identify relevant features according to a definition of relevance. However, the notion of relevance in ML has not yet been rigorously defined by common agreement (Bell and Wang, 2000). Let $E_i$, with $1 \leq i \leq n$, be the domains of the features $X = \{x_1, x_2, \ldots, x_n\}$, and let the instance space be defined as $E = E_1 \times E_2 \times \cdots \times E_n$, where an instance is a point in this space. Consider $p$ a probability distribution on $E$ and $T$ a space of target labels. The motive is to model or identify an objective function $c: E \rightarrow T$ according to its relevant features. A dataset $S$ composed of $|S|$ instances can be seen as the result of sampling the instance space $E$ under the distribution $p$ a total of $|S|$ times and labeling its elements using the objective function $c$. The notion of relevance, according to a number of researchers, is defined as a relative relationship between the attributes and the objective function, the probability distribution, the sample, entropy, or incremental usefulness (Novakovic et al., 2011; Novakovic, 2009). Following are a number of definitions of the relevance of a feature.

a. Definition I (relevance with respect to an objective function, c): A feature $x_i$ is relevant to an objective $c$ if there exist two examples $A$ and $B$ in the instance space $E$ such that $A$ and $B$ differ only in their assignment to $x_i$ and $c(A) \neq c(B)$. In other words, there exist two instances that can only be distinguished by $x_i$. This definition has the inconvenience that the learning algorithm cannot necessarily determine whether a feature $x_i$ is relevant or not using only a sample $S$ of $E$ (Wang et al., 1998).
b. Definition II (strong relevance with respect to the sample, S): A feature $x_i$ is strongly relevant to the sample $S$ if there exist two examples $A, B \in S$ that differ only in their assignment to $x_i$ and $c(A) \neq c(B)$. The definition is the same as I, but now $A, B \in S$ and the definition is with respect to $S$ (Blum and Langley, 1997).

c. Definition III (strong relevance with respect to the distribution, p): A feature $x_i$ is strongly relevant to an objective $c$ in the distribution $p$ if there exist two examples $A, B$ with $p(A) > 0$ and $p(B) > 0$ that differ only in their assignment to $x_i$ and $c(A) \neq c(B)$. This definition is the natural extension of II and, contrary to it, the distribution $p$ is assumed to be known.

d. Definition IV (weak relevance with respect to the sample, S): A feature $x_i$ is weakly relevant to the sample $S$ if there exists at least one proper subset of features $X' \subset X$ containing $x_i$ for which $x_i$ is strongly relevant with respect to $S$. A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.

e. Definition V (weak relevance with respect to a distribution, p): A feature $x_i$ is weakly relevant to the objective $c$ in the distribution $p$ if there exists at least one proper subset of features $X' \subset X$ containing $x_i$ for which $x_i$ is strongly relevant with respect to $p$.

Instead of focusing on which features are relevant, it is possible to use relevance as a complexity measure with respect to the objective $c$. In this case, it will depend on the type of inducer used.
f. Definition VI (relevance as a complexity measure) (Blum and Langley, 1997): Given a data sample $S$ and an objective $c$, define $r(S, c)$ as the smallest number of features relevant to $c$ – using Definition I restricted to $S$ – such that the error in $S$ is the least possible for the inducer. It refers to the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling $c$ using $S$.

g. Definition VII (relevance as incremental usefulness) (Caruana and Freitag, 1994): Given a data sample $S$, a learning algorithm $L$, and a subset of features $X'$, the feature $x_i$ is incrementally useful to $L$ with respect to $X'$ if the accuracy of the hypothesis that $L$ produces using the group of features $X' \cup \{x_i\}$ is better than the accuracy reached using only the subset of features $X'$. This definition is especially natural in feature selection algorithms (FSAs) that search the feature space in an incremental way, adding or removing features to a current solution. It is also related to a traditional understanding of relevance in the philosophy literature.

h. Definition VIII (relevance as an entropic measure) (Wang et al., 1998): Denoting the (Shannon) entropy by $H(x)$ and the mutual information by $I(x; y) = H(x) - H(x|y)$ (the reduction of entropy in $x$ produced by knowledge of $y$), the entropic relevance of $x$ to $y$ is defined as $r(x; y) = I(x; y)/H(y)$. Let $X$ be the original set of features and let $C$ be the objective seen as a feature; a set $X' \subseteq X$ is sufficient if $I(X'; C) = I(X; C)$ (i.e. if it preserves the learning information). For a sufficient set $X'$, it turns out that $r(X'; C) = r(X; C)$. The most favourable set is the sufficient set for which $H(X')$ is smallest, which implies that $r(C; X')$ is greater. In short, the aim is to have $r(C; X')$ and $r(X'; C)$ jointly maximized.
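Definition VIII lends itself to a direct computation. The sketch below estimates the entropic relevance $r(x; y) = I(x; y)/H(y)$ for a discrete feature from empirical frequencies; the data are a contrived example in which $x$ fully determines $y$, so the relevance comes out as 1:

```python
# Sketch of Definition VIII's entropic relevance r(x; y) = I(x; y) / H(y)
# for discrete variables, computed from empirical frequencies with numpy.
import numpy as np

def entropy(labels):
    """Shannon entropy H of a discrete variable, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def entropic_relevance(x, y):
    """r(x; y) = I(x; y) / H(y), using I(x; y) = H(y) - H(y | x)."""
    h_y = entropy(y)
    h_y_given_x = sum(np.mean(x == v) * entropy(y[x == v])
                      for v in np.unique(x))
    return (h_y - h_y_given_x) / h_y

x = np.array([0, 0, 1, 1, 2, 2, 0, 1])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])
print("r(x; y) =", entropic_relevance(x, y))   # 1.0: x fully determines y
```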
2.4.2 Characteristics of feature selection algorithms

Feature selection algorithms (with a few notable exceptions) perform a search through the space of feature subsets and, as a consequence, must address four (4) basic issues affecting the nature of the search (Langley and Sage, 1994; Patil and Sane, 2014):

a. Starting point

Selecting the point in the feature subset space from which to begin the search can affect the direction of the search. One option is to begin with no features and successively add attributes; in this case, the search is said to proceed forward through the search space. Conversely, the search can begin with all features and successively remove them; in this case, the search proceeds backwards through the search space. Another alternative is to begin somewhere in between (in the middle) and move outwards from this point.

b. Search organization

An exhaustive search of the feature subspace is prohibitive for all but a small initial number of features: with $N$ initial features, there exist $2^N$ possible subsets of features. Heuristic search strategies are more feasible than exhaustive search methods and can still give good results, although they do not guarantee finding the optimal subset (Hall et al., 2009). A number of search methods are highlighted as follows:

 BestFirst: searches the space of attribute subsets by greedy hill climbing in combination with backtracking; backtracking occurs once a set number of consecutive expanded nodes fails to improve performance. It may apply the forward approach, starting from an empty set of attributes and adding attributes one at a time. It may also go for the backward approach, starting from the set of all attributes and removing them one by one.
It may also adopt a midway between both approaches, searching in both directions (by considering all possible single-attribute additions and deletions at a given point), which is also called the hybrid approach (Maji and Garai, 2013).

 GreedyStepwise: performs a greedy forward or backward search through the space of attribute subsets. It may start with no attributes, all attributes, or an arbitrary point in the space, and stops when the addition/deletion of any remaining attribute results in a decrease in evaluation. It can also produce a ranked list of attributes by traversing the space from one side to the other and recording the order in which attributes are selected (a sketch of this strategy follows below).

 Ranker: individual evaluations of the attributes are performed and the attributes are ranked accordingly (Hua-Liang and Billings, 2007). It is normally used in conjunction with attribute evaluators (Relief, GainRatio, Entropy etc.).

 Genetic Search: Genetic Algorithms (GAs) (Goldberg, 1989) are optimization techniques that use a population of candidate solutions. They explore the search space by evolving the population through four steps: parent selection, crossover, mutation, and replacement. GAs have been seen as search procedures that can locate high-performance regions of vast and complex search spaces, but they are not well suited to fine-tuning solutions (Holland, 1992). However, the components of a GA may be specifically designed, and its parameters tuned, in order to provide effective local search behaviour.
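A minimal sketch of the GreedyStepwise idea referenced above, under illustrative assumptions (naive Bayes as the evaluator and cross-validated accuracy as the merit function, neither of which is prescribed by the method itself):

```python
# Sketch of GreedyStepwise-style forward selection: start from the empty
# set and add, at each step, the attribute whose addition most improves
# cross-validated accuracy, stopping when no addition helps.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = (X[:, 1] - X[:, 4] > 0).astype(int)

selected, best = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    scores = {f: cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best:          # stopping criterion: no improvement
        break
    selected.append(f); remaining.remove(f); best = score
print("selected features:", selected, "cv accuracy:", round(best, 3))
```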
c. Evaluation strategy

How feature subsets are evaluated is the single biggest differentiating factor among feature selection algorithms for machine learning. One paradigm, dubbed the filter (using distance, information, consistency or dependency metrics) (Kohavi, 1995; Kohavi and John, 1996), operates independently of any machine learning algorithm – undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. Another school of thought argues that the bias of the particular induction algorithm should be taken into account when selecting features. This method, called the wrapper (using predictive accuracy or cluster goodness), employs an induction algorithm along with a statistical re-sampling technique, such as cross-validation, to estimate the final accuracy of feature subsets.

d. Stopping criterion

A feature selector must decide when to stop searching through the space of feature subsets. Depending on the evaluation strategy, a feature selector might stop adding or removing features when none of the alternatives improves upon the merit of the current feature subset. Alternatively, the algorithm might continue to revise the feature subset as long as the merit does not degrade.

2.4.3 Filter-based feature selection methods

Among the evaluation strategies used by feature selection methods, filter-based feature selection (FS) methods were considered in this study to determine the relevant features among those present in the data collected from the study location (Maji and Garai, 2013). This is because filter-based FS algorithms define relevance by identifying the attributes that are most correlated with the target class, and because they are less computationally expensive than wrapper-based FS algorithms, which require repeated runs of the supervised machine learning algorithm. The three (3) classes of filter-based feature selection methods considered are as follows:
 Consistency-based

Consistency measures attempt to find a minimum number of features that distinguish between the classes as consistently as the full set of features does. An inconsistency arises when multiple training samples have the same feature values but different class labels. Dash and Liu (1997) presented an inconsistency-based FS technique called Set Cover. An inconsistency count is calculated as the difference between the number of all matching patterns (excluding the class label) and the largest number of patterns of different class labels of a chosen subset. If there are $n$ matching patterns in the training sample space, with $c_1$ patterns belonging to class 1 and $c_2$ patterns belonging to class 2, and if the largest number is $c_2$, the inconsistency count will be $n - c_2$. Hence, given a training sample $S$, the inconsistency count of an instance $A$ with respect to a feature subset $X'$ is defined as (Liu and Motoda, 1998):

$$ IC_{X'}(A) = |S_{X'}(A)| - \max_k |S^{k}_{X'}(A)| $$

where $|S_{X'}(A)|$ is the number of instances in $S$ equal to $A$ using only the features in $X'$, and $|S^{k}_{X'}(A)|$ is the number of instances in $S$ of class $k$ equal to $A$ using only the features in $X'$. By summing all the inconsistency counts and averaging over the size of the training sample, a measure called the inconsistency rate for a given subset is defined. The inconsistency rate of a feature subset $X'$ in a sample $S$ is then:

$$ IR(X') = \frac{\sum_{A \in S} IC_{X'}(A)}{|S|} $$

 Correlation-based (CFS)

Correlation measures are also called similarity measures or dependency measures. Gennari et al. (1989) stated that features are relevant if their values vary systematically with category membership; thus, a feature is useful if it is correlated