Decision trees can be used in medical diagnosis to mimic a doctor's decision-making process. This article walks through building a decision tree model on sample patient data that includes physiological measurements and the drug that was prescribed. The trained tree reveals that cholesterol level is the most important factor for determining which drug to prescribe, even though it splits the data on sodium level first because of the Gini impurity criterion. Visualizing the trained tree shows the sequence of decisions and identifies cholesterol and electrolyte levels as key factors in drug selection.
Today we will see how we can use the decision tree algorithm in the medical domain. A decision tree is a simple yet powerful supervised learning algorithm that resembles a flow chart; we will talk more about this in just a minute. Decision trees are commonly used in fields such as medicine, astronomy (for example, for filtering noise from Hubble Space Telescope images or to classify star-galaxy clusters), manufacturing and production (for example, by Boeing to discover flaws in the manufacturing process), and object recognition (for example, for recognizing 3D objects).
The main goal of this article is to talk about using decision trees to make a medical diagnosis: how we can use a decision tree in the medical domain to extract as much information as possible and mimic a doctor's decision-making process. Let's consider an example where a number of patients have suffered from the same illness, such as a rare form of "Basorexia".
Let's further assume that the true causes of the disease remain unknown to this day, and that all the information available to us consists of a bunch of physiological measurements. For example, we might have access to the following information:
A patient's blood pressure ('BP')
A patient's cholesterol level ('cholesterol')
A patient's gender ('sex')
A patient's age ('age')
A patient's blood sodium concentration ('Na')
A patient's blood potassium concentration ('K')
Based on all this information, let's suppose a doctor made recommendations to treat each patient's disease using one of four possible drugs: drug A, B, C, or D. We have such data available for 20 different patients:
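In code, the dataset might be represented as a list of dictionaries, one per patient. The actual 20 records are not reproduced here, so the two entries below are invented purely to illustrate the format:

    data = [
        {'age': 33, 'sex': 'F', 'BP': 'high', 'cholesterol': 'high',
         'Na': 0.66, 'K': 0.06, 'drug': 'A'},   # values made up for illustration
        {'age': 77, 'sex': 'M', 'BP': 'low', 'cholesterol': 'normal',
         'Na': 0.19, 'K': 0.03, 'drug': 'D'},
        # ... 18 more patient records in the same format
    ]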
From the data, we can ask: what was the doctor's reasoning for prescribing drug A, B, C, or D? Can we see a relationship between a patient's blood values and the drug that the doctor prescribed?
Let's see if a decision tree can uncover these hidden relationships.
Understanding the data
This is the first step in tackling any new ML problem: if you can get a good sense of the data, most of the task is already done. You will notice that the 'drug' column is not a feature value like all the other columns, so it becomes the target label. In other words, the inputs to our ML algorithm will be all the blood values, the age, and the gender of a patient, and the output will be a prediction of which drug to prescribe. Since the 'drug' column is categorical rather than numerical (though we could easily encode it for such a small dataset), we know that this is a classification task.
Thus, it would be a good idea to remove all 'drug' entries from the dictionaries. For this, we need to go through the list and extract the 'drug' entry, which is easiest to do with a list comprehension.
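A minimal sketch, assuming the list of dictionaries is called data, as above:

    # pop() removes the 'drug' entry from each dictionary and returns it,
    # so 'target' holds the labels and 'data' keeps only the features
    target = [d.pop('drug') for d in data]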
For the sake of simplicity, we may want to focus on the numerical features first: 'age', 'K', and 'Na'. We can plot them using Matplotlib as follows:
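The plotting code itself is not shown in the original, but it might look something like this sketch:

    import matplotlib.pyplot as plt

    # extract the three numerical features
    age = [d['age'] for d in data]
    sodium = [d['Na'] for d in data]
    potassium = [d['K'] for d in data]

    # a first, single-color look at sodium vs. potassium
    plt.scatter(sodium, potassium)
    plt.xlabel('Na')
    plt.ylabel('K')
    plt.show()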
However, this plot is not very informative, because all data points have the same color. We want each data point to be colored according to the drug that was prescribed. So, let's convert the labels 'A' to 'D' into numerical values. For this, we can use the ASCII value of each character.
[Find more at: http://www.asciitable.com]
In Python, this is accessible via the ord function. For example, the character 'A' has value 65, 'B' has 66, and so on.
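The conversion is presumably a one-liner along these lines:

    # map 'A'..'D' to 0..3 using their ASCII codes (ord('A') == 65)
    target = [ord(t) - 65 for t in target]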
We can now pass these integers to Matplotlib's scatter function, which will know to choose different colors for the different class labels (c=target in the following code).
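The original code is not included; a plausible sketch (the exact feature pairings in each subplot are an assumption) is:

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # each subplot shows one pair of features, colored by the prescribed drug
    axes[0, 0].scatter(sodium, potassium, c=target)
    axes[0, 0].set(xlabel='Na', ylabel='K')
    axes[0, 1].scatter(age, potassium, c=target)
    axes[0, 1].set(xlabel='age', ylabel='K')
    axes[1, 0].scatter(age, sodium, c=target)
    axes[1, 0].set(xlabel='age', ylabel='Na')
    axes[1, 1].scatter(sodium, age, c=target)
    axes[1, 1].set(xlabel='Na', ylabel='age')
    plt.show()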
The preceding code will produce a figure with a 2x2 grid containing four subplots.
Can you spot any relationship between the feature values and the target labels?
There are some interesting observations we can make. For example, from the first and third subplots, we can see the light blue points clustered around high sodium levels. Similarly, all red points seem to have both low sodium and low potassium levels. The rest is less clear. So let's see how a decision tree can help us here.
Preprocessing the data
In order for our data to be understood by the decision tree algorithm, we need to convert all categorical features into numerical features. We will do this using scikit-learn's DictVectorizer, feeding the dataset that we want to convert to the fit_transform method. If we then want to look at the first data point, we can match the feature names with the corresponding feature values:
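A sketch of both the conversion and the inspection:

    from sklearn.feature_extraction import DictVectorizer

    vec = DictVectorizer(sparse=False)
    data_pre = vec.fit_transform(data)

    # feature names and the values of the first data point, side by side
    # (use vec.get_feature_names() on older scikit-learn versions)
    print(vec.get_feature_names_out())
    print(data_pre[0])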
To make sure that our data variables are compatible with OpenCV, we need to convert everything to floating point values. Then all that's left is to split the data into 15 training and 5 test samples:
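A sketch of both steps; the random_state value is an arbitrary choice, and I keep the labels as integers so that OpenCV treats the problem as classification:

    import numpy as np
    from sklearn import model_selection

    # OpenCV expects a 32-bit float feature matrix
    data_pre = np.array(data_pre, dtype=np.float32)
    # integer labels mark this as a classification problem for OpenCV
    target = np.array(target, dtype=np.int32)

    # hold out 5 of the 20 samples for testing
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        data_pre, target, test_size=5, random_state=42)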
Constructing the tree
Building a decision tree with OpenCV is relatively easy compared to other algorithms. We can create an empty decision tree, train it on the training data using the train method, predict the labels of new data points with predict, and even check the accuracy score, as follows:
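A sketch of these steps (the two setter calls are a workaround: some OpenCV builds fail to train without them):

    import cv2

    # create an empty decision tree
    dtree = cv2.ml.DTrees_create()
    dtree.setCVFolds(1)    # some builds require these to be set explicitly
    dtree.setMaxDepth(10)

    # train on the 15 training samples
    dtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train)

    # predict the labels of the 5 test samples
    _, y_pred = dtree.predict(X_test)

    # fraction of test samples predicted correctly
    print(np.mean(y_pred.flatten() == y_test))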
This shows us that we only got 40 percent of the test samples right. Since there are only 5 test samples, getting 2 out of 5 right is not unreasonable for such a small dataset.
Let's check how the tree performs on the training set:
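Again as a sketch:

    # accuracy on the training set
    _, y_pred_train = dtree.predict(X_train)
    print(np.mean(y_pred_train.flatten() == y_train))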
Voila! The decision tree performs remarkably well on the training data, scoring 100 percent. This is called overfitting, and we will talk about it later.
Visualizing a trained decision tree
I think it's time to switch to scikit-learn. Its implementation allows us to customize the algorithm and makes it a lot easier to investigate the inner workings of the tree. It resides under the tree module.
We can create an empty decision tree using the DecisionTreeClassifier constructor, train it using the fit method, and compute the accuracy score on both training and test samples using the score method:
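A minimal sketch of these three steps:

    from sklearn import tree

    dtc = tree.DecisionTreeClassifier()
    dtc.fit(X_train, y_train)

    print(dtc.score(X_train, y_train))  # accuracy on the training set
    print(dtc.score(X_test, y_test))    # accuracy on the test set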
Here's the cool thing: if you want to know what the tree looks like, you can use GraphViz to create a PDF file (or any other supported file type) from the tree structure. You have to install GraphViz first, using a conda or pip command:
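For example (note that the graphviz Python package also needs the GraphViz system binaries to be present):

    # with pip:
    pip install graphviz
    # or with conda:
    conda install python-graphviz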
Then, back in the IDE, you can export the tree in GraphViz format to a file tree.dot using scikit-learn's export_graphviz exporter:
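A sketch using scikit-learn's exporter:

    from sklearn.tree import export_graphviz

    # write the tree structure to tree.dot in GraphViz format
    export_graphviz(dtc, out_file='tree.dot',
                    feature_names=vec.get_feature_names_out())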
Then, back on the command line, you can use GraphViz to turn tree.dot into (for example) a PNG file:
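For example:

    dot -Tpng tree.dot -o tree.png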
Investigating the inner workings of a decision tree
The process starts at the root node, where we split the data into two groups based on some decision rule. The process is then repeated until all remaining samples have the same target label, at which point we have reached a leaf node. You can see that the first question asked was whether the sodium concentration was smaller than or equal to 0.72. This resulted in two subgroups:
All data points where Na <= 0.72 (node 1), which was true for 9 data points
All data points where Na > 0.72 (node 2), which was true for 6 data points
At node 1, the next question asked was whether the remaining data points did not have high cholesterol levels, which was true for 5 data points and false for 4. At node 3, all 5 remaining data points had the same target label, which was drug C (class = C), meaning that there was no more ambiguity to resolve. We call such nodes pure; thus, node 3 became a leaf node. Back at node 4, the next question asked was whether sodium levels were lower than 0.445 (Na <= 0.445), and the remaining 4 data points were split into node 7 and node 8. At this point, both node 7 and node 8 became leaf nodes.
Rating the importance of features
The preceding root node split the data according to Na <= 0.72, but who told the tree to focus on sodium first? And where does the number 0.72 come from, anyway?
scikit-learn provides a way to rate feature importance: a number between 0 and 1 for each feature, where 0 means the feature was not used at all in any decision and 1 means the feature perfectly predicts the target.
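In scikit-learn, these ratings live in the feature_importances_ attribute of a trained tree; a sketch of how to inspect them:

    # pair every feature name with its importance score
    for name, score in zip(vec.get_feature_names_out(),
                           dtc.feature_importances_):
        print(name, score)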
Now it becomes evident that the most telling feature for knowing which drug to administer was actually whether the patient had a normal cholesterol level. Age, sodium, and potassium levels were also important. Gender and blood pressure did not seem to make any difference, but that doesn't mean this information is useless.
But hold on: if cholesterol level is so important, why was it not picked as the first feature in the tree (that is, the root node)? Why would you choose to split on the sodium level first? This is where I need to tell you about that ominous "gini" label in the earlier figure.
criterion='gini': The Gini impurity is a measure of misclassification, with the aim of minimizing the probability of misclassification. A perfect split of the data, where each subgroup contains data points of a single target label, would result in a Gini index of 0. We can measure the Gini index of every possible split of the tree, and then choose the one that yields the lowest Gini impurity.
criterion='entropy': Also known as information gain. Entropy is a measure of the amount of uncertainty associated with a signal or distribution. A perfect split of the data would have 0 entropy.
If you want to use entropy, you would type the following:
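Presumably along these lines:

    # same classifier, but splitting on information gain instead of Gini
    dtce = tree.DecisionTreeClassifier(criterion='entropy')
    dtce.fit(X_train, y_train)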
So, this was a very simple approach to using decision trees for medical diagnosis. If you want to practice on a real dataset, you can use the Breast Cancer (Wisconsin) dataset from the UCI Machine Learning repo.