Decision Tree for Democratic Primary
Kyla Marino, EE 471: Machine Learning
3/16/2016
Problem Statement
Politicians try to predict voter preferences mainly through landline telephone polls. This creates a
gap between the prediction and the outcome, since landline telephones have largely given way to cell
phones, which are legally protected against cold calls. A more accurate method would be to predict the
outcome from previous voting results.
Theory
A decision tree has a simple structure, reminiscent of its namesake, that breaks down into
leaf and non-leaf nodes. A leaf is a class name or decision. Each non-leaf node is an attribute test.
Decision trees attempt to divide the data set so that each attribute test separates the classes as cleanly
as possible. An optimal build starts with the most informative test at the root, followed by the next most
informative, and so on until every path ends in a leaf [1]. The best decision tree contains only non-trivial
partitions and, when several trees are possible, is the simplest one.
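The leaf and non-leaf structure described above can be sketched as a nested dictionary. This is only an illustration: the tree below, the attribute names, and the `classify` helper are hypothetical and do not reproduce the tree in Figure 1.

```python
# Hypothetical tree for illustration only; it is not the tree from Figure 1.
# A non-leaf node is a dict {attribute: {answer: subtree}}; a leaf is a class name.
example_tree = {
    "higher_education": {
        "yes": "Sanders",
        "no": {"large_population": {"yes": "Sanders", "no": "Clinton"}},
    }
}

def classify(tree, county):
    """Follow attribute tests from the root until a leaf (class name) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[county[attribute]]
    return tree

print(classify(example_tree, {"higher_education": "no", "large_population": "no"}))  # Clinton
```

Storing leaves as plain strings keeps the sketch short; a larger program might prefer a dedicated node class.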
The attribute whose partitions leave the least entropy, and therefore yield the highest information gain,
gives the most informative split. Entropy is calculated from the class proportions p_i using:
$$\mathrm{Entropy}(p) = -\sum_i p_i \log_2 p_i \qquad \text{(eq. 1)}$$
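As a minimal sketch of eq. 1, the function below computes the entropy of a list of class labels. The function name and the two label lists are assumptions made for illustration; they are not the appendix data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a sequence of class labels (eq. 1)."""
    total = len(labels)
    # p_i = n / total, and -p_i * log2(p_i) is written as p_i * log2(1 / p_i).
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

# Hypothetical labels, not the appendix data.
even_split = ["Clinton", "Sanders", "Clinton", "Sanders"]
one_sided = ["Clinton", "Clinton", "Clinton", "Clinton"]

print(entropy(even_split))  # 1.0
print(entropy(one_sided))   # 0.0
```

An evenly split two-class set gives the maximum of 1 bit, while a set with a single class gives 0.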
The information gained by splitting on an attribute is then calculated as follows, where S is the data set,
F is the attribute, and S_f is the subset of S for which F takes the value f:
$$\mathrm{Gain}(S, F) = \mathrm{Entropy}(S) - \sum_{f \in \mathrm{values}(F)} \frac{|S_f|}{|S|}\,\mathrm{Entropy}(S_f) \qquad \text{(eq. 2)}$$
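Eq. 2 can be sketched in a few lines. The row layout (a list of dictionaries), the column names, and the four example rows are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a sequence of class labels (eq. 1)."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, F): entropy of S minus the size-weighted entropy of each subset S_f (eq. 2)."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder

# Hypothetical rows, not the appendix data.
rows = [
    {"educated": "yes", "large": "yes", "winner": "Sanders"},
    {"educated": "yes", "large": "no", "winner": "Sanders"},
    {"educated": "no", "large": "yes", "winner": "Clinton"},
    {"educated": "no", "large": "no", "winner": "Clinton"},
]

print(information_gain(rows, "educated", "winner"))  # 1.0
print(information_gain(rows, "large", "winner"))     # 0.0
```

In this toy set, `educated` separates the winners perfectly and recovers the full 1 bit, while `large` leaves both subsets evenly mixed and gains nothing.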
Method
The data set was collected from the U.S. Census, focusing on three attributes of Iowa county
populations: education, population size, and minority percentage. The collected numerical values were
converted to yes/no values to allow simpler partitions (see Appendix I for the data). A population is
highly educated if more than 20% hold higher education degrees. A county is large if it has more than
20,000 people. A county has a large minority population if more than 4% of its residents are minorities.
Iowa was chosen because it was the first state to vote in the primary season.
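The three yes/no rules above can be sketched as a small helper. The function name, the key names, and the sample input are hypothetical; the thresholds are the ones stated in the text, and the minority share is assumed to be supplied directly as a percentage.

```python
def binarize_county(higher_ed_pct, population, minority_pct):
    """Turn raw county figures into the yes/no attributes used by the tree."""
    return {
        "higher_education": "yes" if higher_ed_pct > 20 else "no",   # more than 20% with degrees
        "large_population": "yes" if population > 20_000 else "no",  # more than 20,000 people
        "large_minority": "yes" if minority_pct > 4 else "no",       # more than 4% minority
    }

# Hypothetical input values for illustration.
print(binarize_county(higher_ed_pct=20.3, population=26_433, minority_pct=3.2))
# {'higher_education': 'yes', 'large_population': 'yes', 'large_minority': 'no'}
```

Keeping the thresholds in one place makes it easy to revisit them later.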
The top-level pseudocode of the decision tree is:
BEGIN
READ "data.txt";
CALCULATE entropy;
CALCULATE gain;
PRINT tree;
END
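One way the pseudocode could be fleshed out is an ID3-style recursive build, sketched below. This is an assumption about the approach, not the report's actual program: the function names are hypothetical, the rows are invented, and they are defined inline because the format of data.txt is not given.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def gain(rows, attribute, target):
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder

def build_tree(rows, attributes, target):
    """Split on the highest-gain attribute, then recurse on each subset."""
    labels = [row[target] for row in rows]
    # Leaf: the subset is pure, or no attributes remain (fall back to the majority class).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

def print_tree(tree, indent=0):
    """Print one line per attribute test and one line per leaf."""
    if not isinstance(tree, dict):
        print(" " * indent + "-> " + tree)
        return
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            print(" " * indent + f"{attribute} = {value}")
            print_tree(subtree, indent + 2)

# Hypothetical rows, not the appendix data.
rows = [
    {"higher_education": "yes", "large_population": "yes", "winner": "Sanders"},
    {"higher_education": "yes", "large_population": "no", "winner": "Sanders"},
    {"higher_education": "yes", "large_population": "no", "winner": "Sanders"},
    {"higher_education": "no", "large_population": "yes", "winner": "Sanders"},
    {"higher_education": "no", "large_population": "no", "winner": "Clinton"},
    {"higher_education": "no", "large_population": "no", "winner": "Clinton"},
]

tree = build_tree(rows, ["higher_education", "large_population"], "winner")
print_tree(tree)
```

In this toy set, higher education has the larger gain and becomes the root; the population test is only needed on the branch where education is "no".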
Results
Figure 1 shows that higher education had the greatest information gain, followed by large population and
large minority population. The lower section of the figure displays the "True Classes", the actual winner
of each data point entered into the system.
Figure 1. Decision Tree Clinton/Sanders Results
Conclusion
The final decision tree is not as simple as it could be. The second and third large minority population
nodes are unnecessary, since both Sanders and Clinton win under both the yes and the no answer in each
of those branches, so the test does not separate the two candidates.
Future runs of the decision tree should use a larger data sample. It may also be beneficial to
re-evaluate the threshold behind each attribute's yes/no answer. A cutoff of 20,000 may be too low for a
county to count as having a large population.
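One possible way to re-evaluate a threshold is to compare the information gain of several candidate cutoffs. The sketch below does this for population size; the helper name, the candidate cutoffs, and the six data points are hypothetical and are not taken from the appendix.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def gain_for_threshold(populations, winners, threshold):
    """Information gain of the yes/no split 'population > threshold'."""
    large = [w for p, w in zip(populations, winners) if p > threshold]
    small = [w for p, w in zip(populations, winners) if p <= threshold]
    # Skip an empty side so it contributes nothing to the weighted sum.
    remainder = sum(len(part) / len(winners) * entropy(part) for part in (large, small) if part)
    return entropy(winners) - remainder

# Hypothetical data, not the appendix data.
populations = [5_000, 12_000, 25_000, 40_000, 60_000, 90_000]
winners = ["Clinton", "Clinton", "Clinton", "Sanders", "Sanders", "Sanders"]

candidates = [10_000, 20_000, 30_000, 50_000]
for threshold in candidates:
    print(threshold, round(gain_for_threshold(populations, winners, threshold), 3))

best = max(candidates, key=lambda t: gain_for_threshold(populations, winners, t))
print("best cutoff:", best)  # best cutoff: 30000
```

With only a handful of counties, a cutoff chosen this way can easily overfit, which is another reason to collect a larger sample first.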
Appendix I:
#    County        Higher Education (%)    Population          Minority Population    Winner
1    Adair         16.3    no              7,454      no       98.2    no             Clinton
2    Adams         13.7    no              3,875      no       97.8    no             Clinton
3    Boone         20.3    yes             26,433     yes      96.8    no             Sanders
4    Butler        15      no              15,006     no       98.1    no             Sanders
5    Calhoun       19.1    no              9,866      no       96.4    no             Clinton
6    Carroll       18.8    no              20,562     yes      95.5    yes            Clinton
7    Cedar         19.5    no              18,411     no       95.9    yes            Sanders
8    Cerro Gordo   21      yes             43,254     yes      95.7    yes            Clinton
9    Cherokee      19.6    no              11,836     no       97      no             Sanders
10   Chickasaw     13.8    no              12,264     no       98.3    no             Clinton
11   Clinton       17.7    no              48,051     yes      94      yes            Sanders
12   Dallas        43.6    yes             77,400     yes      92.7    yes            Clinton
13   Davis         16.4    no              8,781      no       98.3    no             Clinton
14   Des Moines    18.9    no              40,255     yes      90.4    yes            Sanders
15   Fremont       20.5    yes             7,022      no       97.6    no             Sanders
16   Greene        17.4    no              9,200      no       97.4    no             Clinton
17   Grundy        20.7    yes             12,375     no       98.3    no             Sanders
18   Guthrie       17.1    no              10,722     no       98      no             Clinton
19   Jefferson     31.2    yes             17,325     no       85.6    yes            Sanders
20   Jones         17.2    no              20,454     yes      96.2    no             Sanders
     Iowa (State)  25.7                    3,107,126           92.1

The values in the Minority Population column appear to be the non-minority share in percent: every
county marked yes has a value below 96, which corresponds to a minority share above 4%.
H(C) = 1.581

Education     0.521
Population    0.530
Minority      0.530

                  Higher Education    Large Population    Large Minority
Attribute|yes     0.275               0.345               0.345
Attribute|no      0.690               0.647               0.647
SUM               0.965               0.992               0.992
Gain              0.616               0.589               0.589
Description: Predict a county's Democratic primary winner based on features of its population.
A population is highly educated if more than 20% hold higher education degrees, large if it has
more than 20,000 people, and has a large minority population if more than 4% are minorities.