Classification algorithms play an important role in different business areas, such as fraud detection, cross selling or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by other non-interpretable algorithms such as Random Forest. In this talk Antonio Soriano and Mateo Alvarez presented a distributed implementation in Spark of the Logistic Model Tree (LMT) algorithm (Landwehr, et al. (2005). Machine Learning, 59(1-2), 161-205.), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs equally or better than other popular algorithms in several performance metrics such as accuracy, precision/recall or area under the ROC curve.
2. Aerospace Engineer, MSc in
Propulsion Systems (UPM), Master
in Data Science (URJC).
Working as data scientist and Big
Data developer at Stratio Big Data in
the data science department
mateo-alvarez
3. Ph.D. in Telecommunications, MSc in
Electronic Systems Engineering and
Telecommunication Technologies,
Systems and Networks (UPV), and MSc
“Big Data Expert” (UTAD).
Working as data scientist and Big Data
developer at at Stratio Big Data in the
data science department
@Phd_A_Soriano
4. Why using interpretable algorithms instead of “black boxes”
Logistic Regression
Decision Trees
Variance-Bias tradeoff
Metrics
Demo
Logistic Model Trees
Distributed implementation
Cost function & configuration params
Demo
5. • Why use interpretable algorithms instead of “black boxes”
• Logistic Regression
• Decision Trees
• Variance-Bias tradeoff
@StratioBD
26. LMT Cost function to fix the logistic regression threshold
• AccuracyCostFunction
• ConfusionMatrix
• PrecisionCostFunction
• PrecisionRecallCostFunction
• RocCostFunction
The same cost function for pruning criteria
Performance
metrics
Performance
metrics
Performance
metrics
27. Big datasets
Power of spark to distribute building the
tree and logistic regressions
ADVANTAGES OF THIS IMPLEMENTATION
Medium datasets
Distributed tree growth and weka’s
logistic regression
Small datasets
Although it can be slow to
distribute the data for the decision
tree, cost functions can be still
used and specific optimization for
particular cases