
Distributed Logistic Model Trees


Classification algorithms play an important role in different business areas, such as fraud detection, cross-selling, or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by non-interpretable algorithms such as Random Forest. In this talk, Antonio Soriano and Mateo Álvarez present a distributed Spark implementation of the Logistic Model Tree (LMT) algorithm (Landwehr et al. (2005). Machine Learning, 59(1-2), 161-205.), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs as well as or better than other popular algorithms on several performance metrics such as accuracy, precision/recall, and area under the ROC curve.

Published in: Data & Analytics


  1. Distributed Logistic Model Trees, Stratio Intelligence. Mateo Álvarez and Antonio Soriano, @StratioBD
  2. Aerospace Engineer, MSc in Propulsion Systems (UPM), Master in Data Science (URJC). Working as a data scientist and Big Data developer in the data science department at Stratio Big Data. mateo-alvarez
  3. Ph.D. in Telecommunications; MSc in Electronic Systems Engineering and Telecommunication Technologies, Systems and Networks (UPV); MSc "Big Data Expert" (UTAD). Working as a data scientist and Big Data developer in the data science department at Stratio Big Data. @Phd_A_Soriano
  4. Agenda: Why use interpretable algorithms instead of "black boxes" • Logistic Regression • Decision Trees • Variance-Bias tradeoff • Metrics • Demo • Logistic Model Trees • Distributed implementation • Cost function & configuration params • Demo
  5. • Why use interpretable algorithms instead of "black boxes" • Logistic Regression • Decision Trees • Variance-Bias tradeoff @StratioBD
  6. Accuracy vs. Explainability: medical studies, power management, financial environment, criminal activity
  7. [Figure: probability vs. feature, with decision threshold]
  8. [Figure: probability vs. feature, with decision threshold - local bad adjust]
  9. [Figure: probability vs. feature, with decision threshold]
  10. [Figure: decision tree - root node splitting on Feature 1 into two leaf nodes]
  11. [Figure: decision tree - root node on Feature 1, child splits on Feature 2, four leaf nodes]
  12. [Figure: same tree - root node on Feature 1, splits on Feature 2, four leaf nodes - local bad adjust]
  13. [Figure: mean error vs. model complexity - training error falls while test error rises again (underfitting vs. overfitting); total error = bias² + variance]
  14. [Figure: variance-bias illustration]
  15. Bias: missing variables that are important for the problem when making the predictions.
  16. Variance: overfitting to the sample/training data.
  17. Irreducible error: prediction error that no model can remove.
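In formula form, slides 13-17 correspond to the standard decomposition of the expected squared error (a textbook result, stated here to summarize the figures):

```latex
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```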
  18. • Logistic Model Trees • Distributed implementation • Cost function & configuration parameters • Demo @StratioBD
  19. [Figure: root node with its performance metrics]
  20. [Figure: root node split on Feature 1, with performance metrics for each child node]
  21. [Figure: root node split on Feature 1, with performance metrics for every candidate child node]
  22. [Figure: tree with root node on Feature 1 and two child splits on Feature 2]
  23. [Figure: tree with root node on Feature 1 and one split on Feature 2]
  24. [Figure: tree with root node on Feature 1 and one split on Feature 2]
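The structure grown in the slides above - a decision tree whose leaves each hold a logistic model - can be sketched in a few lines of plain Python. This is a hypothetical illustration, not Stratio's Spark implementation; the tree layout and weights are made up:

```python
import math

def sigmoid(z):
    # logistic function: maps a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def lmt_predict_proba(x, node):
    # Walk the tree: internal nodes split on a feature/threshold,
    # leaves hold the weights and intercept of a logistic model.
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    weights, intercept = node["leaf"]
    score = sum(w * xi for w, xi in zip(weights, x)) + intercept
    return sigmoid(score)

# Toy two-leaf tree; in a real LMT the per-leaf weights come from fitting
# a logistic regression on the examples that reach each leaf.
tree = {
    "feature": 0, "threshold": 0.5,
    "left":  {"leaf": ([2.0, -1.0], 0.3)},
    "right": {"leaf": ([-0.5, 1.5], -0.2)},
}

p = lmt_predict_proba([0.2, 1.0], tree)  # routed to the left leaf
```

Interpretability follows from this shape: each prediction is explained by the path of feature/threshold tests plus one small linear model.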
  25. DISTRIBUTED IMPLEMENTATION: Spark's Decision Tree (the distributed implementation underlying random forests) for growing the tree; Spark's or Weka's Logistic Regression on the nodes.
  26. LMT cost function to fix the logistic regression threshold: • AccuracyCostFunction • ConfusionMatrix • PrecisionCostFunction • PrecisionRecallCostFunction • RocCostFunction. The same cost function is used as the pruning criterion.
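As a rough illustration of what a cost function such as AccuracyCostFunction does when fixing the threshold, the sketch below scans candidate thresholds and keeps the one with the best score. Function and variable names are invented for this example, not Stratio's API:

```python
def accuracy_cost(labels, probs, threshold):
    # fraction of examples classified correctly when probabilities are
    # cut at `threshold` (predict 1 if prob >= threshold, else 0)
    preds = [1 if pr >= threshold else 0 for pr in probs]
    return sum(1 for y, yh in zip(labels, preds) if y == yh) / len(labels)

def best_threshold(labels, probs, candidates, cost_fn):
    # keep the candidate threshold that maximizes the chosen cost function
    return max(candidates, key=lambda t: cost_fn(labels, probs, t))

labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.6, 0.8]
t = best_threshold(labels, probs, [0.2, 0.3, 0.5, 0.7], accuracy_cost)
```

Swapping `accuracy_cost` for a precision-, recall-, or ROC-based score changes which threshold wins, which is why the choice of cost function is a configuration parameter.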
  27. ADVANTAGES OF THIS IMPLEMENTATION. Big datasets: the power of Spark to distribute both tree building and the logistic regressions. Medium datasets: distributed tree growth with Weka's logistic regression. Small datasets: distributing the data for the decision tree can be slow, but the cost functions can still be used, with specific optimizations for particular cases.
  28. Example of the DLMT algorithm on a synthetic dataset
  29. • Metrics • Demo @StratioBD
  30. Confusion matrix - rows: TRUE CONDITION (Positive / Negative), columns: PREDICTION (Positive / Negative): True Positives | False Negatives; False Positives | True Negatives. True Positive Rate (Recall) TPR = TP/(TP+FN) - insensitive to unbalance. False Positive Rate FPR = FP/(FP+TN) - insensitive to unbalance. Precision = TP/(TP+FP) - sensitive to unbalance. Accuracy = (TP+TN)/(TP+TN+FP+FN) - sensitive to unbalance.
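The formulas on this slide translate directly to code. A small helper (hypothetical names, not part of the talk's codebase) makes the unbalance sensitivity easy to check on a skewed example:

```python
def confusion_metrics(tp, fn, fp, tn):
    # metrics defined on the 2x2 confusion matrix
    return {
        "recall_tpr": tp / (tp + fn),                 # insensitive to unbalance
        "fpr": fp / (fp + tn),                        # insensitive to unbalance
        "precision": tp / (tp + fp),                  # sensitive to unbalance
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # sensitive to unbalance
    }

# unbalanced example: 10 true positives-class examples vs. 90 negatives
m = confusion_metrics(tp=8, fn=2, fp=4, tn=86)
```

Here recall stays at 0.8 regardless of how many negatives are added, while precision and accuracy shift with the class ratio.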
  31. AUROC (AUC): area under the ROC curve (TPR vs. FPR) -> insensitive to unbalance! Best performance towards the top-left of the ROC plot.
  32. AUPRC: area under the Precision vs. Recall curve -> sensitive to unbalance! Best performance towards the top-right of the precision-recall plot.
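For intuition, AUROC can be computed by sweeping the decision threshold over the scores and integrating TPR against FPR. The sketch below assumes binary 0/1 labels and distinct scores (ties would need the usual diagonal-segment handling); it is an illustration, not the talk's implementation:

```python
def auroc(labels, scores):
    # Sweep the threshold from high to low, tracing the ROC curve
    # (FPR, TPR) point by point, then integrate with trapezoids.
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

a = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because TPR and FPR are each normalized within their own class, duplicating the negatives leaves the curve (and the area) unchanged - the unbalance insensitivity the slide points out.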
  33. [Figure: Automatic Benchmarking Framework (ABF) - data and algorithms feeding the benchmark]
  34. @StratioBD
  35. Summary: Accuracy vs. Explainability. Performance metrics: AUROC, AUPRC, accuracy. Automatic Benchmarking Framework (ABF).
  36. THANK YOU. UNITED STATES Tel: (+1) 408 5998830 - EUROPE Tel: (+34) 91 828 64 73 - contact@stratio.com - www.stratio.com - @StratioBD
  37. WE ARE HIRING - people@stratio.com @StratioBD
