
Smart Scalable Feature Reduction with Random Forests with Erik Erlandson


Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.

In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data having many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data.

Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.


  1. Erik Erlandson, Red Hat, Inc. Smart Scalable Feature Reduction with Random Forests
  2. Erik Erlandson • Software Engineer • Radanalytics.io community • Apache Spark on OpenShift • Intelligent Applications in the cloud
  3. Talk • Motivate Feature Reduction • Random Forest Clustering • T-Digest Feature Sketching • RF Feature Reduction • Example: Tox21 Assay Data
  4. Features: measurable properties! A feature vector such as (2.3, 1.0, 0.0, 1.0, 3.1, 4.2, 6.9, 0.0, 7.3) feeds model training, then evaluation, then results.
  5. Feature Reduction: Full Feature Set → Identify Useful Features → Reduced Feature Set
  6. Feature Sets Can Be Very Large: hundreds, thousands ... millions
  7. Features Cost Resources: Memory, Disk, Network, Time, Energy
  8. Features Inject Noise: model training without reduction vs. with feature reduction
  9. Features Impact Model Size: model without reduction vs. with reduction
  10. Representation & Transfer Learning
  11. Random Forests: Leo Breiman (2001). An ensemble of decision tree models; each tree trains on a random subset of the data, and each split considers a random subset of the features.
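      For concreteness, a minimal sketch (not the speaker's code) of training such an ensemble with Spark's RDD-based MLlib API, the LabeledPoint style used later in these slides; the parameter values here are illustrative, not the talk's settings:
      import org.apache.spark.rdd.RDD
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.RandomForest
      import org.apache.spark.mllib.tree.model.RandomForestModel

      def trainRF(data: RDD[LabeledPoint]): RandomForestModel =
        RandomForest.trainClassifier(
          data,
          numClasses = 2,                                 // e.g. "real" vs "synthetic"
          categoricalFeaturesInfo = Map.empty[Int, Int],  // all features continuous
          numTrees = 50,                                  // ensemble of decision trees
          featureSubsetStrategy = "sqrt",                 // each split considers a random feature subset
          impurity = "gini",
          maxDepth = 5,
          maxBins = 32)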
  12. Random Forest Clustering: a feature vector (F1, F2, ..., Fm) runs through the RF model, yielding one leaf node ID per tree (e.g. 34, 12, ..., 73). Cluster these leaf node IDs!
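      A rough sketch, not from the talk, of reading off the leaf node each tree assigns to a feature vector by walking MLlib's public tree nodes (continuous splits only, for brevity); the vector of per-tree leaf IDs is what gets clustered:
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

      def leafId(root: Node, fv: Vector): Int = {
        var node = root
        while (!node.isLeaf) {
          val s = node.split.get
          // continuous-split convention: go left when the feature value is <= threshold
          node = if (fv(s.feature) <= s.threshold) node.leftNode.get else node.rightNode.get
        }
        node.id
      }

      // one leaf ID per tree, e.g. Seq(34, 12, ..., 73)
      def leafIds(rf: RandomForestModel, fv: Vector): Seq[Int] =
        rf.trees.map(t => leafId(t.topNode, fv)).toSeq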
  13. Two key benefits of RF Clustering: the features used by the RF model are far fewer than the full feature set, and RF training ignores unhelpful features.
  14. Data with a Joint Distribution in R^2
  15. Data with Synthetic Data Added
  16. RF Rules for Data (non-synthetic):
      List((x2 <= -1.32), (x1 <= 0.87))
      List((x1 > -1.37), (x2 > 1.03))
      List((x2 <= 2.09), (x1 <= 0.87))
      List((x1 <= 2.13), (x2 <= -1.32))
      List((x2 <= -2.31), (x1 <= 0.87))
      (plot highlights the rule x1 > -1.37, x2 > 1.03)
  17. RF Rules in Feature Space
  18. What Features Did the RF Use?
      List((x2 <= -1.32), (x1 <= 0.87))
      List((x1 > -1.37), (x2 > 1.03))
      List((x2 <= 2.09), (x1 <= 0.87))
      List((x1 <= 2.13), (x2 <= -1.32))
      List((x2 <= -2.31), (x1 <= 0.87))
      reduced = {“x1”, “x2”}
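      A small sketch of my own that collects the indices of features appearing in any split of any tree, i.e. the "used" feature set the reduction keeps:
      import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

      def splitFeatures(node: Node): Set[Int] =
        if (node.isLeaf) Set.empty[Int]
        else Set(node.split.get.feature) ++
          node.leftNode.map(splitFeatures).getOrElse(Set.empty) ++
          node.rightNode.map(splitFeatures).getOrElse(Set.empty)

      // union of split features over all trees in the forest
      def usedFeatures(rf: RandomForestModel): Set[Int] =
        rf.trees.map(t => splitFeatures(t.topNode)).reduce(_ ++ _)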
  19. T-Digest Sketches a Distribution: a stream of values (3.4, 6.0, 2.5, ⋮) is compressed into a sketch of the CDF, P(X <= x), over the data domain.
  20. Inverse Transform Sampling: draw q from U[0,1], then return the x where CDF(x) = q.
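      The t-digest's sample method, used on the next slides, applies this idea to its sketched CDF; here is a generic, self-contained illustration against a plain sorted sample (the names are mine, not the library's):
      import scala.util.Random

      // draw q ~ U[0,1], then return the value whose empirical CDF first reaches q
      def sampleFromCDF(sortedData: IndexedSeq[Double], rng: Random): Double = {
        val q = rng.nextDouble()
        val idx = math.min((q * sortedData.length).toInt, sortedData.length - 1)
        sortedData(idx)
      }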
  21. T-Digests Can Aggregate: in Spark, each data partition (P1, P2, ..., Pn) maps to a t-digest, and the per-partition t-digests combine (|+|) into a single result.
  22. Sketching a Feature:
      // build a t-digest sketch of one feature column (an RDD of Doubles)
      feature.aggregate(TDigest.empty())(
        (td, x) => td + x,        // add each sample to the partition's sketch
        (td1, td2) => td1 ++ td2  // merge the per-partition sketches
      )
  23. Synthesizing Data from TDigests:
      // draw n synthetic feature vectors, sampling each feature from its t-digest sketch
      def synthesize(tdVec: Vector[TDigest], n: Int) = {
        val tdVecBC = sc.broadcast(tdVec)    // share the sketches with the executors
        sc.parallelize(1 to n).map { _ =>
          tdVecBC.value.map(_.sample)        // one sample per feature sketch
        }
      }
  24. Random Forest Training Data:
      val fvSketches = sketchFV(trainFV)                  // per-feature t-digest sketches
      val synthFV = synthesize(fvSketches, 48000)         // synthetic feature vectors
      val trainLab = trainFV.map(_.toLabeledPoint(1.0))   // real data labeled 1
      val synthLab = synthFV.map(_.toLabeledPoint(0.0))   // synthetic data labeled 0
      val trainFR = trainLab ++ synthLab                  // combined RF training set
  25. Random Forest Feature Reduction: the reduced feature set {“f1”, “f2”, … }
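      Tying the pieces together, a hedged sketch of this final step, reusing trainFR from slide 24 and the trainRF / usedFeatures sketches above; in practice the resulting indices map back to feature names such as “f1”, “f2”:
      val rf = trainRF(trainFR)         // the forest learns to separate real rows from synthetic rows
      val reduced = usedFeatures(rf)    // indices of the features the forest actually split on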
  26. Tox21 Data Challenge: National Institutes of Health (2014). 12 toxicity assays; 12,060 compounds + 647 hold-out. https://tripod.nih.gov/tox21/challenge/index.jsp
  27. DeepTox: Johannes Kepler University Linz, Institute of Bioinformatics. http://bioinf.jku.at/research/DeepTox/tox21.html
      [Mayr2016] Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). DeepTox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science, 3:80.
      [Huang2016] Huang, R., Xia, M., Nguyen, D. T., Zhao, T., Sakamuru, S., Zhao, J., Shahane, S., Rossoshek, A., & Simeonov, A. (2016). Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85.
  28. Tox21 Data: 801 dense features (I used these) and 272K sparse features; each assay is represented on a different subset of the compounds.
      +---------------+------+-----+---------+------------+-----
      |       compound|NR.AhR|NR.AR|NR.AR.LBD|NR.Aromatase| ...
      +---------------+------+-----+---------+------------+-----
      |NCGC00261900-01|     0|    1|       NA|           0| ...
      |NCGC00260869-01|     0|    1|       NA|          NA| ...
      |NCGC00261776-01|     1|    1|        0|          NA| ...
      |NCGC00261380-01|    NA|    0|       NA|           1| ...
      |NCGC00261842-01|     0|    0|        0|          NA| ...
      |NCGC00261662-01|     1|    0|        0|          NA| ...
      |NCGC00261190-01|    NA|    0|        0|          NA| ...
  29. Experiment: train models on all 12 assays; perform Random Forest Feature Reduction; train similar models on the reduced feature set; compare the models on each assay.
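      One way the "train similar models on the reduced feature set" step could look, as a hedged sketch: project each labeled point onto the reduced feature indices before retraining. assayTrain, project, and keep are illustrative names; reduced is the index set from the sketch after slide 25.
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint

      // keep only the reduced feature indices (sorted for a stable column order)
      def project(lp: LabeledPoint, keep: Array[Int]): LabeledPoint =
        LabeledPoint(lp.label, Vectors.dense(keep.map(lp.features(_))))

      val keep = reduced.toArray.sorted
      val assayTrainReduced = assayTrain.map(project(_, keep))   // then retrain LR / SVM / GBT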
  30. 85 of 801 Features Were Used (feature, number of trees that used it)
      RNCS        21
      MRVSA7      20
      VSAEstate2  19
      VSAEstate3  18
      slogPVSA8   18
      VSAEstate0  17
      slogPVSA6   16
      RDFM29      12
      slogPVSA3   12
      RDFM30      12
  31. Full vs Reduced (Logistic Regression)
  32. Full vs Reduced (Boosted Decision Tree Ensemble)
  33. Full vs Reduced (SVM)
  34. Training Times (seconds)
                            Full (801)   Reduced (85)
      Logistic Regression         68.5           46.8
      SVM                         35.3           33.8
      GB Tree Ensemble           247             65.0
  35. Evaluation Times (seconds)
                            Full (801)   Reduced (85)
      Logistic Regression         32.1           3.88
      SVM                          0.59          0.23
      GB Tree Ensemble             1.33          0.88
  36. Thank You! Erik Erlandson, eje@redhat.com, @manyangled, https://github.com/erikerlandson/feature-reduction-talk
