Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features. In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data having many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data. Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.


- 1. Erik Erlandson Red Hat, Inc. Smart Scalable Feature Reduction with Random Forests
- 2. Erik Erlandson • Software Engineer • Radanalytics.io community • Apache Spark on OpenShift • Intelligent Applications in the cloud
- 3. Talk • Motivate Feature Reduction • Random Forest Clustering • T-Digest Feature Sketching • RF Feature Reduction • Example: Tox21 Assay Data
- 4. Features are measurable properties! Feature vector [2.3, 1.0, 0.0, 1.0, 3.1, 4.2, 6.9, 0.0, 7.3] → Model Training → Evaluation → Results
- 5. Feature Reduction: Full Feature Set → Identify Useful Features → Reduced Feature Set
- 6. Feature Sets Can Be Very Large: hundreds, thousands ... millions
- 8. Features Inject Noise: Model Training Without Reduction vs. With Feature Reduction
- 9. Features Impact Model Size: Model Without Reduction vs. With Reduction
- 10. Representation & Transfer Learning
- 11. Random Forests (Leo Breiman, 2001): an ensemble of decision tree models; each tree trains on a random subset of the data, and each split considers a random subset of the features
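The two randomizations on this slide can be sketched in plain Scala. This is an illustrative helper, not Spark's RandomForest API; the object and method names are my own:

```scala
import scala.util.Random

object RandomForestSampling {
  // Bootstrap: each tree trains on a random sample of the data,
  // drawn with replacement, the same size as the original set.
  def bootstrap[T](data: Vector[T], rng: Random): Vector[T] =
    Vector.fill(data.length)(data(rng.nextInt(data.length)))

  // At each split, only a random subset of feature indices is considered
  // (commonly about sqrt(m) of the m features for classification).
  def candidateFeatures(numFeatures: Int, rng: Random): Vector[Int] = {
    val k = math.max(1, math.sqrt(numFeatures).toInt)
    rng.shuffle((0 until numFeatures).toVector).take(k)
  }
}
```

Together these two sources of randomness decorrelate the trees, which is what makes averaging over the ensemble effective.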
- 12. Random Forest Clustering: Features (F1, F2, ..., Fm) → RF model → Leaf Node IDs (34, 12, ..., 73) → Cluster these!
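The leaf-ID embedding on this slide can be shown with a toy binary tree. This is a minimal sketch (my own ADT, not Spark's tree model): each point is routed down every tree in the ensemble, and the resulting vector of leaf IDs is what gets clustered:

```scala
// Toy decision tree: internal nodes test one feature against a threshold;
// leaves carry an integer ID.
sealed trait Tree
case class SplitNode(feature: Int, threshold: Double, left: Tree, right: Tree) extends Tree
case class Leaf(id: Int) extends Tree

object LeafEmbedding {
  // Route a feature vector down one tree to its leaf ID.
  def leafId(tree: Tree, fv: Vector[Double]): Int = tree match {
    case Leaf(id)              => id
    case SplitNode(f, t, l, r) => if (fv(f) <= t) leafId(l, fv) else leafId(r, fv)
  }

  // A forest maps each point to one leaf ID per tree; RF clustering
  // then clusters these ID vectors instead of the raw features.
  def embed(forest: Seq[Tree], fv: Vector[Double]): Seq[Int] =
    forest.map(t => leafId(t, fv))
}
```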
- 13. 2 Key Benefits of RF Clustering: the features used by the RF model are far fewer than the full feature set (<<), and RF training ignores unhelpful features
- 14. Data with a Joint Distribution in R^2
- 16. RF Rules for Data (non-synthetic): List((x2 <= -1.32), (x1 <= 0.87)); List((x1 > -1.37), (x2 > 1.03)); List((x2 <= 2.09), (x1 <= 0.87)); List((x1 <= 2.13), (x2 <= -1.32)); List((x2 <= -2.31), (x1 <= 0.87)) [plot highlights the region x1 > -1.37, x2 > 1.03]
- 17. RF Rules in Feature Space
- 18. What Features Did the RF Use? List((x2 <= -1.32), (x1 <= 0.87)); List((x1 > -1.37), (x2 > 1.03)); List((x2 <= 2.09), (x1 <= 0.87)); List((x1 <= 2.13), (x2 <= -1.32)); List((x2 <= -2.31), (x1 <= 0.87)) → reduced = {"x1", "x2"}
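Extracting the reduced feature set from rules like these is just a union over predicate feature names. A minimal sketch, with a rule represented as a list of (feature, operator, threshold) tuples (my own encoding, not the model's internal one):

```scala
object UsedFeatures {
  // A rule is a conjunction of simple predicates on named features,
  // e.g. List(("x2", "<=", -1.32), ("x1", "<=", 0.87)).
  type Predicate = (String, String, Double)
  type Rule = List[Predicate]

  // The reduced feature set is the set of feature names that appear
  // anywhere in the model's rules.
  def reducedFeatures(rules: Seq[Rule]): Set[String] =
    rules.flatten.map(_._1).toSet
}
```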
- 19. T-Digest Sketches a Distribution: data values (3.4, 6.0, 2.5, ...) → a compact sketch of the CDF P(X <= x) over the data domain
- 20. Inverse Transform Sampling: sample q ~ U[0,1], return the x where CDF(x) = q
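Inverse transform sampling can be sketched against a plain empirical CDF. This stand-in uses a sorted sample in place of a t-digest (names are illustrative):

```scala
import scala.util.Random

object InverseTransform {
  // Empirical inverse CDF: given sorted data, return the value at quantile q.
  def quantile(sorted: Vector[Double], q: Double): Double = {
    val idx = math.min(sorted.length - 1, math.max(0, (q * sorted.length).toInt))
    sorted(idx)
  }

  // Inverse transform sampling: draw q ~ U[0,1], return x with CDF(x) = q.
  // Sampling this way reproduces the sketched distribution.
  def sample(sorted: Vector[Double], rng: Random): Double =
    quantile(sorted, rng.nextDouble())
}
```

A t-digest does the same thing with a small fixed-size sketch instead of the full sorted sample.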
- 21. T-Digests Can Aggregate: map each Spark data partition P1, P2, ..., Pn to a t-digest, then merge the partial sketches with |+| into a single result
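What makes this work is that t-digests form a monoid: there is an empty sketch, a way to add a point, and a way to merge two sketches. A minimal stand-in (a count/sum "sketch", my own toy, not a t-digest) shows the per-partition map-then-merge pattern Spark relies on:

```scala
// Any structure with empty / add-a-point / merge-two-sketches forms a
// monoid and aggregates across partitions the same way a t-digest does.
case class CountSum(n: Long, sum: Double) {
  def +(x: Double): CountSum = CountSum(n + 1, sum + x)                    // add one sample
  def ++(that: CountSum): CountSum = CountSum(n + that.n, sum + that.sum)  // merge sketches
}
object CountSum { val empty: CountSum = CountSum(0, 0.0) }

object MonoidAggregate {
  // Simulate Spark's aggregate: sketch each partition independently,
  // then combine the partial sketches into one result.
  def aggregate(partitions: Seq[Seq[Double]]): CountSum =
    partitions.map(_.foldLeft(CountSum.empty)(_ + _)).reduce(_ ++ _)
}
```

Because merging is associative, the partitions can be sketched in parallel and combined in any order.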
- 22. Sketching a Feature feature.aggregate(TDigest.empty())( (td, x) => td + x, (td1, td2) => td1 ++ td2 )
- 23. Synthesizing Data from TDigests def synthesize(tdVec: Vector[TDigest], n: Int) = { val tdVecBC = sc.broadcast(tdVec) sc.parallelize(1 to n).map { _ => tdVecBC.value.map(_.sample) } }
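The Spark version above broadcasts the t-digest vector and samples on the executors. A pure-Scala mirror of the same idea (illustrative names; a sampler here is just a function, standing in for a t-digest's inverse-CDF sampler):

```scala
object Synthesize {
  // Given one sampler per feature, draw n synthetic rows by sampling each
  // feature independently. Independent per-feature sampling preserves the
  // marginals but destroys the joint structure of the real data, which is
  // exactly what lets the RF learn to tell real from synthetic.
  def synthesize(samplers: Vector[() => Double], n: Int): Vector[Vector[Double]] =
    Vector.fill(n)(samplers.map(s => s()))
}
```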
- 24. Random Forest Training Data val fvSketches = sketchFV(trainFV) val synthFV = synthesize(fvSketches, 48000) val trainLab = trainFV.map(_.toLabeledPoint(1.0)) val synthLab = synthFV.map(_.toLabeledPoint(0.0)) val trainFR = trainLab ++ synthLab
- 25. Random Forest Feature Reduction {“f1”, “f2”, … }
- 26. Tox21 Data Challenge: National Institutes of Health (2014); 12 Toxicity Assays; 12060 compounds + 647 hold-out; https://tripod.nih.gov/tox21/challenge/index.jsp
- 27. DeepTox: Johannes Kepler University Linz, Institute of Bioinformatics, http://bioinf.jku.at/research/DeepTox/tox21.html
  [Mayr2016] Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). DeepTox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science, 3:80.
  [Huang2016] Huang, R., Xia, M., Nguyen, D. T., Zhao, T., Sakamuru, S., Zhao, J., Shahane, S., Rossoshek, A., & Simeonov, A. (2016). Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85.
- 28. Tox21 Data: 801 Dense Features (I used these), 272K Sparse Features; each assay is represented on a different subset of compounds
  +---------------+------+-----+---------+------------+-----
  |       compound|NR.AhR|NR.AR|NR.AR.LBD|NR.Aromatase| ...
  +---------------+------+-----+---------+------------+-----
  |NCGC00261900-01|     0|    1|       NA|           0|
  |NCGC00260869-01|     0|    1|       NA|          NA|
  |NCGC00261776-01|     1|    1|        0|          NA|
  |NCGC00261380-01|    NA|    0|       NA|           1|
  |NCGC00261842-01|     0|    0|        0|          NA|
  |NCGC00261662-01|     1|    0|        0|          NA|
  |NCGC00261190-01|    NA|    0|        0|          NA|
- 29. Experiment: train models on all 12 assays; perform Random Forest Feature Reduction; train similar models on the reduced feature set; compare models on each assay
- 30. 85 of 801 Features Were Used (top features by number of trees using them)
  Feature      Trees used
  RNCS                 21
  MRVSA7               20
  VSAEstate2           19
  VSAEstate3           18
  slogPVSA8            18
  VSAEstate0           17
  slogPVSA6            16
  RDFM29               12
  slogPVSA3            12
  RDFM30               12
- 31. Full vs Reduced (Logistic Reg)
- 32. Full vs Reduced (Boosted DTE)
- 33. Full vs Reduced (SVM)
- 34. Training Times (seconds)
                       Full (801)   Reduced (85)
  Logistic Regression        68.5           46.8
  SVM                        35.3           33.8
  GB Tree Ensemble          247             65.0
- 35. Evaluation Times (seconds)
                       Full (801)   Reduced (85)
  Logistic Regression        32.1           3.88
  SVM                        0.59           0.23
  GB Tree Ensemble           1.33           0.88