Smart Scalable Feature Reduction with Random Forests with Erik Erlandson

Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.

In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data having many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data.

Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.


  1. Smart Scalable Feature Reduction with Random Forests
     Erik Erlandson, Red Hat, Inc.
  2. Erik Erlandson
     • Software Engineer
     • Radanalytics.io community
     • Apache Spark on OpenShift
     • Intelligent Applications in the cloud
  3. Talk
     • Motivate Feature Reduction
     • Random Forest Clustering
     • T-Digest Feature Sketching
     • RF Feature Reduction
     • Example: Tox21 Assay Data
  4. Features
     Measurable properties! (diagram: a feature vector such as [2.3, 1.0, 0.0, 1.0, 3.1, 4.2, 6.9, 0.0, 7.3] feeds model training and evaluation to produce results)
  5. Feature Reduction
     Full Feature Set → Identify Useful Features → Reduced Feature Set
  6. Feature Sets Can Be Very Large
     hundreds, thousands, ... millions
  7. Features Cost Resources
     Memory, Disk, Network, Time, Energy
  8. Features Inject Noise
     (plot: model training without reduction vs. with feature reduction)
  9. Features Impact Model Size
     (plot: model size without reduction vs. with reduction)
  10. Representation & Transfer Learning
  11. Random Forests
      Leo Breiman (2001)
      Ensemble of decision tree models
      Each tree trains on a random subset of the data
      Each split considers a random subset of the features
  12. Random Forest Clustering
      (diagram: feature vector [F1, F2, ..., Fm] → RF model → leaf node IDs [34, 12, ..., 73] → cluster these leaf IDs; see the sketch below)
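
      A minimal sketch of the leaf-ID step this diagram shows, assuming the forest is Spark's RDD-based MLlib RandomForestModel; the helper names leafId and leafVector are illustrative, not from the talk or from MLlib:

      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.tree.configuration.FeatureType
      import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

      // Walk one decision tree from its root down to the leaf that point x falls into.
      def leafId(node: Node, x: Vector): Int =
        if (node.isLeaf) node.id
        else {
          val s = node.split.get
          val goLeft =
            if (s.featureType == FeatureType.Continuous) x(s.feature) <= s.threshold
            else s.categories.contains(x(s.feature))
          leafId(if (goLeft) node.leftNode.get else node.rightNode.get, x)
        }

      // One leaf coordinate per tree in the forest; these ID vectors are what get clustered.
      def leafVector(rf: RandomForestModel, x: Vector): Array[Int] =
        rf.trees.map(t => leafId(t.topNode, x))
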
  13. 2 Key Benefits of RF Clustering
      • Features used by the RF model << full feature set
      • RF training ignores unhelpful features
  14. Data with a Joint Distribution in R^2
      (scatter plot of the example data)
  15. Data with Synthetic Data
      (scatter plot: the original data overlaid with synthetic samples)
  16. RF Rules for Data (non-synthetic)
      List((x2 <= -1.32), (x1 <= 0.87))
      List((x1 > -1.37), (x2 > 1.03))
      List((x2 <= 2.09), (x1 <= 0.87))
      List((x1 <= 2.13), (x2 <= -1.32))
      List((x2 <= -2.31), (x1 <= 0.87))
      (plot highlighting the rule x1 > -1.37, x2 > 1.03 in the data)
  17. RF Rules in Feature Space
      (plot of the rule boundaries in feature space)
  18. What Features Did the RF Use?
      List((x2 <= -1.32), (x1 <= 0.87))
      List((x1 > -1.37), (x2 > 1.03))
      List((x2 <= 2.09), (x1 <= 0.87))
      List((x1 <= 2.13), (x2 <= -1.32))
      List((x2 <= -2.31), (x1 <= 0.87))
      reduced = {“x1”, “x2”}
      (see the sketch below)
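
      One way to read the reduced feature set off the trained forest, again assuming the RDD-based MLlib model classes; splitFeatures and usedFeatures are illustrative names:

      import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

      // Every feature index that appears in a split anywhere in this subtree.
      def splitFeatures(node: Node): Set[Int] =
        if (node.isLeaf) Set.empty[Int]
        else Set(node.split.get.feature) ++
             splitFeatures(node.leftNode.get) ++
             splitFeatures(node.rightNode.get)

      // The reduced feature set: the union of split features over all trees in the forest.
      def usedFeatures(rf: RandomForestModel): Set[Int] =
        rf.trees.map(t => splitFeatures(t.topNode)).reduce(_ ++ _)
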
  19. T-Digest Sketches a Distribution
      (diagram: a stream of values 3.4, 6.0, 2.5, ... is compressed into a sketch of the CDF P(X <= x) over the data domain)
  20. Inverse Transform Sampling
      Sample q ~ U[0,1], then take x where CDF(x) = q
      (plot: CDF ranging from 0 to 1; see the sketch below)
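
      A minimal sketch of inverse transform sampling; cdfInverse here stands for the sketch's quantile function (an assumed name, not necessarily the t-digest library's API):

      import scala.util.Random

      // Draw q uniformly from [0, 1], then return the x with CDF(x) = q.
      def sampleFrom(cdfInverse: Double => Double, rng: Random = new Random): Double =
        cdfInverse(rng.nextDouble())
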
  21. T-Digests Can Aggregate
      (diagram: Spark data partitions P1, P2, ..., Pn are each mapped to a t-digest, then combined with |+| into a single result)
  22. Sketching a Feature
      feature.aggregate(TDigest.empty())(
        (td, x) => td + x,        // fold each value into the partition's digest
        (td1, td2) => td1 ++ td2  // merge digests across partitions
      )
  23. Synthesizing Data from TDigests
      def synthesize(tdVec: Vector[TDigest], n: Int) = {
        // broadcast the per-feature digests to the executors
        val tdVecBC = sc.broadcast(tdVec)
        // draw n synthetic rows, sampling each feature from its own digest
        sc.parallelize(1 to n).map { _ =>
          tdVecBC.value.map(_.sample)
        }
      }
  24. Random Forest Training Data
      val fvSketches = sketchFV(trainFV)
      val synthFV = synthesize(fvSketches, 48000)
      // real rows labeled 1.0, synthetic rows labeled 0.0
      val trainLab = trainFV.map(_.toLabeledPoint(1.0))
      val synthLab = synthFV.map(_.toLabeledPoint(0.0))
      val trainFR = trainLab ++ synthLab
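
      The code above calls sketchFV, which the deck does not show. A minimal sketch under the assumption that trainFV is an RDD of fixed-length Vector[Double] rows, with TDigest as on slide 22, reusing the same aggregate pattern:

      import org.apache.spark.rdd.RDD

      // One t-digest per feature index, filled column-wise across the whole RDD.
      def sketchFV(fv: RDD[Vector[Double]]): Vector[TDigest] = {
        val dim = fv.first.length   // number of features per row
        fv.aggregate(Vector.fill(dim)(TDigest.empty()))(
          (tds, xs) => tds.zip(xs).map { case (td, x) => td + x },     // fold a row in
          (a, b)    => a.zip(b).map { case (td1, td2) => td1 ++ td2 }  // merge partition results
        )
      }
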
  25. Random Forest Feature Reduction
      Train an RF to separate the real data (label 1.0) from the synthetic data (label 0.0), then keep only the features its trees actually split on: {“f1”, “f2”, … } (see the sketch below)
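
      Putting the pieces together: a sketch of the reduction step, where trainFR is the real-vs-synthetic data from slide 24 and usedFeatures is the helper sketched under slide 18; the tree count, depth, and other parameters are illustrative values, not from the talk:

      import org.apache.spark.mllib.tree.RandomForest

      val rfModel = RandomForest.trainClassifier(
        trainFR,                                       // 1.0 = real data, 0.0 = synthetic
        numClasses = 2,
        categoricalFeaturesInfo = Map.empty[Int, Int],
        numTrees = 20,
        featureSubsetStrategy = "auto",
        impurity = "gini",
        maxDepth = 5,
        maxBins = 32)

      // Keep only the features the forest actually split on.
      val reduced = usedFeatures(rfModel)
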
  26. Tox21 Data Challenge
      National Institutes of Health (2014)
      12 toxicity assays
      12060 compounds + 647 hold-out
      https://tripod.nih.gov/tox21/challenge/index.jsp
  27. DeepTox
      Johannes Kepler University Linz, Institute of Bioinformatics
      http://bioinf.jku.at/research/DeepTox/tox21.html
      [Mayr2016] Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). DeepTox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science, 3:80.
      [Huang2016] Huang, R., Xia, M., Nguyen, D. T., Zhao, T., Sakamuru, S., Zhao, J., Shahane, S., Rossoshek, A., & Simeonov, A. (2016). Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85.
  28. Tox21 Data
      801 dense features (these are the ones I used)
      272K sparse features
      Each assay is represented on a different subset of compounds:
      +---------------+------+-----+---------+------------+-----
      |       compound|NR.AhR|NR.AR|NR.AR.LBD|NR.Aromatase| ...
      +---------------+------+-----+---------+------------+-----
      |NCGC00261900-01|     0|    1|       NA|           0|
      |NCGC00260869-01|     0|    1|       NA|          NA|
      |NCGC00261776-01|     1|    1|        0|          NA|
      |NCGC00261380-01|    NA|    0|       NA|           1|
      |NCGC00261842-01|     0|    0|        0|          NA|
      |NCGC00261662-01|     1|    0|        0|          NA|
      |NCGC00261190-01|    NA|    0|        0|          NA|
  29. Experiment
      • Train models on all 12 assays
      • Perform Random Forest Feature Reduction
      • Train similar models on the reduced feature set
      • Compare models on each assay
  30. 85 of 801 Features Were Used
      Feature       Number of trees using it
      RNCS          21
      MRVSA7        20
      VSAEstate2    19
      VSAEstate3    18
      slogPVSA8     18
      VSAEstate0    17
      slogPVSA6     16
      RDFM29        12
      slogPVSA3     12
      RDFM30        12
      (the sketch below shows one way to compute these counts)
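
      A tally like the one above can be produced by counting, per feature, how many trees split on it; this sketch reuses the splitFeatures helper from the note under slide 18:

      import org.apache.spark.mllib.tree.model.RandomForestModel

      // For each feature index, the number of trees that use it in at least one split.
      def treeUseCounts(rf: RandomForestModel): Map[Int, Int] =
        rf.trees
          .flatMap(t => splitFeatures(t.topNode).toSeq)  // distinct features per tree
          .groupBy(identity)
          .map { case (f, uses) => (f, uses.length) }
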
  31. Full vs Reduced (Logistic Reg)
  32. Full vs Reduced (Boosted DTE)
  33. Full vs Reduced (SVM)
  34. Training Times
      (times in seconds)      Full (801)   Reduced (85)
      Logistic Regression     68.5         46.8
      SVM                     35.3         33.8
      GB Tree Ensemble        247          65.0
  35. Evaluation Times
      (times in seconds)      Full (801)   Reduced (85)
      Logistic Regression     32.1         3.88
      SVM                     0.59         0.23
      GB Tree Ensemble        1.33         0.88
  36. Thank You
      Erik Erlandson
      eje@redhat.com
      @manyangled
      https://github.com/erikerlandson/feature-reduction-talk