Erik Erlandson, Red Hat
One-Pass Data Science in Apache Spark with Generative T-Digests
#EUds11
The T-Digest has earned a reputation as a highly efficient and versatile sketching data structure; however, its applications as a fast generative model are less appreciated. Several common algorithms from machine learning use randomization of feature columns as a building block. Column randomization is an awkward and expensive operation when performed directly, but when implemented with generative T-Digests, it can be accomplished elegantly in a single pass that also parallelizes across Spark data partitions. In this talk Erik will review the principles of T-Digest sketching, and how T-Digests can be applied as generative models. He will explain how generative T-Digests can be used to implement fast randomization of columnar data, and conclude with demonstrations of T-Digest randomization applied to Variable Importance, Random Forest Clustering and Feature Reduction. Attendees will leave this talk with an understanding of T-Digest sketching, how T-Digests can be used as generative models, and insights into applying generative T-Digests to accelerate their own data science projects.
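The sketch-then-sample idea in the abstract can be illustrated with a small Python example. This is only a sketch under loud assumptions: `EmpiricalSketch` is a hypothetical stand-in that keeps every value in sorted order, whereas a real t-digest compresses the data into a bounded number of clusters. It is not the API of any t-digest library. It shows the three properties the talk relies on: incremental updates, merging of per-partition sketches, and inverse transform sampling from the sketched CDF.

```python
import bisect
import math
import random


class EmpiricalSketch:
    """Hypothetical stand-in for a t-digest: an exact, unbounded empirical CDF."""

    def __init__(self, values=()):
        self.values = sorted(values)

    def update(self, x):
        # Incremental update: current sketch + x = updated sketch.
        bisect.insort(self.values, x)

    def merge(self, other):
        # Aggregation: combine sketches built on separate data partitions.
        return EmpiricalSketch(self.values + other.values)

    def cdf(self, x):
        # Fraction of observed values <= x, an estimate of P(X <= x).
        return bisect.bisect_right(self.values, x) / len(self.values)

    def inverse_cdf(self, q):
        # Smallest observed x whose CDF is at least q, for q in [0, 1].
        n = len(self.values)
        idx = min(n - 1, max(0, math.ceil(q * n) - 1))
        return self.values[idx]

    def sample(self, rng):
        # Inverse transform sampling: draw q from U[0, 1), map it through the CDF.
        return self.inverse_cdf(rng.random())


# Hypothetical data split over three partitions.
partitions = [[2.5, 3.4, 6.0], [1.0, 4.2], [7.3, 0.0, 3.1]]

# "Map" each partition to a sketch, then merge the sketches pairwise.
merged = EmpiricalSketch()
for part in partitions:
    merged = merged.merge(EmpiricalSketch(part))

# Generate synthetic values that follow the column's marginal distribution.
rng = random.Random(11)
synthetic = [merged.sample(rng) for _ in range(5)]
```

Because each synthetic value is drawn independently of the other columns, sampling a column this way keeps its marginal distribution while discarding its relationship to the rest of the row, which is the property feature randomization needs.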


One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erlandson

  1. Erik Erlandson, Red Hat. One-Pass Data Science in Apache Spark with Generative T-Digests. #EUds11
  2. Landscape: Features & Feature Randomization; 3 Applications; T-Digests & Generative Sampling; 3 Applications: Reprise; Feature Importance; Demo
  3. Features: 2.3 1.0 0.0 1.0 3.1 4.2 6.9 0.0 7.3. Model: Training, Evaluation, prediction. Measurable Properties!
  4. Feature Randomization: Preserves Marginals, Destroys Joint
  5. Randomization Methods: Permutation, Selection
  6. Random Forests: Leo Breiman (2001). Ensemble of Decision Tree Models. Each tree trains on a random subset of the data. Each split considers a random subset of features.
  7. Random Forest Clustering
  8. Random Forest Clustering: Learn Real vs Fake!
  9. Random Forest Clustering: Features (F1, F2, ..., Fm); RF model; Leaf Node IDs (34, 12, ..., 73). Cluster these!
  10. Feature Reduction: {“f12”, “f37”, … }
  11. Feature Importance: measure change in accuracy; imp(1), imp(2), imp(3)
  12. What If Data Is Partitioned?
  13. T-Digest: “Computing Extremely Accurate Quantiles Using t-Digests”, Ted Dunning & Otmar Ertl. https://github.com/tdunning/t-digest. Implementations in Java, Python, R, JS, C++ and Scala. UDAFs packaged for Spark and PySpark.
  14. What is T-Digest Sketching? Data (3.4, 6.0, 2.5, ⋮); Sketch of CDF. Axes: X (Data Domain), P(X <= x).
  15. Incremental Updates: Current T-Digest + x = Updated T-Digest. Large or Streaming Data; Compact “Running” Sketch.
  16. T-Digests Can Aggregate: Data in Spark (P1, P2, Pn); Map; t-digests; |+|; result
  17. Inverse Transform Sampling (ITS): Sample U[0,1] => q; x where CDF(x) = q. Plot: CDF from 0 to 1, point (x,q). Generative!
  18. Random Selection => ITS: Selection; Generative Sampling!
  19. RF Clustering & Feature Reduction
  20. Feature Importance: measure change in accuracy; imp(1), imp(2), imp(3)
  21. Feature Importance: Reference Feature Vector; 42
  22. Feature Importance: j: 4.5; save feature(j): t = 4.5
  23. Feature Importance: j: 3.1; t = 4.5; sample sketch(j)
  24. Feature Importance: j: 3.1; 43
  25. Feature Importance: j: 3.1; 43; dev(j) += |43-42|. Running sum of deviations.
  26. Feature Importance: j: 4.5; t = 4.5; restore feature(j); advance
  27. Sum of Dev ÷ N = Importance: Cumulative Deviations (dev 1, dev 2, ..., dev M) ÷ N; Importances (imp 1, imp 2, ..., imp M)
  28. Deviations can Aggregate: Feature Data (P1, P2, Pn); Map; Dev Sums; |+|; Aggregate Deviations; ÷ N; Importances
  29. One-Pass Feature Importance: Linear in Samples and Features. Single Pass over the Feature Data. Parallel over Data Partitions.
  30. Tox21 Data: National Institutes of Health (2014). 12 Toxicity Assays, 800 “dense” features. 12060 compounds + 647 hold-out. https://tripod.nih.gov/tox21/challenge/index.jsp. Johannes Kepler University Linz: http://bioinf.jku.at/research/DeepTox/tox21.html. [Mayr2016] Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). DeepTox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science, 3:80. [Huang2016] Huang, R., Xia, M., Nguyen, D. T., Zhao, T., Sakamuru, S., Zhao, J., Shahane, S., Rossoshek, A., & Simeonov, A. (2016). Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85.
  31. Demo
  32. Explore: Building ML Algorithms on Apache Spark; Sketching With T-Digests; Random Forest Feature Reduction; Random Forest Clustering for Spark; T-Digests and Feature Importance for Spark; Demo Notebook for This Talk
  33. Thank You! #EUds11, eje@redhat.com, @manyangled, https://github.com/isarn/isarn-sketches-spark
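
The feature importance walk-through on slides 21 to 29 can be sketched in Python as follows. This is one reading of the slides, not the talk's actual implementation: it treats the numbers 42 and 43 as model predictions, and every name here (`partition_deviations`, `feature_importance`, `predict`, the toy data) is hypothetical. The samplers draw from pooled column values as a stand-in for sampling from per-feature t-digests.

```python
import random


def partition_deviations(rows, predict, samplers, rng):
    """Per-feature sums of |perturbed prediction - reference prediction|."""
    dev = [0.0] * len(samplers)
    for row in rows:
        x = list(row)                 # copy, so the caller's data is untouched
        reference = predict(x)        # prediction on the unmodified vector
        for j, sampler in enumerate(samplers):
            saved = x[j]              # save feature(j)
            x[j] = sampler(rng)       # replace it with a draw from sketch(j)
            dev[j] += abs(predict(x) - reference)
            x[j] = saved              # restore feature(j) and advance
    return dev


def feature_importance(partitions, predict, samplers, seed=0):
    """Aggregate per-partition deviation sums, then divide by the row count N."""
    rng = random.Random(seed)
    sums = [partition_deviations(p, predict, samplers, rng) for p in partitions]
    total = [sum(col) for col in zip(*sums)]     # element-wise |+| of dev sums
    n = sum(len(p) for p in partitions)
    return [d / n for d in total]


# Hypothetical toy data: two partitions of two-feature rows.
partitions = [[[1.0, 5.0], [2.0, 6.0]], [[3.0, 7.0], [4.0, 8.0]]]

# Stand-in samplers: draw from each pooled column instead of a t-digest.
columns = list(zip(*[row for p in partitions for row in p]))
samplers = [lambda rng, col=col: rng.choice(col) for col in columns]


def predict(x):
    # Hypothetical model that uses feature 0 and ignores feature 1.
    return 3.0 * x[0]


importances = feature_importance(partitions, predict, samplers)
```

Each row is visited once, each partition produces only a vector of deviation sums, and those vectors combine by element-wise addition, which is why the computation parallelizes over partitions. In this toy setup the ignored feature receives an importance of exactly zero, since randomizing it never changes the prediction.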

