One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erlandson


The T-Digest has earned a reputation as a highly efficient and versatile sketching data structure; however, its applications as a fast generative model are less appreciated. Several common algorithms from machine learning use randomization of feature columns as a building block. Column randomization is an awkward and expensive operation when performed directly, but when implemented with generative T-Digests, it can be accomplished elegantly in a single pass that also parallelizes across Spark data partitions. In this talk Erik will review the principles of T-Digest sketching, and how T-Digests can be applied as generative models. He will explain how generative T-Digests can be used to implement fast randomization of columnar data, and conclude with demonstrations of T-Digest randomization applied to Variable Importance, Random Forest Clustering and Feature Reduction. Attendees will leave this talk with an understanding of T-Digest sketching, how T-Digests can be used as generative models, and insights into applying generative T-Digests to accelerate their own data science projects.
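The three sketch operations the abstract relies on (incremental update, merging across partitions, and generative sampling from the sketched CDF) can be illustrated with a small stand-in. This is a minimal sketch under stated assumptions, not the talk's implementation: `ToySketch` is a hypothetical class that keeps every value in a sorted list, whereas a real t-digest compresses the data into a bounded set of weighted centroids.

```python
import bisect
import math
import random
from functools import reduce


class ToySketch:
    """Hypothetical stand-in for a t-digest (assumption: stores all values,
    sorted; a real t-digest keeps only a compact set of centroids)."""

    def __init__(self, values=()):
        self.values = sorted(values)

    def update(self, x):
        # Incremental update: insert one value, keeping the list sorted.
        bisect.insort(self.values, x)

    def merge(self, other):
        # Aggregation: combine the sketches of two data partitions.
        return ToySketch(self.values + other.values)

    def cdf(self, x):
        # Empirical P(X <= x).
        return bisect.bisect_right(self.values, x) / len(self.values)

    def inverse_cdf(self, q):
        # Smallest stored value x with cdf(x) >= q, for q in [0, 1].
        n = len(self.values)
        idx = min(n - 1, max(0, math.ceil(q * n) - 1))
        return self.values[idx]

    def sample(self, rng):
        # Inverse transform sampling: draw q from U[0, 1), map it through
        # the inverse CDF.
        return self.inverse_cdf(rng.random())


# One sketch per partition, merged into a single sketch of the column.
partitions = [[2.3, 1.0], [3.1, 4.2, 6.9]]
sketches = [ToySketch(p) for p in partitions]
merged = reduce(lambda a, b: a.merge(b), sketches)

# A randomized version of the column: each value is drawn from the sketch,
# which approximates the column's marginal distribution.
rng = random.Random(42)
randomized = [merged.sample(rng) for _ in range(5)]
```

Because every draw depends only on the merged sketch, a randomized column can be produced row by row on each partition without shuffling the data.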

Published in: Data & Analytics

  1. Erik Erlandson, Red Hat: One-Pass Data Science in Apache Spark with Generative T-Digests #EUds11
  2. Landscape: Features & Feature Randomization; 3 Applications; T-Digests & Generative Sampling; 3 Applications: Reprise; Feature Importance; Demo
  3. Features: Measurable Properties! [Diagram: feature values 2.3 1.0 0.0 1.0 3.1 4.2 6.9 0.0 7.3; Model Training; Evaluation; prediction]
  4. Feature Randomization: Preserves Marginals, Destroys Joint
  5. Randomization Methods: Permutation, Selection
  6. Random Forests: Leo Breiman (2001); Ensemble of Decision Tree Models; Each tree trains on random subset of data; Each split considers random subset of features
  7. Random Forest Clustering
  8. Random Forest Clustering: Learn Real vs Fake!
  9. Random Forest Clustering: Features (F1, F2, ..., Fm); RF model; Leaf Node IDs (34, 12, ..., 73); Cluster these!
  10. Feature Reduction: {“f12”, “f37”, … }
  11. Feature Importance: measure change in accuracy; imp(1), imp(2), imp(3)
  12. What If Data Is Partitioned?
  13. T-Digest • Computing Extremely Accurate Quantiles Using t-Digests • Ted Dunning & Otmar Ertl • Implementations in Java, Python, R, JS, C++ and Scala • UDAFs packaged for Spark and PySpark
  14. What is T-Digest Sketching? Data (3.4, 6.0, 2.5, ⋮); Sketch of CDF, P(X <= x), over data domain X
  15. Incremental Updates: Current T-Digest + x = Updated T-Digest; Large or Streaming Data; Compact “Running” Sketch
  16. T-Digests Can Aggregate: Data in Spark (P1, P2, ..., Pn); Map; t-digests; |+|; result
  17. Inverse Transform Sampling (ITS): Sample q from U[0,1] => x where CDF(x) = q; Generative!
  18. Random Selection => ITS Selection: Generative Sampling!
  19. RF Clustering & Feature Reduction
  20. Feature Importance: measure change in accuracy; imp(1), imp(2), imp(3)
  21. Feature Importance: Feature Vector; Reference: 42
  22. Feature Importance: feature(j) = 4.5; t = 4.5; save feature(j)
  23. Feature Importance: feature(j) = 3.1; t = 4.5; sample sketch(j)
  24. Feature Importance: feature(j) = 3.1; 43
  25. Feature Importance: feature(j) = 3.1; 43; dev(j) += |43-42|; Running sum of deviations
  26. Feature Importance: feature(j) = 4.5; t = 4.5; restore feature(j); advance
  27. Sum of Dev ÷ N = Importance: Cumulative Deviations (dev 1, dev 2, ..., dev M) ÷ N = Importances (imp 1, imp 2, ..., imp M)
  28. Deviations Can Aggregate: Feature Data (P1, P2, ..., Pn); Map; Dev Sums; |+|; Aggregate Deviations ÷ N; Importances
  29. One-Pass Feature Importance: Linear in Samples and Features; Single Pass over the Feature Data; Parallel over Data Partitions
  30. Tox21 Data: National Institutes of Health (2014); 12 Toxicity Assays, 800 “dense” features; 12060 compounds + 647 hold-out; Johannes Kepler University Linz. [Mayr2016] Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). DeepTox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science, 3:80. [Huang2016] Huang, R., Xia, M., Nguyen, D. T., Zhao, T., Sakamuru, S., Zhao, J., Shahane, S., Rossoshek, A., & Simeonov, A. (2016). Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85.
  31. Demo
  32. Explore: Building ML Algorithms on Apache Spark; Sketching With T-Digests; Random Forest Feature Reduction; Random Forest Clustering for Spark; T-Digests and Feature Importance for Spark; Demo Notebook for This Talk
  33. Thank You! #EUds11 @manyangled
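
The feature-importance walkthrough in the slides (save feature j, replace it with a draw from that feature's sketch, compare the new prediction with the reference, restore, advance, then divide the summed deviations by N) can be sketched roughly as follows. Everything named here is an assumption for illustration: `make_sampler` stands in for sampling a per-feature t-digest by keeping the sorted column itself, `predict` is a hypothetical toy model, and the rows are toy values rather than data from the talk.

```python
import random


def make_sampler(column, rng):
    """Hypothetical stand-in for sampling a t-digest of one feature column
    (assumption: keeps the whole sorted column instead of a compact sketch)."""
    ordered = sorted(column)

    def sample():
        # Inverse transform sampling on the empirical CDF.
        q = rng.random()
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    return sample


def partition_deviations(rows, predict, samplers):
    """Per-feature sum of |perturbed prediction - reference| for one partition."""
    dev = [0.0] * len(samplers)
    for row in rows:
        row = list(row)
        reference = predict(row)
        for j, sample in enumerate(samplers):
            saved = row[j]                       # save feature(j)
            row[j] = sample()                    # sample sketch(j)
            dev[j] += abs(predict(row) - reference)
            row[j] = saved                       # restore feature(j), advance
    return dev


def importances(partitions, predict, samplers):
    # Deviation sums aggregate across partitions; dividing by N gives importances.
    n = sum(len(p) for p in partitions)
    per_partition = [partition_deviations(p, predict, samplers) for p in partitions]
    totals = [sum(col) for col in zip(*per_partition)]
    return [t / n for t in totals]


def predict(row):
    # Hypothetical toy model: only feature 0 affects the prediction.
    return 3.0 * row[0]


# Toy data split into two partitions.
partitions = [
    [(2.3, 1.0, 0.0), (1.0, 3.1, 4.2)],
    [(6.9, 0.0, 7.3)],
]
rng = random.Random(0)
columns = list(zip(*[row for part in partitions for row in part]))
samplers = [make_sampler(col, rng) for col in columns]
imp = importances(partitions, predict, samplers)
```

In this toy setup features 1 and 2 never change the prediction, so their importances come out as exactly zero, while feature 0 accumulates whatever deviation its resampled values produce. Each partition only needs the samplers and its own rows, which is what allows the deviation sums to be computed in parallel and combined afterwards.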