
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson


Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications, ranging from visualization and data-encoding optimization to quantile estimation, data synthesis, and imputation. The T-Digest is a versatile sketching data structure: it operates on any numeric data, models tricky distribution tails with high fidelity, and, most crucially, works smoothly with aggregators and map-reduce.

T-Digest is a perfect fit for Apache Spark: it is single-pass, and intermediate results can be aggregated across partitions in batch jobs or across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimation, and data synthesis.

Attendees will leave with an understanding of data sketching with T-Digest and insights into how to apply it to their own data analysis applications.
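The partition-wise pattern the abstract describes, sketch each partition independently and then merge the sketches, can be illustrated with a drastically simplified toy digest: a capped vector of (location, mass) clusters. This is a hedged sketch only; `SketchAggregation`, `update`, and `merge` are hypothetical names, and the capping rule here is a crude nearest-cluster merge, not the real t-digest mass bound.

```scala
// Toy illustration of the sketch-per-partition / merge pattern.
// NOT the real t-digest algorithm: clusters are capped by a fixed count,
// not by the quantile-dependent mass bound.
object SketchAggregation {
  type Cluster = (Double, Long) // (location, mass)

  // Insert one value: start a new singleton cluster while under capacity,
  // otherwise fold the value into the nearest cluster (mass-weighted centroid).
  def update(clusters: Vector[Cluster], x: Double, cap: Int): Vector[Cluster] =
    if (clusters.size < cap) (clusters :+ (x, 1L)).sortBy(_._1)
    else {
      val i = clusters.indices.minBy(j => math.abs(clusters(j)._1 - x))
      val (c, m) = clusters(i)
      clusters.updated(i, ((c * m + x) / (m + 1), m + 1))
    }

  // Monoid-style combine: re-insert each of b's mass units into a.
  // Coarse, but it preserves total mass, which is the key invariant.
  def merge(a: Vector[Cluster], b: Vector[Cluster], cap: Int): Vector[Cluster] =
    b.foldLeft(a) { case (acc, (x, m)) =>
      (1L to m).foldLeft(acc)((d, _) => update(d, x, cap))
    }

  def main(args: Array[String]): Unit = {
    // Three "partitions" of raw data.
    val partitions = Seq(Seq(3.4, 5.0, 9.0), Seq(6.0, 2.1, 7.7), Seq(2.5, 4.4, 3.2))
    // Map: sketch each partition independently.
    val perPartition = partitions.map(
      _.foldLeft(Vector.empty[Cluster])((d, x) => update(d, x, 4)))
    // Reduce: merge the per-partition sketches.
    val combined = perPartition.reduce((a, b) => merge(a, b, 4))
    println(combined.map(_._2).sum) // total mass equals the 9 input points: 9
  }
}
```

In an actual Spark job the same shape maps onto `rdd.aggregate(zero)(seqOp, combOp)`: `update` plays the role of the per-partition `seqOp` and `merge` the cross-partition `combOp`.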


  1. Erik Erlandson, Red Hat, Inc. "Sketching Data With T-Digest in Apache Spark"
  2. Introduction: Erik Erlandson, Software Engineer at Red Hat, Inc., Emerging Technologies Group; Internal Data Science; Insightful Applications
  3. Why Sketching? Faster. Smaller. Essential Features.
  4. We All Sketch Data: raw samples (3.4, 5.0, 9.0; 6.0, 2.1, 7.7; 2.5, 4.4, 3.2; ...) summarized as Mean = 3.97, Variance = 3.30
  5. T-Digest: "Computing Extremely Accurate Quantiles Using t-Digests" by Ted Dunning & Otmar Ertl; https://github.com/tdunning/t-digest; implementations in Java, Python, R, JS, C++ and Scala
  6. What is T-Digest Sketching? Numeric data 3.4, 6.0, 2.5, ... is reduced to weighted clusters (3.4, 3), (6.0, 2), (2.5, 8), ..., or equivalently a sketch of the CDF P(X <= x) over the data domain
  7. Incremental Updates: Current T-Digest + (x, w) = Updated T-Digest; large or streaming data is maintained as a compact "running" sketch
  8. The Payoff: REST service query latencies. What does my latency distribution look like? I want to simulate my latencies! Are 90% of my latencies under 1 second?
  9. Representation: a distribution's CDF is represented by clusters of (location, mass) pairs (x, m)
  10. Update with (x, m): find the nearest cluster, update its location, and increment its mass
  11. Cluster Mass Bounds: with total mass M (sum of cluster masses) and compression C, a cluster at quantile q(x) may hold at most B(x) = C∙M∙q(x)∙(1-q(x)) mass; the bound peaks at C∙M/4 at the median and falls to zero at q=0 and q=1
  12. Bounds Force New Clusters: when merging (x, m) into the nearest cluster (xc, mc) would exceed the bound, i.e. mc + m > B(xc), the existing cluster is capped at (xu, B(xc)) and the excess mass (mc + m) - B(xc) starts a new cluster at x
  13. Resolution: the mass bound yields more small clusters near the tails (q near 0 or 1) and fewer large clusters near the median
  14. T-Digests are Monoidal: if digest D1 sketches cluster set C1 and D2 sketches C2, then D1 |+| D2 sketches C1 ∪ C2
  15. Monoidal => Map-Reduce: map each partition P1, P2, ..., Pn of the data in Spark to a t-digest, then reduce the t-digests with |+| to get the result
  16. |+| by Randomized Order: merge two digests by re-inserting the clusters of D1 and D2 in random order
  17. |+| by Merged Order: re-insert the clusters of D1 and D2 in merged (sorted) order
  18. |+| by Large to Small: re-insert the clusters of D1 and D2 from largest mass to smallest
  19. Comparing |+| Definitions
  20. Algorithmic Considerations:
      • Clusters maintained in sorted order by location
      • Clusters frequently inserted / deleted / updated
      • Query the cluster nearest to an incoming (x, m)
      • Given (x, m), query the prefix sum of cluster mass: the sum of m' over all clusters (x', m') where x' <= x
      • Do it all in logarithmic time!
  21. Backed By Balanced Tree
  22. Scala Considerations:
      • Immutable Red/Black Tree
      • Extends Map and MapLike
      • Capabilities are mixable traits: Red/Black, Ordered, Incrementable-Values, Nearest-Neighbor, Prefix-Sum
      • Interface to Algebird Monoids & Aggregators
  23. Discrete Distributions (experimental):
      if (tdigest.clusters.size <= max_discrete) {
        // increment by m (or insert a new cluster)
        tdigest.clusters.increment(x, m)
      } else {
        // do the full t-digest cluster updating algorithm
        tdigest.update(x, m)
      }
  24. Applications: Quantile Estimation, Feature Data Characterization, Building CoDecs, Value-at-Risk Modeling, Generative Data Models
  25. Demo
  26. Thank You: eje@redhat.com, @manyangled, https://github.com/isarn/isarn-sketches
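The cluster mass bound from slide 11 can be computed directly. A small sketch, assuming only the formula B = C∙M∙q∙(1-q) given on that slide (`MassBound` is a hypothetical name); it shows why clusters near the tails stay small while clusters near the median can grow, which is what gives t-digest its high tail fidelity.

```scala
// Slide 11's cluster mass bound: B(x) = C * M * q(x) * (1 - q(x)).
// C = compression parameter, M = total sketch mass, q in [0, 1].
object MassBound {
  def bound(compression: Double, totalMass: Double, q: Double): Double =
    compression * totalMass * q * (1 - q)

  def main(args: Array[String]): Unit = {
    val (c, m) = (0.1, 1000.0)
    // At the median the bound is maximal: C * M / 4.
    println(bound(c, m, 0.5))
    // Near the tail (q = 0.01) the bound is tiny, forcing small clusters.
    println(bound(c, m, 0.01))
  }
}
```

The quadratic q∙(1-q) factor is the resolution story of slide 13: many small clusters where quantile estimates are hardest, few large ones in the middle.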
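Quantile estimation, the first application on slide 24, amounts to inverting the sketched CDF of slides 6 and 9: walk the sorted (location, mass) clusters accumulating mass until the target fraction is reached. A simplified illustration (`QuantileSketch` is a hypothetical name; the real t-digest additionally interpolates between adjacent cluster centroids for smoother estimates):

```scala
// Estimate a quantile from a (location, mass) cluster sketch by
// inverting the empirical CDF: find the first cluster whose cumulative
// mass reaches q * totalMass.
object QuantileSketch {
  // clusters must be sorted ascending by location
  def quantile(clusters: Vector[(Double, Long)], q: Double): Double = {
    val total  = clusters.map(_._2).sum.toDouble
    val target = q * total
    // running cumulative mass after each cluster
    val cum = clusters.scanLeft(0.0) { case (c, (_, m)) => c + m }.tail
    clusters.zip(cum)
      .find { case (_, c) => c >= target }
      .map(_._1._1)
      .getOrElse(clusters.last._1)
  }

  def main(args: Array[String]): Unit = {
    // The example clusters from slide 6, sorted by location.
    val clusters = Vector((2.5, 8L), (3.4, 3L), (6.0, 2L))
    println(quantile(clusters, 0.5)) // median lands in the heavy first cluster: 2.5
    println(quantile(clusters, 0.9)) // 90th percentile lands in the last cluster: 6.0
  }
}
```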
