Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

  1. Erik Erlandson Sketching Data With T-Digest in Apache Spark Red Hat, Inc.
  2. Introduction Erik Erlandson Software Engineer at Red Hat, Inc. Emerging Technologies Group Internal Data Science Insightful Applications
  3. Why Sketching? ● Faster ● Smaller ● Essential Features
  4. We All Sketch Data 3.4 6.0 2.5 ⋮ Mean = 3.97 Variance = 3.30 3.4, 5.0, 9.0 6.0, 2.1, 7.7 2.5, 4.4, 3.2 ⋮
  5. T-Digest • Computing Extremely Accurate Quantiles Using t-Digests • Ted Dunning & Otmar Ertl • https://github.com/tdunning/t-digest • Implementations in Java, Python, R, JS, C++ and Scala
  6. What is T-Digest Sketching? 3.4 6.0 2.5 ⋮ (3.4, 3) (6.0, 2) (2.5, 8) ⋮ or Sketch of CDF P(X <= x) X Data Domain
  7. Incremental Updates Current T-Digest + (x, w) = Updated T-Digest Large or Streaming Data Compact “Running” Sketch
  8. The Payoff REST Service Query Latencies What does my latency distribution look like? I want to simulate my latencies! Are 90% of my latencies under 1 second?
  9. Representation clusters Distribution CDF (location, mass) (x, m)
  10. Update (x, m) Nearest Cluster Update location Increment Mass
  11. Cluster Mass Bounds: B(x) = C∙M∙q(x)∙(1-q(x)), where q(x) is the quantile of x (from q=0 to q=1), M = Σ(masses), C = compression; the bound peaks at C∙M/4
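  12. Bounds Force New Clusters: incoming (x,m) meets nearest cluster (xc, mc): does mc + m fit? When mc + m > B(xc), the cluster fills to its bound, (xc, mc) → (xu, B(xc)), and the remainder becomes a new cluster (x, (mc + m) - B(xc))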
  13. Resolution q=0 q=1 More small clusters Fewer Large Clusters
  14. T-Digests are Monoidal: D1 ≡ C1, D2 ≡ C2 ⟹ D1 |+| D2 ≡ C1 ∪ C2
  15. Monoidal => Map-Reduce P1 P2 Pn |+| Data in Spark t-digests result Map
  16. |+| - Randomized Order (diagram: numbered clusters from D1 and D2 combined into D1 |+| D2)
  17. |+| - Merged Order (diagram: numbered clusters from D1 and D2 combined into D1 |+| D2)
  18. |+| - Large to Small (diagram: numbered clusters from D1 and D2 combined into D1 |+| D2)
  19. Comparing |+| Definitions
  20. Algorithmic Considerations • Clusters maintained in sorted order by location • Clusters frequently inserted / deleted / updated • Query the cluster nearest to an incoming (x,m) • Given (x,m), query the prefix-sum of cluster mass: Σ(m’) over all clusters (x’,m’) where x’ <= x • Do it all in logarithmic time!
  21. Backed By Balanced Tree
  22. Scala Considerations • Immutable Red/Black Tree • Extends Map and MapLike • Capabilities are Mixable Traits – Red/Black – Ordered – Incrementable-Values – Nearest-Neighbor – Prefix-Sum • Interface to Algebird Monoids & Aggregators
  23. Discrete Distributions (Experimental)
      if (tdigest.clusters.size <= max_discrete) {
        // increment by m (or insert new)
        tdigest.clusters.increment(x, m)
      } else {
        // do full t-digest cluster updating algorithm
        tdigest.update(x, m)
      }
  24. Applications • Quantile Estimation • Feature Data Characterization • Building CoDecs • Value-At-Risk Modeling • Generative Data Models
  25. Demo
  26. Thank You eje@redhat.com @manyangled https://github.com/isarn/isarn-sketches
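
The update rule and mass bound from slides 10 to 12, and a large-to-small |+| as on slide 18, can be illustrated with a toy Scala sketch. This is not the isarn-sketches API: `SimpleTDigest` and every member name are hypothetical, the quantile of a cluster is taken at its midpoint as an assumption, M is taken to include the incoming mass, and lookups are linear scans rather than the logarithmic-time balanced tree described on slides 20 and 21.

```scala
import scala.collection.immutable.TreeMap

// Illustrative only: a toy t-digest over a sorted map of location -> mass.
// SimpleTDigest and its members are hypothetical names, not a library API.
final case class SimpleTDigest(
    compression: Double,
    clusters: TreeMap[Double, Double] = TreeMap.empty[Double, Double]) {

  /** Total mass M over all clusters. */
  def totalMass: Double = clusters.valuesIterator.sum

  /** Step-function CDF estimate: fraction of mass in clusters at or below x. */
  def cdf(x: Double): Double =
    if (clusters.isEmpty) 0.0
    else clusters.iterator.takeWhile(_._1 <= x).map(_._2).sum / totalMass

  /** Mass bound B = C * M * q * (1 - q) for a cluster at quantile q. */
  private def bound(q: Double, total: Double): Double =
    compression * total * q * (1.0 - q)

  /** Add mass at a location, merging with a cluster already at that location. */
  private def add(cs: TreeMap[Double, Double], x: Double, m: Double): TreeMap[Double, Double] =
    cs.updated(x, cs.getOrElse(x, 0.0) + m)

  /** Insert a point x with mass m. */
  def update(x: Double, m: Double = 1.0): SimpleTDigest =
    if (clusters.isEmpty) copy(clusters = clusters.updated(x, m))
    else {
      // Nearest cluster (xc, mc); a linear scan here, a tree query in practice.
      val (xc, mc) = clusters.minBy { case (loc, _) => math.abs(loc - x) }
      val newTotal = totalMass + m
      // Quantile of the nearest cluster at its midpoint (assumption of this sketch).
      val below = clusters.iterator.takeWhile(_._1 < xc).map(_._2).sum
      val q = (below + mc / 2.0) / newTotal
      // Mass the nearest cluster can still absorb before exceeding its bound.
      val absorbed = math.min(m, math.max(0.0, bound(q, newTotal) - mc))
      val remainder = m - absorbed
      val afterAbsorb =
        if (absorbed > 0.0) {
          // Move the cluster to the mass-weighted mean location.
          val xu = (xc * mc + x * absorbed) / (mc + absorbed)
          add(clusters - xc, xu, mc + absorbed)
        } else clusters
      // Whatever did not fit under the bound becomes (or joins) a cluster at x.
      val afterNew = if (remainder > 0.0) add(afterAbsorb, x, remainder) else afterAbsorb
      copy(clusters = afterNew)
    }

  /** One possible |+|: re-insert all clusters of both digests, largest mass first. */
  def combine(that: SimpleTDigest): SimpleTDigest = {
    val all = (clusters.toSeq ++ that.clusters.toSeq).sortBy { case (_, m) => -m }
    all.foldLeft(SimpleTDigest(compression)) { case (td, (x, m)) => td.update(x, m) }
  }
}

object TDigestDemo {
  def main(args: Array[String]): Unit = {
    val data = Seq(3.4, 6.0, 2.5, 5.0, 9.0, 2.1, 7.7, 4.4, 3.2)
    // The compression value 0.5 is an arbitrary choice for this example.
    val td = data.foldLeft(SimpleTDigest(compression = 0.5))((t, x) => t.update(x))
    println(s"total mass = ${td.totalMass}, clusters = ${td.clusters.size}")
    println(s"estimated P(X <= 5.0) = ${td.cdf(5.0)}")
  }
}
```

Because `update` splits each incoming mass into an absorbed part and a remainder, total mass is conserved, so `combine` yields a digest whose mass is the sum of its inputs. With so few points and a small compression value almost every point stays in its own cluster; compression only becomes visible as the total mass grows.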