
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications, including visualization, optimizing data encodings, estimating quantiles, data synthesis, and imputation. The T-Digest is a versatile sketching data structure. It operates on any numeric data, models tricky distribution tails with high fidelity, and, most crucially, works smoothly with aggregators and map-reduce.

T-Digest is a natural fit for Apache Spark: it is single-pass, and intermediate results can be aggregated across partitions in batch jobs or across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimation, and data synthesis.
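The partition-then-merge pattern described above can be illustrated with a minimal Python sketch. It is not the talk's Scala implementation and not a real t-digest: plain lists stand in for Spark partitions, and the hypothetical helpers `compress`, `sketch_partition`, and `combine` use a simplified "fuse the closest adjacent clusters" rule as a stand-in for the t-digest mass bound. It only shows that each partition can be sketched in one pass and that the per-partition sketches can then be reduced with a merge operator.

```python
from functools import reduce

def compress(clusters, max_clusters):
    """Fuse adjacent (location, mass) clusters until at most max_clusters remain.

    Assumption: a simplified stand-in for t-digest's cluster mass bounds.
    """
    clusters = sorted(clusters)
    while len(clusters) > max_clusters:
        # Pick the adjacent pair whose locations are closest together.
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1][0] - clusters[j][0])
        (x1, m1), (x2, m2) = clusters[i], clusters[i + 1]
        m = m1 + m2
        # Replace the pair with one cluster at their mass-weighted mean.
        clusters[i:i + 2] = [((x1 * m1 + x2 * m2) / m, m)]
    return clusters

def sketch_partition(values, max_clusters=4):
    """Single pass over one partition: add each value as a unit-mass cluster."""
    clusters = []
    for x in values:
        clusters = compress(clusters + [(float(x), 1.0)], max_clusters)
    return clusters

def combine(a, b, max_clusters=4):
    """Merge two sketches; plays the role of a monoid 'plus' on sketches."""
    return compress(a + b, max_clusters)

# Plain lists standing in for the partitions of a distributed data set.
partitions = [[3.4, 6.0, 2.5, 5.0], [9.0, 2.1, 7.7], [4.4, 3.2]]
sketches = [sketch_partition(p) for p in partitions]   # "map" step
total = reduce(combine, sketches)                      # "reduce" step

total_mass = sum(m for _, m in total)
weighted_mean = sum(x * m for x, m in total) / total_mass
print(len(total), total_mass, weighted_mean)
```

Because every fusion keeps the total mass and the mass-weighted mean, the merged sketch carries the same count and mean as the raw data while holding only a few clusters.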

Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.

Published in: Data & Analytics


  1. 1. Erik Erlandson Sketching Data With T-Digest in Apache Spark Red Hat, Inc.
  2. 2. Introduction Erik Erlandson Software Engineer at Red Hat, Inc. Emerging Technologies Group Internal Data Science Insightful Applications
  3. 3. Why Sketching? ● Faster ● Smaller ● Essential Features
  4. 4. We All Sketch Data 3.4 6.0 2.5 ⋮ Mean = 3.97 Variance = 3.30 3.4, 5.0, 9.0 6.0, 2.1, 7.7 2.5, 4.4, 3.2 ⋮
  5. 5. T-Digest • Computing Extremely Accurate Quantiles Using t-Digests • Ted Dunning & Otmar Ertl • https://github.com/tdunning/t-digest • Implementations in Java, Python, R, JS, C++ and Scala
  6. 6. What is T-Digest Sketching? 3.4 6.0 2.5 ⋮ (3.4, 3) (6.0, 2) (2.5, 8) ⋮ or Sketch of CDF P(X <= x) X Data Domain
  7. 7. Incremental Updates Current T-Digest + (x, w) = Updated T-Digest Large or Streaming Data Compact “Running” Sketch
  8. 8. The Payoff REST Service Query Latencies What does my latency distribution look like? I want to simulate my latencies! Are 90% of my latencies under 1 second?
  9. 9. Representation clusters Distribution CDF (location, mass) (x, m)
  10. 10. Update (x, m) Nearest Cluster Update location Increment Mass
  11. 11. Cluster Mass Bounds q=0 q=1 C∙M/4 Quantiles q(x) M = Σ(masses) B(x) = C∙M∙q(x)∙(1-q(x)) C = compression
  12. 12. Bounds Force New Clusters (x,m) mc + m? (xc ,mc ) mc + m > B(xc )! (xc ,mc ) (xu ,B(xc )) (x, (mc + m)-B(xc )) (x,m)
  13. 13. Resolution q=0 q=1 More small clusters Fewer Large Clusters
  14. 14. T-Digests are Monoidal D1 ≡ C1, D2 ≡ C2 ⟹ D1 |+| D2 ≡ C1 ∪ C2
  15. 15. Monoidal => Map-Reduce P1 P2 Pn |+| Data in Spark t-digests result Map
  16. 16. |+| - Randomized Order (figure: clusters of D1 and D2 combined into D1 |+| D2)
  17. 17. |+| - Merged Order (figure: clusters of D1 and D2 combined into D1 |+| D2)
  18. 18. |+| - Large to Small (figure: clusters of D1 and D2 combined into D1 |+| D2)
  19. 19. Comparing |+| Definitions
  20. 20. Algorithmic Considerations • Clusters maintained in sorted order by location • Clusters frequently inserted / deleted / updated • Query the cluster nearest to an incoming (x,m) • Given (x,m), query the prefix-sum of cluster mass: Σ(m’) over all clusters (x’,m’) where x’ <= x • Do it all in logarithmic time!
  21. 21. Backed By Balanced Tree
  22. 22. Scala Considerations • Immutable Red/Black Tree • Extends Map and MapLike • Capabilities are Mixable Traits – Red/Black – Ordered – Incrementable-Values – Nearest-Neighbor – Prefix-Sum • Interface to Algebird Monoids & Aggregators
  23. 23. Discrete Distributions (Experimental) if (tdigest.clusters.size <= max_discrete) { /* increment by m (or insert new) */ tdigest.clusters.increment(x, m) } else { /* do full t-digest cluster updating algorithm */ tdigest.update(x, m) }
  24. 24. Applications • Quantile Estimation • Feature Data Characterization • Building CoDecs • Value-At-Risk Modeling • Generative Data Models
  25. 25. Demo
  26. 26. Thank You eje@redhat.com @manyangled https://github.com/isarn/isarn-sketches
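
The update rule on slides 9 to 12 and the CDF query behind slide 8 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the isarn-sketches API: the class `TinyDigest` and its methods are hypothetical names, a sorted Python list with linear-time prefix sums replaces the balanced tree from slides 20 to 22, and the handling of a full cluster (absorb up to the bound B(xc), then open a new cluster at x for the remainder) is one reading of slide 12.

```python
import bisect
import random

class TinyDigest:
    """Hypothetical minimal digest: sorted cluster locations with masses."""

    def __init__(self, compression=0.5):
        self.c = compression   # C on slide 11
        self.xs = []           # cluster locations, kept sorted
        self.ms = []           # cluster masses, parallel to xs
        self.total = 0.0       # M = sum of all masses

    def update(self, x, m=1.0):
        self.total += m
        j = bisect.bisect_left(self.xs, x)   # xs[j-1] < x <= xs[j]
        d = 0.0                              # mass absorbed by the nearest cluster
        if self.xs:
            near = [k for k in (j - 1, j) if 0 <= k < len(self.xs)]
            i = min(near, key=lambda k: abs(self.xs[k] - x))
            # q(x): mass to the left plus half the cluster, over M (linear scan).
            q = (sum(self.ms[:i]) + self.ms[i] / 2.0) / self.total
            bound = self.c * self.total * q * (1.0 - q)   # B(x) = C*M*q*(1-q)
            d = min(m, max(bound - self.ms[i], 0.0))
            if d > 0.0:
                # Move the location to the mass-weighted mean, add the mass.
                new_mass = self.ms[i] + d
                self.xs[i] = (self.xs[i] * self.ms[i] + x * d) / new_mass
                self.ms[i] = new_mass
        if m - d > 0.0:
            # The bound forces a new cluster at x holding the remainder.
            k = bisect.bisect_left(self.xs, x)
            self.xs.insert(k, x)
            self.ms.insert(k, m - d)

    def cdf(self, x):
        """Estimate P(X <= x) by interpolating between cluster centers."""
        if not self.xs or x < self.xs[0]:
            return 0.0
        if x >= self.xs[-1]:
            return 1.0
        j = bisect.bisect_right(self.xs, x)   # xs[j-1] <= x < xs[j]
        left = sum(self.ms[:j - 1])
        c0 = left + self.ms[j - 1] / 2.0
        c1 = left + self.ms[j - 1] + self.ms[j] / 2.0
        t = (x - self.xs[j - 1]) / (self.xs[j] - self.xs[j - 1])
        return (c0 + t * (c1 - c0)) / self.total

# Simulated latencies in seconds (made-up data for illustration only).
random.seed(0)
latencies = [random.expovariate(3.0) for _ in range(2000)]
digest = TinyDigest()
for value in latencies:
    digest.update(value)

exact = sum(1 for v in latencies if v <= 1.0) / len(latencies)
print(len(digest.xs), digest.cdf(1.0), exact)
```

With this sketch, the slide 8 question "Are 90% of my latencies under 1 second?" becomes a check of whether `digest.cdf(1.0)` is at least 0.9. Because B(x) shrinks toward q=0 and q=1, clusters in the tails stay small, which is where the extra resolution from slide 13 comes from.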
