Feb. 15, 2017

Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in applications as diverse as visualization, optimizing data encodings, estimating quantiles, data synthesis, and imputation. The T-Digest is a versatile sketching data structure: it operates on any numeric data, models tricky distribution tails with high fidelity, and, most crucially, works smoothly with aggregators and map-reduce. T-Digest is a perfect fit for Apache Spark; it is single-pass, and intermediate results can be aggregated across partitions in batch jobs or across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimation, and data synthesis. Attendees of this talk will leave with an understanding of data sketching with T-Digest, and insights about how to apply T-Digest to their own data analysis applications.

Spark Summit



- Sketching Data With T-Digest in Apache Spark (Erik Erlandson, Red Hat, Inc.)
- Introduction: Erik Erlandson, Software Engineer at Red Hat, Inc., Emerging Technologies Group; internal data science; insightful applications
- Why Sketching? ● Faster ● Smaller ● Essential Features
- We All Sketch Data: a raw column of values (3.4, 6.0, 2.5, …) pulled from rows of data is routinely reduced to a compact summary such as Mean = 3.97, Variance = 3.30
- T-Digest • Computing Extremely Accurate Quantiles Using t-Digests • Ted Dunning & Otmar Ertl • https://github.com/tdunning/t-digest • Implementations in Java, Python, R, JS, C++ and Scala
- What is T-Digest Sketching? A stream of raw numeric data (3.4, 6.0, 2.5, …) is reduced to weighted clusters (3.4, 3), (6.0, 2), (2.5, 8), … — equivalently, a sketch of the CDF P(X <= x) over the data domain
- Incremental Updates: Current T-Digest + (x, w) = Updated T-Digest. Large or streaming data is folded into a compact "running" sketch
- The Payoff: given, say, REST service query latencies, the sketch answers questions like: What does my latency distribution look like? I want to simulate my latencies! Are 90% of my latencies under 1 second?
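Questions like these reduce to CDF queries against the sketch. A toy Python illustration of the idea (this is not the isarn-sketches API, and the cluster values are made up; a real t-digest interpolates between neighboring clusters for smoother tail estimates, rather than piling each cluster's mass at its centroid):

```python
# Hypothetical cluster sketch of a latency distribution: (centroid sec, mass).
clusters = [(0.1, 40.0), (0.4, 30.0), (0.9, 20.0), (2.5, 10.0)]

def cdf(x):
    """Estimate P(X <= x) as the fraction of mass at or below x."""
    total = sum(m for _, m in clusters)
    return sum(m for c, m in clusters if c <= x) / total

print(cdf(1.0))  # 0.9 -> yes, 90% of these latencies are under 1 second
```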
- Representation: a distribution's CDF is represented by clusters of (location, mass) = (x, m)
- Update: route an incoming (x, m) to the nearest cluster, update that cluster's location, and increment its mass
- Cluster Mass Bounds: B(x) = C∙M∙q(x)∙(1-q(x)), where q(x) is the quantile of x, M is the total mass (sum of cluster masses), and C is the compression parameter. The bound is zero at q=0 and q=1 and peaks at C∙M/4 at the median
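The bound is simple to evaluate; a small Python sketch (function name is mine, not from the library) shows how it shapes the digest:

```python
def mass_bound(q: float, M: float, C: float) -> float:
    """Maximum mass a cluster centered at quantile q may hold:
    B = C * M * q * (1 - q)."""
    return C * M * q * (1.0 - q)

M, C = 1000.0, 0.5
# The bound vanishes in the tails, forcing many small clusters there...
print(mass_bound(0.0, M, C))  # 0.0
print(mass_bound(1.0, M, C))  # 0.0
# ...and peaks at C*M/4 at the median, allowing fewer, larger clusters.
print(mass_bound(0.5, M, C))  # 125.0
```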
- Bounds Force New Clusters: before adding (x, m) to the nearest cluster (xc, mc), check mc + m against the bound. If mc + m > B(xc), the cluster can only absorb mass up to its bound, becoming (xu, B(xc)), and the excess mass (mc + m) - B(xc) becomes a new cluster at x
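A minimal Python sketch of this split rule (names and structure are mine, not the library's; it assumes mc <= bound, which the digest maintains as an invariant):

```python
def absorb(xc, mc, x, m, bound):
    """Add mass (x, m) to the nearest cluster (xc, mc), splitting when the
    combined mass would exceed the cluster's bound B(xc).
    Returns (updated_cluster, new_cluster_or_None)."""
    if mc + m <= bound:
        # full absorption: mass-weighted centroid update
        mu = mc + m
        xu = (xc * mc + x * m) / mu
        return (xu, mu), None
    # partial absorption: the cluster fills up to its bound, and the
    # excess mass becomes a brand-new cluster located at x
    dm = bound - mc                      # mass the cluster can still accept
    xu = (xc * mc + x * dm) / bound
    return (xu, bound), (x, (mc + m) - bound)

updated, spilled = absorb(xc=1.0, mc=5.0, x=2.0, m=3.0, bound=10.0)
print(updated, spilled)  # (1.375, 8.0) None  -- fits under the bound
```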
- Resolution: the bound yields more small clusters near q=0 and q=1 (high resolution in the tails) and fewer, larger clusters in the middle
- T-Digests are Monoidal: if D1 sketches data C1 and D2 sketches data C2, then D1 |+| D2 sketches C1 ∪ C2
- Monoidal => Map-Reduce: map each Spark data partition P1, P2, …, Pn to a t-digest, then reduce the t-digests with |+| into a single result
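The map-reduce shape only needs |+| to be associative with an identity element. A toy Python stand-in (a real t-digest would also re-compress after each combine to respect the mass bounds; a plain sorted merge is enough to show the monoid structure Spark relies on):

```python
from functools import reduce

def plus(d1, d2):
    """|+| : associative combine of two cluster lists; [] is the identity."""
    return sorted(d1 + d2)

# one hypothetical per-partition sketch, as (location, mass) clusters
partitions = [[(1.0, 2.0)], [(3.0, 1.0)], [(2.0, 4.0)]]
sketch = reduce(plus, partitions, [])
print(sketch)  # [(1.0, 2.0), (2.0, 4.0), (3.0, 1.0)]
```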
- |+|, Randomized Order: reinsert one digest's clusters into the other in randomized order
- |+|, Merged Order: reinsert clusters in merged (sorted) order
- |+|, Large to Small: reinsert clusters from largest mass to smallest
- Comparing |+| Definitions
- Algorithmic Considerations • Clusters are maintained in sorted order by location • Clusters are frequently inserted / deleted / updated • Query the cluster nearest to an incoming (x, m) • Given (x, m), query the prefix sum of cluster mass Σ m' over all clusters (x', m') where x' <= x • Do it all in logarithmic time!
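The two queries can be sketched in Python with a sorted array and `bisect` (queries are O(log n) here, but array inserts are O(n) and this prefix sum is O(n), which is exactly why the real implementation backs the clusters with a balanced tree):

```python
import bisect

xs = [1.0, 3.0, 6.0, 10.0]   # cluster locations, kept sorted
ms = [2.0, 5.0, 1.0, 4.0]    # corresponding cluster masses

def nearest(x):
    """Index of the cluster whose location is closest to x."""
    i = bisect.bisect_left(xs, x)
    if i == 0:
        return 0
    if i == len(xs):
        return len(xs) - 1
    return i if xs[i] - x < x - xs[i - 1] else i - 1

def prefix_mass(x):
    """Total mass of all clusters (x', m') with x' <= x."""
    i = bisect.bisect_right(xs, x)
    return sum(ms[:i])  # a balanced tree with subtree sums makes this O(log n)

print(nearest(5.0))      # 2   (6.0 is closer to 5.0 than 3.0 is)
print(prefix_mass(6.0))  # 8.0 (mass of clusters at 1.0, 3.0, 6.0)
```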
- Backed By Balanced Tree
- Scala Considerations • Immutable Red/Black Tree • Extends Map and MapLike • Capabilities are Mixable Traits – Red/Black – Ordered – Incrementable-Values – Nearest-Neighbor – Prefix-Sum • Interface to Algebird Monoids & Aggregators
- Discrete Distributions (experimental):
  if (tdigest.clusters.size <= max_discrete) {
    // increment by m (or insert a new cluster)
    tdigest.clusters.increment(x, m)
  } else {
    // do full t-digest cluster updating algorithm
    tdigest.update(x, m)
  }
- Applications • Quantile Estimation • Feature Data Characterization • Building CoDecs • Value-At-Risk Modeling • Generative Data Models
- Demo
- Thank You eje@redhat.com @manyangled https://github.com/isarn/isarn-sketches
