
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson


Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications, ranging from visualization and data-encoding optimization to quantile estimation, data synthesis, and imputation. The T-Digest is a versatile sketching data structure: it operates on any numeric data, models tricky distribution tails with high fidelity, and, most crucially, works smoothly with aggregators and map-reduce.

T-Digest is a perfect fit for Apache Spark: it is single-pass, and intermediate results can be aggregated across partitions in batch jobs or across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimation, and data synthesis.

Attendees will leave with an understanding of data sketching with T-Digest and insights into how to apply it to their own data analysis applications.
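The partition-wise pattern the abstract describes, sketch each partition independently and then merge the sketches, can be illustrated with a drastically simplified toy digest: a capped vector of (location, mass) clusters. This is a hedged sketch only; `SketchAggregation`, `update`, and `merge` are hypothetical names, and the capping rule here is a crude nearest-cluster merge, not the real t-digest mass bound.

```scala
// Toy illustration of the sketch-per-partition / merge pattern.
// NOT the real t-digest algorithm: clusters are capped by a fixed count,
// not by the quantile-dependent mass bound.
object SketchAggregation {
  type Cluster = (Double, Long) // (location, mass)

  // Insert one value: start a new singleton cluster while under capacity,
  // otherwise fold the value into the nearest cluster (mass-weighted centroid).
  def update(clusters: Vector[Cluster], x: Double, cap: Int): Vector[Cluster] =
    if (clusters.size < cap) (clusters :+ (x, 1L)).sortBy(_._1)
    else {
      val i = clusters.indices.minBy(j => math.abs(clusters(j)._1 - x))
      val (c, m) = clusters(i)
      clusters.updated(i, ((c * m + x) / (m + 1), m + 1))
    }

  // Monoid-style combine: re-insert each of b's mass units into a.
  // Coarse, but it preserves total mass, which is the key invariant.
  def merge(a: Vector[Cluster], b: Vector[Cluster], cap: Int): Vector[Cluster] =
    b.foldLeft(a) { case (acc, (x, m)) =>
      (1L to m).foldLeft(acc)((d, _) => update(d, x, cap))
    }

  def main(args: Array[String]): Unit = {
    // Three "partitions" of raw data.
    val partitions = Seq(Seq(3.4, 5.0, 9.0), Seq(6.0, 2.1, 7.7), Seq(2.5, 4.4, 3.2))
    // Map: sketch each partition independently.
    val perPartition = partitions.map(
      _.foldLeft(Vector.empty[Cluster])((d, x) => update(d, x, 4)))
    // Reduce: merge the per-partition sketches.
    val combined = perPartition.reduce((a, b) => merge(a, b, 4))
    println(combined.map(_._2).sum) // total mass equals the 9 input points: 9
  }
}
```

In an actual Spark job the same shape maps onto `rdd.aggregate(zero)(seqOp, combOp)`: `update` plays the role of the per-partition `seqOp` and `merge` the cross-partition `combOp`.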


  1. Erik Erlandson, Red Hat, Inc. "Sketching Data With T-Digest in Apache Spark"
  2. Introduction: Erik Erlandson, Software Engineer at Red Hat, Inc., Emerging Technologies Group; Internal Data Science; Insightful Applications
  3. Why Sketching? Faster. Smaller. Essential Features.
  4. We All Sketch Data: raw samples (3.4, 5.0, 9.0; 6.0, 2.1, 7.7; 2.5, 4.4, 3.2; ...) summarized as Mean = 3.97, Variance = 3.30
  5. T-Digest: "Computing Extremely Accurate Quantiles Using t-Digests" by Ted Dunning & Otmar Ertl; https://github.com/tdunning/t-digest; implementations in Java, Python, R, JS, C++ and Scala
  6. What is T-Digest Sketching? Numeric data 3.4, 6.0, 2.5, ... is reduced to weighted clusters (3.4, 3), (6.0, 2), (2.5, 8), ..., or equivalently a sketch of the CDF P(X <= x) over the data domain
  7. Incremental Updates: Current T-Digest + (x, w) = Updated T-Digest; large or streaming data is maintained as a compact "running" sketch
  8. The Payoff: REST service query latencies. What does my latency distribution look like? I want to simulate my latencies! Are 90% of my latencies under 1 second?
  9. Representation: a distribution's CDF is represented by clusters of (location, mass) pairs (x, m)
  10. Update with (x, m): find the nearest cluster, update its location, and increment its mass
  11. Cluster Mass Bounds: with total mass M (sum of cluster masses) and compression C, a cluster at quantile q(x) may hold at most B(x) = C∙M∙q(x)∙(1-q(x)) mass; the bound peaks at C∙M/4 at the median and falls to zero at q=0 and q=1
  12. Bounds Force New Clusters: when merging (x, m) into the nearest cluster (xc, mc) would exceed the bound, i.e. mc + m > B(xc), the existing cluster is capped at (xu, B(xc)) and the excess mass (mc + m) - B(xc) starts a new cluster at x
  13. Resolution: the mass bound yields more small clusters near the tails (q near 0 or 1) and fewer large clusters near the median
  14. T-Digests are Monoidal: if digest D1 sketches cluster set C1 and D2 sketches C2, then D1 |+| D2 sketches C1 ∪ C2
  15. Monoidal => Map-Reduce: map each partition P1, P2, ..., Pn of the data in Spark to a t-digest, then reduce the t-digests with |+| to get the result
  16. |+| by Randomized Order: merge two digests by re-inserting the clusters of D1 and D2 in random order
  17. |+| by Merged Order: re-insert the clusters of D1 and D2 in merged (sorted) order
  18. |+| by Large to Small: re-insert the clusters of D1 and D2 from largest mass to smallest
  19. Comparing |+| Definitions
  20. Algorithmic Considerations:
      • Clusters maintained in sorted order by location
      • Clusters frequently inserted / deleted / updated
      • Query the cluster nearest to an incoming (x, m)
      • Given (x, m), query the prefix sum of cluster mass: the sum of m' over all clusters (x', m') where x' <= x
      • Do it all in logarithmic time!
  21. Backed By Balanced Tree
  22. Scala Considerations:
      • Immutable Red/Black Tree
      • Extends Map and MapLike
      • Capabilities are mixable traits: Red/Black, Ordered, Incrementable-Values, Nearest-Neighbor, Prefix-Sum
      • Interface to Algebird Monoids & Aggregators
  23. Discrete Distributions (experimental):
      if (tdigest.clusters.size <= max_discrete) {
        // increment by m (or insert a new cluster)
        tdigest.clusters.increment(x, m)
      } else {
        // do the full t-digest cluster updating algorithm
        tdigest.update(x, m)
      }
  24. Applications: Quantile Estimation, Feature Data Characterization, Building CoDecs, Value-at-Risk Modeling, Generative Data Models
  25. Demo
  26. Thank You: eje@redhat.com, @manyangled, https://github.com/isarn/isarn-sketches
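The cluster mass bound from slide 11 can be computed directly. A small sketch, assuming only the formula B = C∙M∙q∙(1-q) given on that slide (`MassBound` is a hypothetical name); it shows why clusters near the tails stay small while clusters near the median can grow, which is what gives t-digest its high tail fidelity.

```scala
// Slide 11's cluster mass bound: B(x) = C * M * q(x) * (1 - q(x)).
// C = compression parameter, M = total sketch mass, q in [0, 1].
object MassBound {
  def bound(compression: Double, totalMass: Double, q: Double): Double =
    compression * totalMass * q * (1 - q)

  def main(args: Array[String]): Unit = {
    val (c, m) = (0.1, 1000.0)
    // At the median the bound is maximal: C * M / 4.
    println(bound(c, m, 0.5))
    // Near the tail (q = 0.01) the bound is tiny, forcing small clusters.
    println(bound(c, m, 0.01))
  }
}
```

The quadratic q∙(1-q) factor is the resolution story of slide 13: many small clusters where quantile estimates are hardest, few large ones in the middle.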
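Quantile estimation, the first application on slide 24, amounts to inverting the sketched CDF of slides 6 and 9: walk the sorted (location, mass) clusters accumulating mass until the target fraction is reached. A simplified illustration (`QuantileSketch` is a hypothetical name; the real t-digest additionally interpolates between adjacent cluster centroids for smoother estimates):

```scala
// Estimate a quantile from a (location, mass) cluster sketch by
// inverting the empirical CDF: find the first cluster whose cumulative
// mass reaches q * totalMass.
object QuantileSketch {
  // clusters must be sorted ascending by location
  def quantile(clusters: Vector[(Double, Long)], q: Double): Double = {
    val total  = clusters.map(_._2).sum.toDouble
    val target = q * total
    // running cumulative mass after each cluster
    val cum = clusters.scanLeft(0.0) { case (c, (_, m)) => c + m }.tail
    clusters.zip(cum)
      .find { case (_, c) => c >= target }
      .map(_._1._1)
      .getOrElse(clusters.last._1)
  }

  def main(args: Array[String]): Unit = {
    // The example clusters from slide 6, sorted by location.
    val clusters = Vector((2.5, 8L), (3.4, 3L), (6.0, 2L))
    println(quantile(clusters, 0.5)) // median lands in the heavy first cluster: 2.5
    println(quantile(clusters, 0.9)) // 90th percentile lands in the last cluster: 6.0
  }
}
```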
