DataSketch based aggregations and windowing in a streaming query system

Akshai Sarma and Michael Natkovich
June 21, 2018

Allow Myself to Introduce . . . Myself
■ Akshai Sarma
● asarma@oath.com
● Principal Engineer
● 5+ years of solving data problems at Yahoo

Allow Myself to Introduce . . . Myself
■ Michael Natkovich
● mln@oath.com
● Director Engineer
● 10+ years of causing data problems at Yahoo

Typical Query Engine
Data Flow
Persistence
Queries

Look Forward Query Engine
Data Flow
Query Engine
Current Queryable Data
Future Queryable Data
Old Un-Queryable Data
Query Results

Typical Streaming Query Cost
Data Stream
Query 1 Query 2 Query 10...
1MM events/sec
1K events/sec/core
1K cores per query to read the data
10K cores to read the data for 10 Queries

Bullet Query Cost
Data Stream
Query 1 Query 2 Query 10...
1MM events/sec
1K events/sec/core
1K cores to read the data
10x fewer cores for 10 Queries
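
The arithmetic behind the two cost slides can be checked with a small sketch. The rates come from the slides. The function name is illustrative, and the model assumes that reading the stream dominates the cost, so the per-record work of evaluating each query is ignored.

```python
import math

def cores_to_read(events_per_sec: int, events_per_sec_per_core: int) -> int:
    """Cores needed for one full pass over the stream."""
    return math.ceil(events_per_sec / events_per_sec_per_core)

STREAM_RATE = 1_000_000   # 1MM events/sec (from the slide)
PER_CORE_RATE = 1_000     # 1K events/sec/core (from the slide)
NUM_QUERIES = 10

one_pass = cores_to_read(STREAM_RATE, PER_CORE_RATE)

# Typical engine: every query re-reads the stream, so cost scales with queries.
typical_cores = one_pass * NUM_QUERIES

# Bullet: the stream is read once and each record is checked against all queries.
bullet_cores = one_pass

print(typical_cores, bullet_cores, typical_cores // bullet_cores)  # 10000 1000 10
```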

Bullet
■ Retrieves data that arrives after query submission
● Look Forward!
■ No persistence layer
■ Light-weight, fast, and scalable
■ UI for Ad-Hoc queries
■ API for programmatic querying
■ Pluggable interface to integrate with streaming data

Pluggable Interfaces
1. Create a Schema
● JSON based
● Provides column names, data types, and descriptions
2. Create a Record Converter
● Convert your record format to a BulletRecord
3. Create a Default Starting Query (Optional)
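
As a rough illustration of steps 1 and 2, the sketch below pairs a JSON schema with a converter. It does not use Bullet's actual schema format or the BulletRecord class: the schema keys, the type names, and the plain dict standing in for a record are all assumptions made for this example.

```python
import json

# Hypothetical schema: one entry per column with a name, type, and description.
SCHEMA_JSON = """
[
  {"name": "user_id",  "type": "STRING", "description": "Opaque user identifier"},
  {"name": "duration", "type": "LONG",   "description": "Event duration in ms"}
]
"""

# Assumed mapping from schema type names to Python casts.
CASTS = {"STRING": str, "LONG": int, "DOUBLE": float}

def make_converter(schema_json: str):
    """Return a function that turns a raw dict into a typed, schema-shaped dict."""
    columns = json.loads(schema_json)

    def convert(raw: dict) -> dict:
        record = {}
        for column in columns:
            value = raw.get(column["name"])
            # Missing fields stay None; present fields are cast to the declared type.
            record[column["name"]] = (
                None if value is None else CASTS[column["type"]](value)
            )
        return record

    return convert

convert = make_converter(SCHEMA_JSON)
# Fields that are not in the schema ("extra") are dropped by the converter.
print(convert({"user_id": 42, "duration": "1500", "extra": "ignored"}))
```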

Querying in Bullet
■ Supports filtering and logical operators on typed data
■ Supports aggregations
● Group By, Count Distincts, Top K, Distributions
● DataSketches based
■ Queries have life spans
● All queries run for a specified time window
● Raw queries can terminate early if they have seen a minimum number of records
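
The life span rules above can be sketched as a toy query object. This is not Bullet's implementation: the class name, its parameters, and the explicit timestamps are assumptions chosen to keep the example deterministic.

```python
class RawQuery:
    """Toy raw-record query that expires by time or by record count."""

    def __init__(self, predicate, duration_sec: float, max_records: int,
                 start_time: float):
        self.predicate = predicate
        self.deadline = start_time + duration_sec
        self.max_records = max_records
        self.matches = []

    def consume(self, record: dict, now: float) -> None:
        # Only records that arrive while the query is alive are considered.
        if not self.is_done(now) and self.predicate(record):
            self.matches.append(record)

    def is_done(self, now: float) -> bool:
        # Done when the time window closes or enough records were collected.
        return now >= self.deadline or len(self.matches) >= self.max_records


query = RawQuery(lambda r: r["status"] == 500,
                 duration_sec=30.0, max_records=2, start_time=0.0)
stream = [(1.0, {"status": 200}), (2.0, {"status": 500}),
          (3.0, {"status": 500}), (4.0, {"status": 500})]
for now, record in stream:
    query.consume(record, now)

# The query stops after its second match, well before the 30 second window ends.
print(len(query.matches), query.is_done(now=5.0))  # 2 True
```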

Data Sketches
■ Sketches are a class of stochastic streaming algorithms
■ Provide approximate results (if data is too large)
■ Provable error bounds
■ Fixed memory footprint
■ Mergeable, allowing for parallel processing

Data Sketches in Streams
■ Accurate to a Point
● Sketches sized larger than the data they summarize will be 100% accurate
● Error rate decreases as the size of a Sketch increases (for a Theta Sketch, roughly in proportion to 1/sqrt(size))
■ Fixed Memory Ceiling
● Maximum Sketch size is configured in advance
● Memory cost of a query is thus known in advance
■ Allows Non-additive Operations to be Additive
● Sketches can be merged into a single Sketch without overcounting
● Allows tasks to be parallelized and cheaply combined later
● Allows results to be combined across windows of execution
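
The three properties above can be seen in a simplified K-Minimum-Values (KMV) sketch. This is not the DataSketches Theta Sketch implementation, only a small member of the same family written for illustration. The class name, the SHA-256 based hash, and the value of k are assumptions.

```python
import hashlib

class KmvSketch:
    """Simplified K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k: int = 256):
        self.k = k
        self.hashes = set()   # at most the k smallest distinct hash values seen

    @staticmethod
    def _hash(item: str) -> float:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)

    def _trim(self) -> None:
        # Fixed memory ceiling: never retain more than k hash values.
        if len(self.hashes) > self.k:
            self.hashes = set(sorted(self.hashes)[: self.k])

    def update(self, item: str) -> None:
        self.hashes.add(self._hash(item))
        self._trim()

    def merge(self, other: "KmvSketch") -> "KmvSketch":
        # An item seen by both sketches hashes to the same value, so the
        # set union counts it once.
        merged = KmvSketch(min(self.k, other.k))
        merged.hashes = self.hashes | other.hashes
        merged._trim()
        return merged

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))      # exact: nothing was discarded
        return (self.k - 1) / max(self.hashes)  # k-th smallest hash sets the scale


left, right = KmvSketch(k=256), KmvSketch(k=256)
for i in range(0, 3000):
    left.update(f"user-{i}")
for i in range(2000, 5000):       # 1,000 users also appear in `left`
    right.update(f"user-{i}")

combined = left.merge(right)
# The true distinct count is 5,000; adding the two estimates would give about 6,000.
print(round(combined.estimate()))
```

Each worker or window can build its own sketch, and the merge step stays cheap because it only touches at most k values per sketch.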

Bullet’s Use of Data Sketches
Data Sketch              Query Type
Theta Sketch             Count Distinct
Tuple Sketch             Group By
Quantile Sketch          Distributions
Frequent Items Sketch    Top K

Count Distinct: Naive
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
Overwhelms Single Combiner

Count Distinct: Typical
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distinct
6. Send Count
7. Combine Counts
Vulnerable to Data Skew
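
The skew problem with hash partitioning can be shown with a small simulation. The stream contents, the worker count, and the use of CRC32 as the partitioning hash are assumptions made for illustration.

```python
from collections import Counter
import zlib

NUM_WORKERS = 4

# Skewed stream: one "hot" value dominates the traffic.
stream = ["hot-key"] * 9_000 + [f"key-{i}" for i in range(1_000)]

def hash_partition(value: str) -> int:
    # Stable hash, so the same value always lands on the same worker.
    return zlib.crc32(value.encode("utf-8")) % NUM_WORKERS

# Records handled per worker under each distribution strategy.
hash_load = Counter(hash_partition(v) for v in stream)
round_robin_load = Counter(i % NUM_WORKERS for i, _ in enumerate(stream))

print("hash partition:", sorted(hash_load.values(), reverse=True))
print("round robin:   ", sorted(round_robin_load.values(), reverse=True))
```

Under hash partitioning, the worker that owns the hot value receives at least 9,000 of the 10,000 records, while round robin gives each worker 2,500. Round robin only stays correct for count distinct when the per-worker results can be merged without overcounting.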

Count Distinct: Sketches
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches

Overall Architecture
[Diagram. Components: Bullet WS, Request Processor, Data Processor, Combiner, Bullet Data Stream. Data sources: Performance Stats, Sensor Data, User Activity, IoT Data. Flows labeled: Query, Query & ID, Data Records, Matching Events & ID, Results.]

Backend Layer Detailed Architecture

Future Work
■ SQL-like interface support
■ Security: User-group based authorization
■ Security: User based authentication

In Summary
■ Bullet is a lightweight and cheap stream query engine
■ It offers raw record and OLAP style queries
■ Leverages the power of Data Sketches
■ Only need enough hardware to read the data
● Queries are basically free!
■ Abstraction layer that can sit on any Stream Framework
● Implementations available for Storm and Spark
■ Pluggable allowing for consumption from any data source
■ Fully open sourced

Links
■ Contact Us
● Developers: bullet-dev@googlegroups.com
● Users: bullet-users@googlegroups.com
■ Documentation: https://yahoo.github.io/bullet-docs/
■ Data Sketches: https://datasketches.github.io/
