In this talk, we present how we dealt with the challenges of implementing intractable aggregations on arbitrary streaming data, such as counting distincts, finding the top K items, and computing percentiles of an unknown distribution (for example, the 99th percentile). Handling this while also supporting various windowing mechanisms (tumbling, hopping, sliding, etc.) for obtaining the results of these aggregations is a hefty task. Doing so in a system that operates with no persistence layer on arbitrary, very high-volume data streams in today’s IoT world seems like an impossible problem. We will describe how we solved all of this simply and elegantly using DataSketches in our streaming query engine, Bullet. We will compare the approaches we tried, show why we settled on DataSketches, and explain the tradeoffs.
Bullet is an open-source, lightweight, scalable, multi-tenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet can run arbitrary queries against an unbounded set of data that arrives after the query is submitted. (Bullet queries look forward in time.) These queries can filter, project, and aggregate data in transit. Bullet is also platform- and framework-agnostic. Almost every layer in Bullet can be swapped for a different implementation through our core abstractions: Storm, Spark, etc. for the backend layer, Kafka or another message queue for the PubSub layer, and so on. We will explain our motivation for creating Bullet, the new architecture of Bullet, and how DataSketches fits into it. Finally, we will demo our latest changes to Bullet on a real, high-volume dataset in use in production.
Bullet documentation: https://yahoo.github.io/bullet-docs
DataSketches documentation: https://datasketches.github.io
Speakers
Akshai Sarma, Yahoo, Principal Software Engineer
Michael Natkovich, Yahoo, Director of Engineering
2. Allow Myself to Introduce . . . Myself
Yahoo Confidential & Proprietary
■ Akshai Sarma
● asarma@oath.com
● Principal Engineer
● 5+ years of solving data problems at Yahoo
3. Allow Myself to Introduce . . . Myself
■ Michael Natkovich
● mln@oath.com
● Director of Engineering
● 10+ years of causing data problems at Yahoo
5. Look Forward Query Engine
[Diagram: Data Flow through the Query Engine, with regions labeled Old Un-Queryable Data, Current Queryable Data, and Future Queryable Data; the output is labeled Query Results]
6. Typical Streaming Query Cost
Data Stream: 1MM events/sec
Query 1, Query 2, ... Query 10
1K events/sec/core
1K cores per query to read the data
10K cores to read the data for 10 Queries
7. Bullet Query Cost
Data Stream: 1MM events/sec
Query 1, Query 2, ... Query 10
1K events/sec/core
1K cores to read the data
10x fewer cores for 10 Queries
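The arithmetic behind the two cost slides can be written out as a short sketch. The function names are illustrative, and the shared-read model assumes that the per-query work after reading is negligible next to the cost of reading the stream, which is the premise of the slides rather than a measured result.

```python
def cores_to_read(events_per_sec: int, events_per_sec_per_core: int) -> int:
    """Cores needed to read the whole stream once (rounded up)."""
    return -(-events_per_sec // events_per_sec_per_core)


def typical_cost(events_per_sec: int, per_core: int, num_queries: int) -> int:
    # Typical streaming query: every query reads the full stream itself.
    return cores_to_read(events_per_sec, per_core) * num_queries


def shared_read_cost(events_per_sec: int, per_core: int, num_queries: int) -> int:
    # Shared read: the stream is read once and all queries run against it.
    # Assumption: per-query processing is negligible next to the read cost.
    return cores_to_read(events_per_sec, per_core)


typical = typical_cost(1_000_000, 1_000, 10)      # 1K cores x 10 queries
shared = shared_read_cost(1_000_000, 1_000, 10)   # 1K cores in total
```

With the numbers from the slides, the first model needs 10K cores and the second needs 1K.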
8. Bullet
■ Retrieves data that arrives after query submission
● Look Forward!
■ No persistence layer
■ Light-weight, fast, and scalable
■ UI for Ad-Hoc queries
■ API for programmatic querying
■ Pluggable interface to integrate with streaming data
9. Pluggable Interfaces
1. Create a Schema
● JSON based
● Provides column names, data types, and descriptions
2. Create a Record Converter
● Convert your record format to a BulletRecord
3. Create a Default Starting Query (Optional)
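As a rough illustration of the first two steps, the sketch below pairs a JSON schema with a small converter. Everything in it is hypothetical: the column names, the type names, and the plain dict standing in for a BulletRecord are assumptions made for the example, not Bullet's actual schema format or record API.

```python
import json

# Hypothetical schema: one entry per column with a name, type, and description.
SCHEMA_JSON = """
[
  {"name": "user_id",     "type": "STRING", "description": "Opaque user identifier"},
  {"name": "duration_ms", "type": "LONG",   "description": "Time spent on the page"},
  {"name": "score",       "type": "DOUBLE", "description": "Relevance score"}
]
"""

# Assumed mapping from schema type names to Python casts.
CASTS = {"STRING": str, "LONG": int, "DOUBLE": float}


def load_schema(text: str) -> dict:
    """Parse the JSON schema into a {column name: type name} mapping."""
    return {column["name"]: column["type"] for column in json.loads(text)}


def convert(raw: dict, schema: dict) -> dict:
    """Convert a raw event into a typed record (a dict stands in for BulletRecord).

    Columns missing from the event are skipped; fields not in the schema are dropped.
    """
    record = {}
    for name, type_name in schema.items():
        value = raw.get(name)
        if value is not None:
            record[name] = CASTS[type_name](value)
    return record


schema = load_schema(SCHEMA_JSON)
record = convert({"user_id": 42, "duration_ms": "1500", "debug": "x"}, schema)
```

The point of the converter is only that every source record is normalized to one typed shape before queries see it.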
10. Querying in Bullet
■ Supports filtering and logical operators on typed data
■ Supports aggregations
● Group By, Count Distincts, Top K, Distributions
● DataSketches based
■ Queries have life spans
● All queries run for a specified time window
● Raw queries can terminate early if they have seen a minimum number of records
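The life-span rules above can be sketched as follows. This is a minimal model under stated assumptions, not Bullet's implementation: the class name `RawQuery`, its fields, and the caller-supplied clock are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RawQuery:
    duration_ms: int                  # life span of the query
    record_limit: int                 # finish early after this many matches
    matches: Callable[[dict], bool]   # filter applied to each record
    started_at_ms: int = 0
    results: List[dict] = field(default_factory=list)

    def is_done(self, now_ms: int) -> bool:
        """A query ends when its window expires or it has enough records."""
        expired = now_ms - self.started_at_ms >= self.duration_ms
        satisfied = len(self.results) >= self.record_limit
        return expired or satisfied

    def offer(self, record: dict, now_ms: int) -> bool:
        """Feed one record; returns False once the query has finished."""
        if self.is_done(now_ms):
            return False
        if self.matches(record):
            self.results.append(record)
        return True


query = RawQuery(duration_ms=1000, record_limit=2,
                 matches=lambda r: r["country"] == "US")
query.offer({"country": "US"}, now_ms=10)
query.offer({"country": "DE"}, now_ms=20)   # filtered out
query.offer({"country": "US"}, now_ms=30)   # second match: query is satisfied
```

Only data offered after the query starts is ever considered, which matches the look-forward model.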
11. Data Sketches
■ Sketches are a class of stochastic streaming algorithms
■ Provide approximate results (if data is too large)
■ Provable error bounds
■ Fixed memory footprint
■ Mergeable, allowing for parallel processing
12. Data Sketches in Streams
■ Accurate to a Point
● Sketches sized larger than the data they summarize give exact results
● Error shrinks as the size of a Sketch grows
■ Fixed Memory Ceiling
● Maximum Sketch size is configured in advance
● Memory cost of a query is thus known in advance
■ Allows Non-additive Operations to be Additive
● Sketches can be merged into a single Sketch without over-counting
● Allows tasks to be parallelized and cheaply combined later
● Allows results to be combined across windows of execution
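The three properties above can be illustrated with a small K-Minimum-Values (KMV) distinct-count sketch. KMV is used here as a simple stand-in; it is not the DataSketches Theta Sketch implementation, and the class name, the SHA-1 hashing, and the parameter `k` are choices made for this example.

```python
import hashlib


class KMVSketch:
    """Keeps only the k smallest hash values seen (fixed memory ceiling)."""

    def __init__(self, k: int):
        self.k = k            # assumed k >= 2
        self.hashes = set()   # never holds more than k values

    @staticmethod
    def _hash(item) -> float:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)

    def update(self, item) -> None:
        h = self._hash(item)
        if h in self.hashes:
            return                       # duplicates never change the sketch
        if len(self.hashes) < self.k:
            self.hashes.add(h)
            return
        largest = max(self.hashes)       # O(k) scan, kept simple for clarity
        if h < largest:
            self.hashes.remove(largest)
            self.hashes.add(h)

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        """Union of two sketches: the k smallest hashes across both."""
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[:merged.k])
        return merged

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))     # below capacity: exact count
        return (self.k - 1) / max(self.hashes)


# Two partitions (or windows) that share 50 users.
left, right = KMVSketch(256), KMVSketch(256)
for i in range(100):
    left.update(f"user-{i}")
for i in range(50, 150):
    right.update(f"user-{i}")
merged = left.merge(right)
```

Adding the two per-partition counts would report 200 users, while the merged sketch counts the 150 distinct users, because an item seen on both sides maps to the same hash value and is kept once.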
13. Bullet’s Use of Data Sketches
Data Sketch             Query Type
Theta Sketch            Count Distinct
Tuple Sketch            Group By
Quantile Sketch         Distributions
Frequent Items Sketch   Top K
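To give a feel for how a Top K query can run in bounded memory, here is a sketch of the classic Misra-Gries frequent-items algorithm. It is used as a stand-in to show the idea and is not the DataSketches Frequent Items Sketch implementation; the class and method names are hypothetical.

```python
class MisraGries:
    """Tracks at most `capacity` counters regardless of stream length."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counters = {}

    def update(self, item) -> None:
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.capacity:
            self.counters[item] = 1
        else:
            # No room: decrement every counter and drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def top(self, n: int):
        """Items with the n largest counters (counts are lower bounds)."""
        ranked = sorted(self.counters.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


# A stream with two heavy items and 100 items that appear once each.
stream = []
for i in range(500):
    stream.append("a")
    if i < 300:
        stream.append("b")
    if i < 100:
        stream.append(f"rare-{i}")

sketch = MisraGries(capacity=10)
for item in stream:
    sketch.update(item)
top_two = sketch.top(2)
```

Each counter can undercount its item by at most n / (capacity + 1) for a stream of n items, so items that are frequent enough are always retained.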
14. Count Distinct: Naive
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
Overwhelm Single Combiner
15. Count Distinct: Typical
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distinct
6. Send Count
7. Combine Counts
Vulnerable to Data Skew
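The skew problem with hash partitioning can be seen in a small simulation. The data set, the worker count, and the use of CRC32 as the partitioning hash are assumptions chosen for the example.

```python
import zlib
from collections import Counter


def round_robin_loads(records, workers: int) -> Counter:
    """Records per worker when records are dealt out in turn."""
    loads = Counter()
    for index, _ in enumerate(records):
        loads[index % workers] += 1
    return loads


def hash_partition_loads(records, workers: int) -> Counter:
    """Records per worker when each value is routed by its hash."""
    loads = Counter()
    for value in records:
        loads[zlib.crc32(value.encode("utf-8")) % workers] += 1
    return loads


# Skewed field: one hot value makes up 90% of the records.
records = ["hot"] * 9000 + [f"user-{i}" for i in range(1000)]

even = round_robin_loads(records, workers=4)
skewed = hash_partition_loads(records, workers=4)
```

Round robin gives every worker 2,500 records, while hash partitioning sends all 9,000 copies of the hot value to a single worker, since equal values must land together for the per-worker distinct counts to be summable.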
16. Count Distinct: Sketches
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches
18. [Architecture diagram. Components: Bullet WS, Request Processor, Data Processor, Combiner. Input: Bullet Data Stream (Performance Stats, Sensor Data, User Activity, IoT Data). Edge labels: Query, Query & ID, Data Records, Matching Events & ID, Results.]
21. Future Work
■ SQL-like interface support
■ Security: User-group based authorization
■ Security: User based authentication
22. In Summary
■ Bullet is a lightweight and cheap stream query engine
■ It offers raw record and OLAP style queries
■ Leverages the power of Data Sketches
■ Only need enough hardware to read the data
● Queries are basically free!
■ Abstraction layer that can sit on any Stream Framework
● Implementations available for Storm and Spark
■ Pluggable allowing for consumption from any data source
■ Fully open sourced