Akshai Sarma and Michael Natkovich
June 21, 2018
Allow Myself to Introduce . . . Myself
Yahoo Confidential & Proprietary
■ Akshai Sarma
● asarma@oath.com
● Principal Engineer
● 5+ years of solving data problems at Yahoo
Allow Myself to Introduce . . . Myself
■ Michael Natkovich
● mln@oath.com
● Director Engineer
● 10+ years of causing data problems at Yahoo
Typical Query Engine
Data Flow
Persistence
Queries
Look Forward Query Engine
Data Flow
Query Engine
Current Queryable Data
Future Queryable Data
Old Un-Queryable Data
Query Results
Typical Streaming Query Cost
Data Stream: 1MM events/sec
Query 1, Query 2, ..., Query 10
1K events/sec/core
1K cores per query to read the data
10K cores to read the data for 10 Queries
Bullet Query Cost
Data Stream: 1MM events/sec
Query 1, Query 2, ..., Query 10
1K events/sec/core
1K cores to read the data
10x fewer cores for 10 Queries
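The arithmetic behind the two cost slides can be written out as a short sketch. The function names are illustrative, and the shared-read model assumes that evaluating a query against an already-read event costs nothing, which is the simplification the slides make.

```python
import math

def cores_to_read(events_per_sec: int, events_per_sec_per_core: int) -> int:
    """Cores needed for one full pass over the stream."""
    return math.ceil(events_per_sec / events_per_sec_per_core)

def typical_cost(queries: int, events_per_sec: int, per_core: int) -> int:
    # Typical streaming engine: every query reads the whole stream itself.
    return queries * cores_to_read(events_per_sec, per_core)

def shared_read_cost(queries: int, events_per_sec: int, per_core: int) -> int:
    # Shared read: the stream is read once and all queries ride on that pass.
    # Per-query evaluation cost is treated as negligible (an assumption).
    return cores_to_read(events_per_sec, per_core)

typical = typical_cost(10, 1_000_000, 1_000)
shared = shared_read_cost(10, 1_000_000, 1_000)
print(typical, shared, typical // shared)  # 10000 1000 10
```

With the numbers from the slides (1MM events/sec, 1K events/sec/core, 10 queries) this gives 10K cores versus 1K cores, the 10x difference quoted above.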
Bullet
■ Retrieves data that arrives after query submission
● Look Forward!
■ No persistence layer
■ Light-weight, fast, and scalable
■ UI for Ad-Hoc queries
■ API for programmatic querying
■ Pluggable interface to integrate with streaming data
Pluggable Interfaces
1. Create a Schema
● JSON based
● Provides column names, data types, and descriptions
2. Create a Record Converter
● Convert your record format to a BulletRecord
3. Create a Default Starting Query (Optional)
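A rough idea of what the schema and record converter steps involve, as a Python sketch. Everything here is hypothetical: the schema key names, the type names, and the `convert` function are illustrative assumptions, and a plain dict stands in for a BulletRecord. Consult the Bullet documentation for the actual schema format and converter interface.

```python
import json

# Hypothetical schema: the key names ("name", "type", "description") are
# illustrative assumptions, not Bullet's documented schema format.
SCHEMA_JSON = """
[
  {"name": "user_id",   "type": "STRING",  "description": "Anonymous user identifier"},
  {"name": "duration",  "type": "LONG",    "description": "Time on page in milliseconds"},
  {"name": "is_mobile", "type": "BOOLEAN", "description": "True for mobile clients"}
]
"""

# Map of the assumed schema type names to Python casts.
CASTS = {"STRING": str, "LONG": int, "BOOLEAN": bool}

def convert(raw: dict, schema: list) -> dict:
    """Convert a source record to a typed dict (a stand-in for a BulletRecord)."""
    record = {}
    for column in schema:
        name = column["name"]
        if raw.get(name) is not None:
            record[name] = CASTS[column["type"]](raw[name])
    return record  # fields not named in the schema are dropped

schema = json.loads(SCHEMA_JSON)
raw_event = {"user_id": 42, "duration": "1530", "is_mobile": 1, "debug": "dropped"}
record = convert(raw_event, schema)
print(record)
```

The converter is the only piece that knows about the source format, which is what lets the rest of the system stay independent of where the data comes from.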
Querying in Bullet
■ Supports filtering and logical operators on typed data
■ Supports aggregations
● Group By, Count Distincts, Top K, Distributions
● DataSketches based
■ Queries have life spans
● All queries run for a specified time window
● Raw queries can terminate early if they have seen a minimum
number of records
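The life span rules can be illustrated with a minimal sketch. The function `run_raw_query` and its parameters are hypothetical names for illustration, not Bullet's API: a query stops when its time window closes, and a raw query may also stop as soon as it has collected enough records.

```python
import itertools
import time

def run_raw_query(stream, predicate, duration_sec, record_limit=None,
                  clock=time.monotonic):
    """Collect matching records until the window closes or the limit is hit."""
    deadline = clock() + duration_sec
    matched = []
    for record in stream:
        if clock() >= deadline:          # life span expired
            break
        if predicate(record):
            matched.append(record)
            if record_limit is not None and len(matched) >= record_limit:
                break                    # early termination for raw queries
    return matched

# Early termination: stops after 3 matching records, long before 60 seconds.
evens = run_raw_query(iter(range(100)), lambda r: r % 2 == 0,
                      duration_sec=60, record_limit=3)

# Window expiry, using a fake clock that advances one "second" per call.
ticks = itertools.count()
windowed = run_raw_query(iter(range(100)), lambda r: True,
                         duration_sec=5, clock=lambda: next(ticks))
print(evens, windowed)  # [0, 2, 4] [0, 1, 2, 3]
```

This sketch only checks the deadline when a record arrives; a real engine would also need a timer so that a query on a quiet stream still ends on time.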
Data Sketches
■ Sketches are a class of stochastic
streaming algorithms
■ Provide approximate results (if data
is too large)
■ Provable error bounds
■ Fixed memory footprint
■ Mergeable, allowing for parallel
processing
Data Sketches in Streams
■ Accurate to a Point
● A Sketch large enough to hold every distinct value it sees is 100% accurate
● Beyond that point, error shrinks as the Sketch grows (for a Theta Sketch,
roughly as 1/sqrt(size))
■ Fixed Memory Ceiling
● Maximum Sketch size is configured in advance
● Memory cost of a query is thus known in advance
■ Allows Non-additive Operations to be Additive
● Sketches can be merged into a single Sketch without
over-counting
● Allows tasks to be parallelized and cheaply combined later
● Allows results to be combined across windows of execution
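The three properties above (exact up to a point, fixed memory ceiling, mergeable without over-counting) can be seen in a toy K-Minimum-Values sketch. This is a simplified illustration written for this purpose, not the DataSketches library's Theta Sketch; the class name and methods are assumptions.

```python
import hashlib

class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustration only)."""

    def __init__(self, k: int):
        self.k = k            # memory ceiling: at most k hashes are kept
        self.hashes = set()   # the k smallest 64-bit hashes seen so far

    @staticmethod
    def _hash(item) -> int:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, item) -> None:
        self.hashes.add(self._hash(item))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        merged = KMVSketch(min(self.k, other.k))
        # Duplicates across sketches collapse in the union, so nothing is
        # counted twice; then keep only the k smallest hashes.
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))       # below capacity: exact
        kth_smallest = max(self.hashes) / 2**64  # scaled into (0, 1)
        return (self.k - 1) / kth_smallest

a, b = KMVSketch(1024), KMVSketch(1024)
for i in range(0, 300):
    a.update(f"user{i}")
for i in range(200, 500):
    b.update(f"user{i}")          # users 200-299 appear in both sketches
merged = a.merge(b)
print(a.estimate() + b.estimate(), merged.estimate())  # 600.0 500.0
```

Adding the two counts gives 600 because 100 users are counted twice, while merging the sketches gives 500. Once more than `k` distinct values arrive, the sketch stops growing and the answer becomes an estimate.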
Bullet’s Use of Data Sketches
Data Sketch | Query Type
Theta Sketch | Count Distinct
Tuple Sketch | Group By
Quantile Sketch | Distributions
Frequent Items Sketch | Top K
Count Distinct: Naive
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
■ Can Overwhelm the Single Combiner
Count Distinct: Typical
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distinct
6. Send Count
7. Combine Counts
■ Vulnerable to Data Skew
Count Distinct: Sketches
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches
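The trade-offs between the three Count Distinct pipelines can be reproduced with a small simulation. This is a sketch under stated assumptions: the function names are illustrative, the workers are plain Python lists, and in the last pipeline an exact `set` stands in for the sketch purely to show the merge step (a real sketch would also cap each summary's memory).

```python
import zlib

def round_robin(values, workers):
    """Spread input evenly over workers, ignoring the values themselves."""
    buckets = [[] for _ in range(workers)]
    for i, value in enumerate(values):
        buckets[i % workers].append(value)
    return buckets

def naive(values, workers):
    # Every extracted value is forwarded to the single combiner.
    sent = [v for bucket in round_robin(values, workers) for v in bucket]
    return len(set(sent)), len(sent)       # (distinct count, combiner load)

def hash_partitioned(values, workers, partitions):
    # Equal values land on the same partition, so per-partition counts add up.
    parts = [[] for _ in range(partitions)]
    for bucket in round_robin(values, workers):
        for v in bucket:
            parts[zlib.crc32(v.encode("utf-8")) % partitions].append(v)
    return sum(len(set(p)) for p in parts), max(len(p) for p in parts)

def merged_summaries(values, workers):
    # Each worker summarizes locally; the combiner merges one summary each.
    summaries = [set(bucket) for bucket in round_robin(values, workers)]
    summed = sum(len(s) for s in summaries)    # wrong: counts overlap
    merged = len(set().union(*summaries))      # right: merge, then count
    return merged, summed, len(summaries)

# Skewed input: one hot value dominates the stream.
values = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]
print(naive(values, 4))                # (1001, 10000)
print(hash_partitioned(values, 4, 4))
print(merged_summaries(values, 4))     # (1001, 1004, 4)
```

The naive combiner receives all 10,000 values. Hash partitioning gets the right count, but whichever partition owns `"hot"` receives at least 9,000 of the values. With per-worker summaries the combiner receives only four messages, and merging them (rather than summing their counts) avoids counting `"hot"` once per worker.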
[Diagram: Bullet query flow]
Data sources: Performance Stats, Sensor Data, User Activity, IoT Data
Components: Bullet Data Stream, Bullet WS, Request Processor, Data Processor, Combiner
Flow labels: Query, Query & ID, Data Records, Matching Events & ID, Results
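One way to read the components in this diagram is as a three-stage flow, sketched below. The function names mirror the diagram's boxes, but the code is a hypothetical single-process illustration, not Bullet's implementation: queries are plain predicates and results are lists of matching records.

```python
import itertools
from collections import defaultdict

_ids = itertools.count(1)

def request_processor(query):
    """Tag an incoming query with an ID ("Query & ID")."""
    return next(_ids), query

def data_processor(records, active_queries):
    """Read each record once and test it against every active query."""
    for record in records:
        for query_id, predicate in active_queries.items():
            if predicate(record):
                yield query_id, record     # "Matching Events & ID"

def combiner(matches):
    """Group matching events by query ID to form each query's results."""
    results = defaultdict(list)
    for query_id, record in matches:
        results[query_id].append(record)
    return dict(results)

records = [
    {"source": "sensor", "value": 7},
    {"source": "user", "value": 3},
    {"source": "sensor", "value": 12},
]
active = dict([
    request_processor(lambda r: r["source"] == "sensor"),
    request_processor(lambda r: r["value"] < 5),
])
results = combiner(data_processor(records, active))
print(results)
```

The stream is iterated once no matter how many queries are active, which is the property behind the "1K cores to read the data" cost figure earlier in the deck.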
Overall Architecture
Backend Layer Detailed Architecture
Future Work
■ SQL-like interface support
■ Security: User-group based authorization
■ Security: User based authentication
In Summary
■ Bullet is a lightweight and cheap stream query engine
■ It offers raw record and OLAP style queries
■ Leverages the power of Data Sketches
■ Only needs enough hardware to read the data
● Queries are basically free!
■ Abstraction layer that can sit on any Stream Framework
● Implementations available for Storm and Spark
■ Pluggable, allowing consumption from any data source
■ Fully open sourced
Links
■ Contact Us
● Developers: bullet-dev@googlegroups.com
● Users: bullet-users@googlegroups.com
■ Documentation: https://yahoo.github.io/bullet-docs/
■ Data Sketches: https://datasketches.github.io/
QUESTIONS

DataSketch based aggregations and windowing in a streaming query system
