
Bullet: A Real Time Data Query Engine

Bullet is an open-source, lightweight, pluggable query engine for streaming data, implemented on top of Storm and requiring no persistence layer. It lets you filter, project, and aggregate data in transit, and it ships with a UI and a web service (WS). Instead of running queries over a finite set of data that has already arrived and been persisted, or running a static query defined when the stream starts, Bullet executes queries against an arbitrary set of data that arrives after the query is submitted. In other words, it is a look-forward system.
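A minimal sketch of this look-forward model, assuming nothing about Bullet's internals (the function name and record shapes below are illustrative, not Bullet's actual code): the query only ever sees records that arrive after it is submitted, filters and projects them in transit, and persists nothing.

```python
import time

def look_forward_query(stream, predicate, project, duration_s=0.0, max_records=None):
    """Run a query against records arriving AFTER submission; nothing is persisted."""
    deadline = time.monotonic() + duration_s
    results = []
    for record in stream:  # records flowing past after the query was submitted
        if duration_s and time.monotonic() > deadline:
            break  # every query has a bounded life span
        if predicate(record):
            results.append(project(record))
            if max_records and len(results) >= max_records:
                break  # Raw-style queries can terminate early
    return results

# Find records for one user, keeping only two fields.
events = [{"user": "u1", "page": "home", "ms": 12},
          {"user": "u2", "page": "news", "ms": 40},
          {"user": "u1", "page": "mail", "ms": 7}]
hits = look_forward_query(events, lambda r: r["user"] == "u1",
                          lambda r: {"page": r["page"], "ms": r["ms"]})
# hits == [{"page": "home", "ms": 12}, {"page": "mail", "ms": 7}]
```

In the real system the stream is unbounded and the filtering runs inside Storm bolts; the in-memory list here only stands in for data flowing past the query.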

Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.

An instance of Bullet is currently running at Yahoo against its user engagement data pipeline. We’ll highlight how it powers internal use cases such as validating web page and native app instrumentation. Finally, we’ll show a demo of Bullet and go over query performance numbers.


  1. A REAL TIME DATA QUERY ENGINE
     Michael Natkovich, Akshai Sarma
  2. ALLOW MYSELF TO INTRODUCE … MYSELF
     • Akshai Sarma • asarma@yahoo-inc.com
     • Senior Engineer
     • 4+ years of solving data problems at Yahoo
  3. ALLOW MYSELF TO INTRODUCE … MYSELF
     • Michael Natkovich • mln@yahoo-inc.com
     • Director of Engineering
     • 10+ years of causing data problems at Yahoo
  4. THE WHY
  5. INSTRUMENTATION
     • Code added to web pages or apps to track usage and behavior
       • User Identity • Engagement • Location
     • Drives all data applications
       • Targeting • Personalization • Analytics
  6. CYCLE OF SADNESS
     • Instrumentation validation is unbearably slow
     • Needs to be seconds, not minutes or hours
     • Needs to be easy to query
     • Needs programmatic access
  7. EXISTING OPTIONS
     • The various ways to obtain data were either not fast enough or impossible to query:

       Type        Latency     Downside
       Batch       Hours       Too slow
       Mini-Batch  20 Minutes  Faster, but still too slow
       Streaming   2 Seconds   Fast enough, but no way to query
  8. THE WHAT
  9. TYPICAL QUERYING (diagram)
     Data Flow → Persistence → Queries
  10. ATYPICAL QUERYING (diagram)
     Data Flow: Old Un-Queryable Data | Current Queryable Data | Future Queryable Data
     Query Engine: Query → Results
  11. BULLET
     • Retrieves data that arrives after query submission – Look Forward
     • No persistence layer
     • Light-weight, fast, and scalable
     • UI for ad-hoc queries
     • API for programmatic querying
     • Pluggable interface to integrate with streaming data
  12. QUERYING IN BULLET
     • Supports filtering and logical operators on typed data
     • Supports aggregations
       • Group By, Count Distinct, Top K, Distributions
       • DataSketches based
     • Queries have life spans
       • All queries run for a specified time window
       • Raw queries can terminate early if they have seen a minimum number of records
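As a concrete illustration, a filter-plus-aggregation query with a life span might look like the JSON below. The exact field names are our assumption for illustration, not necessarily Bullet's query specification:

```python
import json

# Hypothetical Bullet-style query: filter typed data, aggregate with a
# sketch-backed Count Distinct, and bound the query's life span.
query = {
    "filters": [
        {"field": "demographics.country", "operation": "==", "values": ["us"]}
    ],
    "aggregation": {"type": "COUNT DISTINCT", "fields": {"user_id": ""}},
    "duration": 30000,  # ms; all queries run for a specified time window
}
print(json.dumps(query, indent=2))
```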
  13. DATASKETCHES
     • Sketches are a class of stochastic streaming algorithms
     • Provide approximate results (if data is too large)
     • Provable error bounds
     • Fixed memory footprint
     • Mergeable, allowing for parallel processing
     (Sketches logo from https://datasketches.github.io)
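These properties can be seen in a toy k-minimum-values sketch. This is our own illustration of the idea, not the DataSketches library: memory is fixed (at most k hashes), partial sketches merge losslessly, and the distinct-count estimate has relative error on the order of 1/√k.

```python
import hashlib

class KMVSketch:
    """Toy k-minimum-values sketch for approximate count-distinct."""

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # the k smallest hash values seen (fixed memory)

    def _hash(self, item):
        # Map any item to a uniform value in [0, 1) via SHA-1.
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        return h / float(1 << 160)

    def update(self, item):
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.discard(max(self.mins))  # keep only the k smallest

    def merge(self, other):
        # Mergeable: union the candidates, keep the k smallest overall.
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)  # exact while the stream is small
        return int((self.k - 1) / max(self.mins))

# Two workers see overlapping halves of a stream; the combiner merges.
a, b = KMVSketch(), KMVSketch()
for i in range(5000):
    a.update(i)
for i in range(2500, 7500):
    b.update(i)
est = a.merge(b).estimate()  # true distinct count is 7500
```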
  14. DEMO
  15. [demo screenshot]
  16. QUERY FLOW (architecture diagram)
     • Components: Bullet WS, Request Processor, Data Processor, Combiner, Bullet Data Stream
     • Flows: Query & ID → processors; Data Records → Data Processor; Matching Events & ID → Combiner; Results → Bullet WS → Query Results
     • Example data sources: Performance Stats, Sensor Data, User Activity, IoT Data
  17. USE CASES: BEYOND INSTRUMENTATION VALIDATION
     • See sample values of a field
       • What does a country code look like?
     • Cardinality of fields for Druid ingestion
       • 10s, 100s, 1000s of unique values?
     • Check that a new experiment is running
       • Is data coming in for all my test buckets?
  18. THE HOW
  19. ARCHITECTURE
  20. STORM TERMINOLOGY
     • Tuple: the basic unit of data in Storm
     • Stream: an unbounded set of tuples
     • Spout: a source of tuples (Kafka, Flume, etc.)
     • Bolt: a tuple processor
     • Topology: a DAG of Spouts and Bolts
     • DRPC: Distributed Remote Procedure Call
     (Storm logo from https://storm.apache.org)
  21. [architecture diagram]
  22. PLUGGABLE INTERFACE
     1. Run Bullet on your data
        • Write a Spout/Topology to read data from Kafka, Flume, HDFS, etc.
        • Convert data into a Bullet Record (Avro)
     2. Plug in a schema (if you need the UI)
        • JSON based
        • Provides field names, types, and descriptions
     3. Plug in a default starting query (optional, for the UI)
        • Example: a query based on a UI user's cookie
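For step 2, a schema entry could look like the following. The JSON layout is our assumption of what "field names, types, and descriptions" means, not the exact format the UI expects:

```python
import json

# Hypothetical UI schema: one entry per field, with a name, a type,
# and a human-readable description.
schema = [
    {"name": "user_id", "type": "STRING",
     "description": "Anonymized visitor identifier"},
    {"name": "demographics", "type": "MAP",
     "description": "Visitor demographics, e.g. country code"},
    {"name": "duration_ms", "type": "LONG",
     "description": "Event duration in milliseconds"},
]
print(json.dumps(schema, indent=2))
```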
  23. PERFORMANCE
     • Scaling for data
       1. Scale the pluggable data processing component
       2. Scale Filter Bolts for handling data volume
     • Scaling for simultaneous queries
       1. Scale Filter Bolts
       2. Scale DRPC components – DRPC servers primarily
       3. Scale Join Bolts
  24. TEST HARDWARE
     • Storm 1.0 cluster
       • 2 x Intel E5-2680v3 (12 cores, 24 threads) – 48 v. cores
       • 256 GB RAM
       • 10 G network interface
       • Multi-tenant
     • Reading data from a Kafka 0.10 cluster
       • In the same data center, so network delays are minimal
  25. DATA
     • Average size: 4.33 KiB compressed (1.2 compression ratio)
     • Data volume measured in records per second (R/s) and mebibytes per second (MiB/s)
     • 92 top-level fields
       • 62 Strings • 4 Longs • 23 Maps • 3 Lists of Maps
  26. FINDING A SINGLE GENERATED RECORD
     • Data volume: 67,400 R/s and 104 MiB/s
     • Average of 100 Bullet queries to find a single generated record:

       Timestamp                    Delay (ms)
       Query received in Bullet     0
       Record generated             31.3
       Record submitted to Kafka    357.9
       Record received in Bullet    1008.2
       Record found in Bullet       1015.4
       Query finished in Bullet     1018.3

     • Bullet latency is 1018.3 – 1008.2 = 10.1 ms
  27. SCALING FOR DATA: GOALS
     • Read the data
     • Catch up on data backlog at a > 5:1 ratio (5 s of backlog in 1 s)
     • Support 400 Raw Bullet queries concurrently
     • Max record-finding latency < 200 ms at 400 queries
  28. SCALING FOR DATA: CPU
  29. SCALING FOR DATA: MEMORY
  30. SCALING FOR DATA: SUMMARY
     • For 400 Raw queries and the data reading goals:
       • CPU to memory ratio: 1 core : 1.2 GiB
       • CPU to data ratio: 1 core : 856 R/s, 1 core : 3.4 MiB/s
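These ratios turn capacity planning into simple division; a back-of-the-envelope helper (the function names and ceiling rounding are ours):

```python
import math

R_PER_CORE = 856     # measured above: 1 core handles ~856 records/s
GIB_PER_CORE = 1.2   # measured above: 1 core pairs with ~1.2 GiB

def cores_needed(records_per_s):
    """Cores required to keep up with a given record rate."""
    return math.ceil(records_per_s / R_PER_CORE)

def memory_needed_gib(cores):
    """Memory implied by the CPU-to-memory ratio."""
    return cores * GIB_PER_CORE

cores = cores_needed(67_400)   # the record-finding test's data volume
mem = memory_needed_gib(cores)
# cores == 79; mem ≈ 94.8 GiB
```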
  31. SCALING FOR QUERIES: GOALS
     • Fixed data volume: 68,400 R/s and 105 MiB/s
     • Measure latency to find a record after it is first seen by Bullet
       • As the number of Filter Bolts (1 v. core, 1024 MiB RAM each) varies
       • As the number of simultaneous Raw queries varies
     • Each query runs for 30 s and looks for 10 generated records
     • Want max latency < 200 ms
  32. SCALING FOR QUERIES: CPU
  33. SCALING FOR QUERIES: LATENCY
  34. SCALING FOR QUERIES: DRPC
     • 3 DRPC servers on our test Storm cluster
       • 2 x Intel E5620 (4 cores, 8 threads) – 16 v. cores
       • 24 GB RAM
       • 10 G network
     • About 700 simultaneous Bullet queries
     • Horizontally scalable
     • Blocking threads at the moment; async implementation in Storm 2.0
  35. ANNOUNCING OPEN SOURCE
     • We are on GitHub!
     • Documentation: https://yahoo.github.io/bullet-docs
     • Contributions, ideas, and feedback welcome!

       Component  Repo
       Storm      https://github.com/yahoo/bullet-storm
       WS         https://github.com/yahoo/bullet-service
       UI         https://github.com/yahoo/bullet-ui
       Record     https://github.com/yahoo/bullet-record

  36. SUMMARY
     • Wanted to validate instrumentation but ended up with generic querying
     • Query any data that can be plugged into Storm
     • Queries first, then data → look-forward querying
     • Persists no data → light-weight and cheap!
     • Fetch Raw data
     • Aggregate: Group By, Top K, Distributions, Count Distinct
  37. FUTURE WORK
     • Considering a Pub/Sub queue to receive queries and send results
       • Allows Bullet implementations on other stream processors
     • Incremental updates
       • WebSockets or SSE to push results
       • Streaming results
       • Additive results
     • Security
     • SQL interface
  38. THANKS
     • Nathan Speidel • Cat Utah • Marcus Svedman • Satish Vanimisetti
  39. LINKS
     • Contact us
       • Developers: bullet-dev@googlegroups.com
       • Users: bullet-users@googlegroups.com
     • Documentation: https://yahoo.github.io/bullet-docs
     • DataSketches: https://datasketches.github.io
  40. APPENDIX
  41. COUNT DISTINCT: NAIVE
     1. Read Input  2. Round Robin  3. Extract Field  4. Send to Combiner  5. Count Distincts
     → Overwhelms a single Combiner
  42. COUNT DISTINCT: TYPICAL
     1. Read Input  2. Round Robin  3. Extract Field  4. Hash Partition  5. Count Distincts  6. Send Count  7. Combine Counts
     → Vulnerable to data skew
  43. COUNT DISTINCT: SKETCHES
     1. Read Input  2. Round Robin  3. Build Sketch  4. Send to Combiner  5. Merge Sketches
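The difference between the three pipelines is what reaches the Combiner. Because sketches merge, workers can be fed round-robin with no hash partitioning, so no hot key can skew a single worker. A small simulation of the sketch pipeline, with Python sets standing in for sketches (sets are also mergeable, just not fixed-memory):

```python
# 10,000 records with 700 distinct users, spread over 4 workers.
records = [f"user{i % 700}" for i in range(10_000)]

workers = [set() for _ in range(4)]
for i, record in enumerate(records):   # 2. Round Robin
    workers[i % 4].add(record)         # 3. Build (stand-in) Sketch

combined = set().union(*workers)       # 4./5. Send to Combiner, Merge
distinct = len(combined)
# distinct == 700
```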
