Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.



Published on

Druid from POC 2 Production

Published in: Data & Analytics
  • Be the first to comment


  1. 1. Druid in Production Dori Waldman - Big Data Lead Guy Shemer - Big Data Expert Alon Edelman - Big Data consultant
  2. 2. ● Druid ○ Demo ○ What is druid and Why you need it ○ Other solutions ... ● Production pains and how it works : ○ cardinality ○ cache ○ dimension types (list/regular/map) ○ segment size ○ partition ○ monitoring and analyze hotspot ○ Query examples ○ lookups ● Interface ○ Pivot, Facet , superset ... ○ Druid-Sql (JDBC) ○ Rest Agenda
  3. 3. Demo
  4. 4. Why ? ● fast (real-time) analytics on large time series data ○ MapReduce / Spark - are not design for real time queries. ○ MPP expensive / slow ● Just send raw data to druid → mention which attributes are the dimensions, metrics and how to aggregate the metrics → Druid will create cube (datasource) ○ Relational does not scale , we need fast queries on large data. ○ Key-value tables require table per predefined query , and we need dynamic queries (cube) ● We want to answer questions like: ○ #edits on the page Justin Bieber from males in San Francisco? ○ average #characters , added by people from Calgary over the last month? ○ arbitrary combination of dimensions to return with subsecond latencies.
  5. 5. Row value can be dimension (~where in sql) or metric (measure) ● Dimensions are fields that can be filtered on or grouped by. ● Metrics are fields that can be aggregated. They are often stored as numbers but can also be stored as HyperLogLog sketches (approximate).. For example If Click is a dimension we can select this dimension and see how the data is splitted according to the selected value (might be better to convert as categories 0-20) If Click is a metric it will be a counter result like for how many clicks we have in Israel Dimension / Metric Country ApplicationId Clicks Israel 2 18 Israel 3 22 USA 80 19
  6. 6. Other options ● Open source solution: ○ Pinot ( ○ clickHouse ( ○ Presto (
  7. 7. Druid Components
  8. 8. Components ● RealTime nodes - ingest and query event streams, events are immediately available to query, saved in cache and persist to global storage (s3/hdfs) “deepStorage” ● Historical nodes - load and serve the immutable blocks of data (segments) from deep storage, those are the main workers ● Broker nodes - query routers to historical and real-time nodes, communicate with ZK to understand where relevant segments are located ● Coordinator nodes - tell historical nodes to load new data, drop outdated data, replicate data, and Balance by move data
  9. 9. Components ● Overlord node - Manages task distribution to middle managers. responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers ● Middle manager node - Executes submitted tasks by forward slices of tasks to peons. In case druid runs in local mode, this part is redundant since overlord will also take this responsibility ● Peon - single task in a single JVM. Several peons may run on same node
  10. 10. Components (Stream) ● Tranquility - ○ Ingest from kafka/http/samza/flink … ○ Will be out of life ○ Connects to the ZooKeeper of the kafka cluster ○ Can connect to several clusters and read from several topics for the same Druid data source. ○ Can't handle events after window closes ● Kafka-Indexing-Service ○ ingest from kafka only (kafka 10+) ○ Connects directly to Kafka’s brokers ○ Is able to connect to one cluster and one topic for one druid data source ○ Indexer manage its tasks better, use checkpoint (~exactly once) ○ Can update events for old segments (no window) ○ can be spot instance (for other nodes its less recommended)
  11. 11. Extensions
  12. 12. Batch Ingestion
  13. 13. Batch Ingestion Druid support : JSON CSV AVRO PARQUET ORC Support: ● Values ● multiValue (array) - each item in the list will be explode to row ● maps (new feature)
  14. 14. Batch Ingestion Indexing Task Types ● index_hadoop (with EMR) ○ Hadoop-based batch ingestion. using Hadoop/EMR cluster to perform data processing and ingestion ● Index (No EMR) ○ For small amount of data , task execute within the indexing service without external hadoop resources
  15. 15. Batch Ingestion Input source for batch indexing ● local ○ For POC ● Static (S3/ HDFS etc..) ○ Ingesting from your raw data ○ Support also Parquet ○ Can be mapped dynamically to specific date ● Druid’s Deep Storage ○ Use segments from one datasource from deep storage and transform them to another datasource, clean dimensions, change granularity etc.. "inputSpec" : { "type" : "static", "paths" : "/MyDirectory/example/wikipedia_data.json" } "inputSpec": { "type": "static", "paths": "s3n://prod/raw/2018-01-01/00/, s3n://staging/raw/2018-01-01/00/", "filePattern": ".gz" } "inputSpec": { "type": "dataSource", "ingestionSpec": { "dataSource" : "Hourly", "intervals" : ["2017-11-06T00:00:00.000Z/2017-11-07T00:00:00.000Z"] }
  16. 16. Lookups
  17. 17. Lookups ● Purpose : replace dimensions values , for example replace “1” with “New York City” ● in case the mapping is 1:1 an optimization (“injective”:true) should be used, it will replace the value on the query result and not on the query input ● Lookups has no history (if value of 1 was “new york” and it was changed to “new your city” the old value will not appear in the query result. ● Very small lookups (count of keys on the order of a few dozen to a few hundred) can be passed at query time as a "map" lookup ● Usually you will use global cached lookups from DB / file / kafka
  18. 18. Queries
  19. 19. Query: TopN ● TopN ○ grouped by single dimension, sort (order) according to the metric (~ “group by” one dimension + order ) ○ TopNs are approximate in that each node will rank their top K results and only return those top K results to the broker ○ To get exact result use groupBy query and sort the results (better to avoid)
  20. 20. Query: TopN ● TopN Hell- in the Pivot Pivot use nested TopN’s (filter and topN per row) Try to reduce number of unnecessary topN queries
  21. 21. Query: GroupBy GroupBy ○ Grouped by multiple dimensions. ○ Unlike TopN, can use ‘having’ conditions over aggregated data. Druid vision is to replace timeseries and topN with groupBy advance query
  22. 22. Query: TimeSeries ● Timeseries ○ grouped by time dimension only (“no dimensions) ○ Timeseries query will generally be faster than groupBy as it taking advantage of the fact that segments are already sorted on time and does not need to use a hash table for merging.
  23. 23. Query: SQL ● Druid SQL ○ Translates SQL into native Druid queries on the query broker ■ using JSON over HTTP by posting to the endpoint /druid/v2/sql/ ■ SQL queries using the Avatica JDBC driver.
  24. 24. Query: TimeBoundary / MetaData ● Time boundary ○ Return the earliest and latest data points of a data set ● Segment metadata ○ Per-segment information: ■ dimensions cardinality, ■ min/max value in dimension ■ number of rows ● DataSource metadata ○ ...
  25. 25. Other Queries... ● Select / Scan / Search ○ select - supports pagination, all data is loaded to memory ○ scan - return result in streaming mode ○ search - returns dimension values which match a search criteria. The biggest difference between select and scan is that, scan query doesn't retain all rows in memory before rows can be returned to client.
  26. 26. Query Performance ● Query with metrics Metric calculation is done in real time per metric meaning doing sum of impression and later sum of impressions and sum of clicks will double the metric calculation time (think about list dimension...)
  27. 27. Druid in Fyber
  28. 28. 29PAGE // Druid Usage Hour Day
  29. 29. ● Index 5T row daily from 3 different resources (s3 / kafka) ● 40 dimensions, 10 metrics ● Datasource (table) should be updated every 3 hours ● Query latency ~10 second for query on one dimension , 3 month range ○ Some dimensions are list … ○ Some dimensions use lookups Requirements
  30. 30. Work in scale ● We started with 14 dimensions (no lists) → for 8 month druid answer all requirements ● We added 20 more dimensions (with list) → druid query time was slow ...
  31. 31. ● Hardware : ○ 30 nodes(i3.8xlarge), each node manage historical and middleManager service ○ 2 nodes (m4.2xlarge) , each node manage coordinator and overload services ○ 11 nodes (c4.2xlarge), each node manage tranquility service ○ 2 nodes (i3.8xlarge), each node manage broker service ■ (1 broker : 10 historical) ○ Memcached : 3 nodes (cache.r3.8xlarge), version: 1.4.34 Hardware
  32. 32. Data cleanup ● Cleanup reduce cardinality (replace it with dummy value) ● Its all about reducing number of rows in the datasource ○ Druid saves the data in columnar storage but in order to get better performance the cleanup process reduces #rows (although the query is on 3 columns it needs to read all items in the column)
  33. 33. Data cleanup ● The dimensions correlation is important. ○ lets say we have one dimension which is city with 2000 unique cities ■ Adding gender dimension will double #rows (assume in our row data we have both male/female per city) ■ Adding country (although we have 200 unique countries) will not impact the same (cartesian product) as there is a relation between city and county of 1:M. ● better to reduce non related dimensions like country and age
  34. 34. Data cleanup ○ Use timeseries query with “count” aggregation (~ count(*) in druid Sql) to measure your cleanup benefit ○ you can also use estimator with cardinality aggregation ○ if you want to estimate without doing cleanup you can use virtualColumns (filter out specific values) with byRow cardinality estimator
  35. 35. segments ● Shard size should be balanced between disk optimization 500M-1.5G and cpu optimization (core per segment during query), take list in this calculation … Shard minimum size should be 100M ● POC - convert list to bitwise vector
  36. 36. Partition ● Partition type ○ By default, druid partitions the data according to timestamp In addition you need to specify hashed/ single dimension partition ■ partition may result with unbalanced segments ■ The default of hashed partition using all dimensions ■ Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly sized data segments relative to single-dimension partitioning. ■ single-dimension partition may be preferred in context of multi tenancy use cases. ■ Might want to avoid default hashed in case of long tail "partitionsSpec": { "type": "hashed", "targetPartitionSize": "4500000" } "partitionsSpec": { "type": "dimension", "targetPartitionSize": "10000000", "partitionDimension": publisherId" } "partitionsSpec": { "type": "hashed", "numShards": "12", "partitionDimensions": ["publisherId"] }
  37. 37. Cache ● Cache : ○ hybrid ■ L1-caffein (local), ■ L2-memcached (global) when segment move between machines , caffein cache will be invalidate for those segments ○ warm Cache for popular queries (~300ms) ○ Cache is saved per segment and date , cache key contains the dimensions ,metrics , filters ○ TopN threshold are part of the key : 0-1000, 1001, 1002 … ○ Cache in the historical nodes not broker in order to merge less data in the broker side
  38. 38. Cache ● Cache : ○ Lookup has pollPeriod meaning if its set to 1 day then cache will be invalid (no eviction) every day even if lookup was not updated (tscolumn), since Imply 2.4.6 this issue should be fixed by setting injective=true in the lookup configuration meaning lookup is not part of the cache key anymore its a post-aggregation action in the brokers. ■ increase lookup pooling period + hard set injective=true in the query is workaround till 2.4.6 ○ rebuild segment (~new) cause the cache to be invalidate
  39. 39. Cache Monitoring All segments (507) are scanned not from the cache Broker logs
  40. 40. ● Production issue : ○ Cluster was slow ■ doing rebalance all the time ■ nodes disappear , no crash in the nodes ■ we found that during this time GC took long time , and in the log we saw ZK disconnect-connect ○ We increased ZK connection timeout ○ Solution was to decrease historical memory (reduce GC time)
  41. 41. Monitoring / Debug Fix hotspot by increase #segment to move till data is balanced Statsd emitter does not send all metrics , use another (clarity / kafka)
  42. 42. Druid Pattern ● Two data sources ○ small (less rows and dimensions) ○ Large (all data) , query with filter only
  43. 43. Extra ● Load rules are used to manage which data is available to druid, for example we can set it to save only last month data and drop old data every day ● Priority - druid support query by priority ● Avoid Javascript extension (post aggregation function)
  44. 44. THANK YOU