
Our journey with druid - from initial research to full production scale

Here at the Nielsen Marketing Cloud we use druid.io (http://druid.io/) as one of our main data stores, both for simple counts and for approximate count-distinct (DataSketches).

It’s been more than a year since we started using it, ingesting billions of events each day into multiple Druid clusters for different use-cases.

In this meet-up, we will share our journey, the challenges we faced, how we overcame them (at least most of them), and the steps we took to optimize the process around Druid and keep the solution cost-effective.

Before diving into Druid, we will briefly present our data pipeline architecture, starting from the front-end serving systems deployed in a number of geo-locations, through a centralized Kafka cluster in the cloud, and give some examples of the different processes that consume from Kafka and feed our different data sources.

  1. Our journey with Druid - from initial research to full production scale. Danny Ruchman + Itai Yaffe, Nielsen
  2. Introduction ● Danny Ruchman - Software Engineer and team manager, focused on big data processing solutions ● Itai Yaffe - Big Data Infrastructure Developer, dealing with Big Data challenges for the last 5 years
  3. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen 3 years ago ● A data company ● Machine learning models for insights ● Business decisions ● Targeting
  4. Nielsen Marketing Cloud - questions we try to answer ● How many users of a certain profile can we reach? (e.g. a campaign for fancy women's sneakers) ● How many hits for a specific web page in a date range?
  5-7. NMC high-level architecture (architecture diagram slides)
  8. The need ● Nielsen Marketing Cloud business question: ○ How many unique devices have we encountered: ■ over a given date range ■ for a given set of attributes (segments, regions, etc.)? ● I.e., find the number of distinct elements in a data stream which may contain repeated elements, in real time
  9. The need
  10. Possible solutions ● Naive - store everything ● Bit vector - store only 1 bit per device: ○ 10B devices - 1.25 GB/day ○ 10B devices * 80K attributes - 100 TB/day ● Approximate
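To sanity-check the bit-vector numbers above, here is a quick back-of-the-envelope calculation (the 10B devices and 80K attributes figures are taken from the slide):

```python
# Back-of-the-envelope check of the bit-vector storage estimates.
devices = 10_000_000_000        # 10B devices (from the slide)
attributes = 80_000             # 80K attributes (from the slide)

one_bit_per_device = devices / 8                  # bytes/day at 1 bit/device
print(one_bit_per_device / 1e9)                   # -> 1.25 GB/day

one_bit_per_device_attribute = devices * attributes / 8
print(one_bit_per_device_attribute / 1e12)        # -> 100.0 TB/day
```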
  11. Our journey ● Elasticsearch ○ Indexing data ■ 250 GB of daily data, 10 hours ■ Affects query time ○ Querying ■ Low concurrency ■ Scans on all the shards of the corresponding index
  12. What we tried ● Preprocessing ● Statistical algorithms (e.g. HyperLogLog)
  13. ThetaSketch ● K Minimum Values (KMV) ● Estimates set cardinality ● Supports set-theoretic operations (e.g. union, intersection) ● The ThetaSketch mathematical framework - a generalization of KMV
  14. KMV intuition
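The intuition behind KMV: if you hash every element to a uniformly distributed value in [0, 1) and keep only the k smallest hashes, the k-th smallest value m lets you estimate the number of distinct elements as (k - 1) / m. A minimal, illustrative Python sketch (our own code, not the DataSketches implementation):

```python
import hashlib
import heapq

def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items via K Minimum Values."""
    heap = []     # max-heap (negated values) holding the k smallest hashes
    kept = set()  # hash values currently kept, for de-duplication
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        x = h / 2**64          # map the hash to a uniform value in [0, 1)
        if x in kept:
            continue           # this hash value is already tracked
        if len(heap) < k:
            heapq.heappush(heap, -x)
            kept.add(x)
        elif x < -heap[0]:     # smaller than the current k-th minimum
            kept.discard(-heapq.heappushpop(heap, -x))  # evict largest kept
            kept.add(x)
    if len(heap) < k:
        return len(heap)       # fewer than k distinct items: count is exact
    return (k - 1) / -heap[0]  # k-th smallest hash -> cardinality estimate

# ~100,000 distinct devices, estimated within a few percent for k=1024
print(kmv_estimate((f"device-{i}" for i in range(100_000)), k=1024))
```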
  15. ThetaSketch error (relative error by k value and number of standard deviations; 1 std dev = 68.27% confidence, 2 std devs = 95.45% confidence) ● k = 16,384: 0.78% / 1.56% ● k = 32,768: 0.55% / 1.10% ● k = 65,536: 0.39% / 0.78%
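The error column is just the theta sketch's relative standard error, roughly 1/sqrt(k); a quick check that reproduces the table:

```python
import math

# Relative standard error of a theta sketch is ~1/sqrt(k);
# two standard deviations double it (95.45% confidence).
for k in (16_384, 32_768, 65_536):
    rse = 1 / math.sqrt(k)
    print(f"k={k}: {rse:.2%} (1 std dev), {2 * rse:.2%} (2 std devs)")
# k=16384: 0.78% / 1.56%; k=32768: 0.55% / 1.10%; k=65536: 0.39% / 0.78%
```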
  16. “Very fast highly scalable columnar data-store” - DRUID
  17. Powered by Druid
  18. Why is it cool? ● Store trillions of events, petabytes of data ● Sub-second analytic queries ● Highly scalable ● Cost effective
  19. Roll-up - Simple Count (LongSumAggregator)
      Raw events (Timestamp, Attribute, Device ID):
        2016-11-15, 11111, 3a4c1f2d84a5c179435c1fea86e6ae02
        2016-11-15, 11111, 3a4c1f2d84a5c179435c1fea86e6ae02
        2016-11-15, 11111, 5dd59f9bd068f802a7c6dd832bf60d02
        2016-11-15, 22222, 5dd59f9bd068f802a7c6dd832bf60d02
        2016-11-15, 33333, 5dd59f9bd068f802a7c6dd832bf60d02
      Rolled up (Timestamp, Attribute, Simple Count):
        2016-11-15, 11111, 3
        2016-11-15, 22222, 1
        2016-11-15, 33333, 1
  20. Roll-up - Count Distinct (ThetaSketchAggregator)
      Raw events (Timestamp, Attribute, Device ID): the same five rows as above
      Rolled up (Timestamp, Attribute, Count Distinct):
        2016-11-15, 11111, 2
        2016-11-15, 22222, 1
        2016-11-15, 33333, 1
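To make the roll-up concrete, here is a hedged sketch of the relevant fragment of a Druid ingestion spec, written as a Python dict (metric and dimension names such as unique_devices and device_id are our placeholders, not the actual NMC schema; the thetaSketch aggregator requires the druid-datasketches extension):

```python
# Fragment of a Druid batch-ingestion spec illustrating roll-up with
# a row count and a theta sketch; names are illustrative placeholders.
metrics_spec = [
    # Rows ingested per rolled-up row; summed with longSum at query time
    {"type": "count", "name": "count"},
    # Distinct devices, stored as a mergeable theta sketch
    {"type": "thetaSketch", "name": "unique_devices", "fieldName": "device_id"},
]
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "DAY",   # roll all events up to daily rows
}
```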
  21. Druid architecture
  22. How do we use Druid
  23. Query performance benchmark (chart: average response time vs. number of concurrent queries, Druid vs. Elasticsearch)
  24. Guidelines and pitfalls ● Setup is not easy
  25. Guidelines and pitfalls ● Monitoring your system
  26. Guidelines and pitfalls ● Monitoring your system - important metrics (incomplete list): ○ Broker query time ○ Historical query time ○ Historical query wait time ○ Pending segments ○ Broker query TTFB ○ ...
  27. Guidelines and pitfalls ● Monitoring your system
  28. Guidelines and pitfalls ● Data modeling ○ Reduce the number of intersections ○ Different datasources for different use cases, e.g. one datasource with (Timestamp, Attribute, Count Distinct) and another with (Timestamp, Attribute, Region, Count Distinct) - sample row: Porsche Intent, US, XXXXXX
  29. Guidelines and pitfalls ● Query optimization ○ Combine multiple queries into a single query ○ Use filters ○ Use the groupBy v2 engine (default since 0.10.0) ○ Use timeseries rather than groupBy queries (where applicable)
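As an illustration of the last point, a hedged example of a timeseries query (datasource, dimension, and host names are placeholders), POSTed to the broker's /druid/v2 endpoint; unlike groupBy, timeseries only buckets by time, so it skips the per-dimension grouping work:

```python
import json
import requests

# Illustrative timeseries query: distinct devices per day for one attribute.
query = {
    "queryType": "timeseries",
    "dataSource": "devices",                     # placeholder datasource
    "granularity": "day",
    "intervals": ["2016-11-15/2016-11-22"],
    "filter": {"type": "selector", "dimension": "attribute", "value": "11111"},
    "aggregations": [
        {"type": "thetaSketch", "name": "unique_devices",
         "fieldName": "unique_devices"}
    ],
}
resp = requests.post("http://broker-host:8082/druid/v2",   # placeholder host
                     data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
print(resp.json())
```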
  30. Guidelines and pitfalls ● Batch ingestion ○ EMR tuning ■ 140-node cluster ● 85% spot instances => ~80% cost reduction ○ Druid input file format - Parquet vs CSV ■ Reduced indexing time by 4x ■ Reduced used storage by 10x ○ Concurrent ingestion tasks (one per EMR cluster and datasource) ■ Set worker select strategy to fillCapacityWithAffinity
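For the fillCapacityWithAffinity point, a hedged sketch of what setting the overlord's dynamic worker configuration might look like (host and datasource names are placeholders; check the Druid docs for your version for the exact payload):

```python
import json
import requests

# Route ingestion tasks for a given datasource to specific middle managers.
worker_config = {
    "selectStrategy": {
        "type": "fillCapacityWithAffinity",
        "affinityConfig": {
            "affinity": {
                "devices": ["middlemanager1.example.com:8091"]  # placeholder
            }
        },
    }
}
requests.post("http://overlord-host:8090/druid/indexer/v1/worker",  # placeholder
              data=json.dumps(worker_config),
              headers={"Content-Type": "application/json"})
```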
  31. Guidelines and pitfalls ● Batch ingestion (WIP) ○ Action - pre-aggregating the data in a Spark Streaming app ■ Aggregating the data by key ● groupBy().agg() for simple counts ● combineByKey() for distinct counts (using the DataSketches packages; requires setting isInputThetaSketch=true on the ingestion task) ■ Increased micro-batch interval from 30 minutes to 1 hour ○ Result: ■ the number of output records is ~2,000x smaller, and the total size of the output files is less than 1% of the previous version's ■ 10x fewer nodes in the EMR cluster running the MapReduce ingestion job
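A minimal sketch of the pre-aggregation idea with combineByKey and theta sketches, assuming PySpark and the Apache DataSketches Python bindings (this is our illustration, not the actual NMC job, which the slide describes as a Spark Streaming app):

```python
from pyspark import SparkContext
from datasketches import update_theta_sketch, theta_union, compact_theta_sketch

LG_K = 14  # 2^14 = 16,384 nominal entries, ~0.78% error (see the table above)

def create_combiner(device_id):
    sk = update_theta_sketch(LG_K)
    sk.update(device_id)
    return sk.compact().serialize()       # bytes, so Spark can ship it

def merge_value(sk_bytes, device_id):
    sk = update_theta_sketch(LG_K)
    sk.update(device_id)
    union = theta_union(LG_K)
    union.update(compact_theta_sketch.deserialize(sk_bytes))
    union.update(sk.compact())
    return union.get_result().serialize()

def merge_combiners(a_bytes, b_bytes):
    union = theta_union(LG_K)
    union.update(compact_theta_sketch.deserialize(a_bytes))
    union.update(compact_theta_sketch.deserialize(b_bytes))
    return union.get_result().serialize()

sc = SparkContext(appName="pre-aggregation-sketch")
events = sc.parallelize([(("2016-11-15", "11111"), "device-a"),
                         (("2016-11-15", "11111"), "device-a"),
                         (("2016-11-15", "11111"), "device-b")])
# One serialized theta sketch per (timestamp, attribute) key; these rows are
# what gets ingested into Druid with isInputThetaSketch=true.
sketches = events.combineByKey(create_combiner, merge_value, merge_combiners)
print({k: compact_theta_sketch.deserialize(v).get_estimate()
       for k, v in sketches.collect()})   # -> {('2016-11-15', '11111'): 2.0}
```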
  32. Guidelines and pitfalls ● Community
  33. Future work ● Improving accuracy for small set <-> big set intersections ● Improving query performance ○ groupBy v2 ○ NVMe SSDs ○ Switching to the timeseries query type where applicable ○ Applying less granular aggregation where applicable (e.g. 1 month rather than 1 day) ● Upgrading Druid to 0.11.0 ● Exploring the option of tiering query processing nodes ○ Reporting vs interactive queries ○ Hot vs cold data ● Using the SQL interface (experimental) ● Using Lookups (experimental)
  34. What have we learned? ● Druid is a columnar, time series data-store ● Can store trillions of events and serve analytic queries in sub-second time ● Highly-scalable, cost-effective ● Widely used among Big Data companies ● Can be used for: ○ Distinct counts (via ThetaSketch) ○ Simple counts ● Setup is not easy ● Ingestion has little effect on query performance (due to deep storage usage) ● Provides very good visibility ● Improve query performance by carefully designing your data model and building your queries
  35. QUESTIONS? Join us - https://www.comeet.co/jobs/nielsen/33.000 ● Big Data Architect ● Java & Machine Learning Developer ● Junior Big Data Developer ● And more...
  36. THANK YOU! https://www.linkedin.com/in/danny-ruchman-70211a27/ https://www.linkedin.com/in/itaiy/
  37. Druid vs ES
      Druid: 10 TB/day of data, 4 hours/day of indexing, 15 GB/day of storage, 280-350 ms queries, $55K/month
      ES: 250 GB/day of data, 10 hours/day of indexing, 2.5 TB of storage (total), 500-6,000 ms queries, $80K/month
