Counting Unique Users in Real-Time: Here's a Challenge for You!

Finding the number of unique users out of 10 billion events per day is challenging. In this session, we describe how re-architecting our data infrastructure, relying on Druid and ThetaSketch, enables our customers to obtain these insights in real time.

To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) with real-time analytics tools to profile their target audiences. Specifically, we let them see the number of unique users who meet a given criterion.
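
As a rough illustration of what "unique users who meet a given criterion" can look like as a query, the sketch below builds a Druid native timeseries query that intersects two ThetaSketch aggregations. It is a minimal sketch under stated assumptions, not the query NMC actually runs: the function name, the datasource name, and the `attribute` and `device_sketch` column names are hypothetical, and the JSON field names follow the druid-datasketches extension as commonly documented, so check them against your Druid version.

```python
import json


def build_unique_users_query(datasource, interval, attribute_a, attribute_b):
    """Build a Druid native query estimating devices that carry BOTH attributes.

    All column names ("attribute", "device_sketch") are hypothetical.
    """

    def sketch_for(attribute, name):
        # A filtered aggregator restricts one ThetaSketch to a single attribute.
        return {
            "type": "filtered",
            "filter": {"type": "selector", "dimension": "attribute", "value": attribute},
            "aggregator": {"type": "thetaSketch", "name": name, "fieldName": "device_sketch"},
        }

    return {
        "queryType": "timeseries",  # timeseries rather than groupBy
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        # Top-level filter so only the relevant rows are scanned.
        "filter": {"type": "in", "dimension": "attribute", "values": [attribute_a, attribute_b]},
        # Both sketches are computed in one query instead of two.
        "aggregations": [sketch_for(attribute_a, "a_sketch"), sketch_for(attribute_b, "b_sketch")],
        "postAggregations": [
            {
                "type": "thetaSketchEstimate",
                "name": "unique_devices_in_both",
                "field": {
                    "type": "thetaSketchSetOp",
                    "name": "a_and_b",
                    "func": "INTERSECT",
                    "fields": [
                        {"type": "fieldAccess", "fieldName": "a_sketch"},
                        {"type": "fieldAccess", "fieldName": "b_sketch"},
                    ],
                },
            }
        ],
    }


if __name__ == "__main__":
    query = build_unique_users_query(
        "audience_sketches", "2016-11-15/2016-11-16", "11111", "22222"
    )
    print(json.dumps(query, indent=2))
```

The resulting JSON would be POSTed to a Druid broker; swapping `INTERSECT` for `UNION` or `NOT` gives the other set operations.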

Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling and stability issues.

In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.

We will also provide guidelines and best practices with regard to Druid.

Topics include:
* The need and possible solutions
* Intro to Druid and ThetaSketch
* How we use Druid
* Guidelines and pitfalls
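
To make the ThetaSketch topic more concrete, here is a toy K Minimum Values (KMV) sketch in plain Python. It is a minimal illustration of the idea only, not the DataSketches library or Druid's implementation: `KMVSketch`, `hash01`, `union` and `intersect` are hypothetical names, and the hash and the estimator are simple choices made for this example.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps at most k of the smallest hash values seen, all below theta."""

    def __init__(self, k: int = 4096):
        self.k = k
        self.theta = 1.0      # sampling threshold; shrinks once the sketch is full
        self.hashes = set()   # retained hash values
        self._heap = []       # negated hashes, so the largest pops first

    def add(self, item: str) -> None:
        h = hash01(item)
        if h >= self.theta or h in self.hashes:
            return  # above the threshold, or a duplicate
        self.hashes.add(h)
        heapq.heappush(self._heap, -h)
        if len(self.hashes) > self.k:
            largest = -heapq.heappop(self._heap)
            self.hashes.discard(largest)
            self.theta = largest  # the (k+1)-th smallest value becomes theta

    def estimate(self) -> float:
        # Exact while theta == 1.0, scaled up by 1/theta afterwards.
        return len(self.hashes) / self.theta

    @classmethod
    def _from_hashes(cls, k, theta, hashes):
        ordered = sorted(h for h in hashes if h < theta)
        if len(ordered) > k:
            theta, ordered = ordered[k], ordered[:k]
        out = cls(k)
        out.theta = theta
        out.hashes = set(ordered)
        out._heap = [-h for h in ordered]
        heapq.heapify(out._heap)
        return out


def union(a: KMVSketch, b: KMVSketch) -> KMVSketch:
    theta = min(a.theta, b.theta)
    return KMVSketch._from_hashes(min(a.k, b.k), theta, a.hashes | b.hashes)


def intersect(a: KMVSketch, b: KMVSketch) -> KMVSketch:
    theta = min(a.theta, b.theta)
    return KMVSketch._from_hashes(min(a.k, b.k), theta, a.hashes & b.hashes)


if __name__ == "__main__":
    x, y = KMVSketch(), KMVSketch()
    for i in range(60_000):
        x.add(f"user-{i}")
    for i in range(40_000, 100_000):
        y.add(f"user-{i}")
    print("union estimate:", round(union(x, y).estimate()))          # true value: 100,000
    print("intersection estimate:", round(intersect(x, y).estimate()))  # true value: 20,000
```

Because two sketches can be combined without revisiting the raw events, the estimates for unions and intersections can be computed at query time. The intersection retains fewer hash values than either input, which is one way to see why small-set versus large-set intersections are less accurate.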

Counting Unique Users in Real-Time: Here's a Challenge for You!

  1. Counting Unique Users in Real-Time: Here’s a Challenge for You! Yakir Buskilla & Itai Yaffe, Nielsen
  2. Introduction
     Yakir Buskilla
     ● VP R&D
     ● Focused on big data processing and machine learning solutions
     Itai Yaffe
     ● Tech Lead, Big Data group
     ● Dealing with Big Data challenges since 2012
  3. Nielsen Marketing Cloud (NMC)
     ● eXelate was acquired by Nielsen in March 2015
     ● A Data company
     ● Machine learning models for insights
     ● Business decisions
     ● Targeting
  4. Nielsen Marketing Cloud - high-level architecture
  5. Nielsen Marketing Cloud - questions we try to answer
     1. How many unique users of a certain profile can we reach? E.g. a campaign for young women who love tech
     2. How many impressions did a campaign receive?
  6. Audience Building Example
  7. The need for Count Distinct
     ● Nielsen Marketing Cloud business question
       ○ How many unique devices have we encountered:
         ■ over a given date range
         ■ for a given set of attributes (segments, regions, etc.)
     ● Find, in real time, the number of distinct elements in a data stream which may contain repeated elements
  8. Possible solutions for Count Distinct
     ● Naive: store everything
     ● Bit Vector: store only 1 bit per device
       ○ 10B devices: 1.25 GB/day
       ○ 10B devices × 80K attributes: 100 TB/day
     ● Approx.: approximate
  9. Our journey
     ● Elasticsearch
       ○ Indexing data
         ■ 250 GB of daily data, 10 hours
         ■ Affects query time
       ○ Querying
         ■ Low concurrency
         ■ Scans on all the shards of the corresponding index
  10. Query performance - Elasticsearch
  11. What we tried
      ● Preprocessing
      ● Statistical algorithms (e.g. HyperLogLog)
  12. ThetaSketch
      ● K Minimum Values (KMV)
      ● Estimate set cardinality
      ● Supports set-theoretic operations
      ● ThetaSketch mathematical framework - generalization of KMV
      [Figure: sets X and Y]
  13. KMV intuition
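  14. ThetaSketch error - error as a function of K
      ● K = 16,384: 0.78% at 1 std dev (68.27% confidence interval), 1.56% at 2 std dev (95.45% confidence interval)
      ● K = 32,768: 0.55% at 1 std dev, 1.10% at 2 std dev
      ● K = 65,536: 0.39% at 1 std dev, 0.78% at 2 std dev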
  15. DRUID: “Very fast highly scalable columnar data-store”
  16. Powered by Druid
  17. Why is it cool?
      ● Store trillions of events, petabytes of data
      ● Sub-second analytic queries
      ● Highly scalable
      ● Cost effective
      ● Decoupled architecture
        ○ E.g. ingestion is separated from query
  18. Roll-up - Simple Count (LongSumAggregator)
      Raw events (Timestamp, Attribute, Device ID):
      ● 2016-11-15, 11111, 3a4c1f2d84a5c179435c1fea86e6ae02
      ● 2016-11-15, 11111, 3a4c1f2d84a5c179435c1fea86e6ae02
      ● 2016-11-15, 11111, 5dd59f9bd068f802a7c6dd832bf60d02
      ● 2016-11-15, 22222, 5dd59f9bd068f802a7c6dd832bf60d02
      ● 2016-11-15, 33333, 5dd59f9bd068f802a7c6dd832bf60d02
      Rolled-up rows (Timestamp, Attribute, Simple Count):
      ● 2016-11-15, 11111, 3
      ● 2016-11-15, 22222, 1
      ● 2016-11-15, 33333, 1
  19. Roll-up - Count Distinct (ThetaSketchAggregator)
      Raw events (Timestamp, Attribute, Device ID):
      ● 2016-11-15, 11111, 3a4c1f2d84a5c179435c1fea86e6ae02
      ● 2016-11-15, 11111, 3a4c1f2d84a5c179435c1fea86e6ae02
      ● 2016-11-15, 11111, 5dd59f9bd068f802a7c6dd832bf60d02
      ● 2016-11-15, 22222, 5dd59f9bd068f802a7c6dd832bf60d02
      ● 2016-11-15, 33333, 5dd59f9bd068f802a7c6dd832bf60d02
      Rolled-up rows (Timestamp, Attribute, Count Distinct*):
      ● 2016-11-15, 11111, 2*
      ● 2016-11-15, 22222, 1*
      ● 2016-11-15, 33333, 1*
      * What is actually stored is a ThetaSketch object. The actual result is calculated in real-time, which allows us to do UNIONs and INTERSECTs
  20. Druid architecture
  21. How do we use Druid
  22. Query performance benchmark
  23. Guidelines and pitfalls
      ● Setup is not easy
      ● Deployment is use-case dependent, e.g.:
        ○ Deep storage - S3
        ○ No. of datasources - <10 (all are ThetaSketch)
        ○ Data size on cluster - >30TB
        ○ Broker nodes - 3 X r4.8xlarge (32 cores, 244GB RAM each)
        ○ Historical nodes - 17 X i3.8xlarge (32 cores, 244GB RAM each, NVMe SSD)
  24. Guidelines and pitfalls
      ● Monitoring your system
  25. Guidelines and pitfalls
      ● Data modeling
        ○ Reduce the number of intersections
        ○ Different datasources for different use cases
      Example (two datasource layouts; rows dated 2016-11-15, sample values US and Porsche Intent, counts shown as XXXXXX):
      ● Timestamp, Attribute, Count Distinct
      ● Timestamp, Attribute, Region, Count Distinct
  26. Guidelines and pitfalls
      ● Query optimization
        ○ Combine multiple queries into a single query
        ○ Use filters
        ○ Use timeseries rather than groupBy queries (where applicable)
        ○ Use groupBy v2 engine (default since 0.10.0)
  27. Guidelines and pitfalls
      ● Batch Ingestion
        ○ EMR Tuning
          ■ 140-node cluster
            ● 85% spot instances => ~80% cost reduction
        ○ Druid input file format - Parquet vs CSV
          ■ Reduced indexing time by 4X
          ■ Reduced used storage by 10X
  28. Guidelines and pitfalls
      ● Batch Ingestion
        ○ Action - pre-aggregating the data in Spark app
          ■ Aggregating data by key
            ● groupBy() - for simple counts
            ● combineByKey() - for distinct count (using the DataSketches packages)
          ■ Decreasing execution frequency
            ● E.g. every 1 hour (rather than every 30 minutes)
        ○ Result:
          ■ # of output records is ~2000X smaller and total size of output files is less than 1%, compared to the previous version
          ■ 10X fewer nodes in the EMR cluster running the MapReduce ingestion job
          ■ Another 80% cost reduction, $2.64M/year -> $0.47M/year!
  29. Guidelines and pitfalls
      ● Community
  30. Future work
      ● Research ways to improve accuracy for small set <-> large set intersections
      ● Further improve query performance
      ● Explore option of tiering of query processing nodes
        ○ Reporting vs interactive queries
        ○ Hot vs cold data
      ● Version upgrades
  31. What have we learned?
      ● Answering Count Distinct queries in real-time is a challenge!
        ○ Approximation algorithms FTW!
      ● Druid provides a concrete implementation of the ThetaSketch mathematical framework
        ○ A columnar, time series data-store
        ○ Can store trillions of events and serve analytic queries in sub-second
        ○ Highly-scalable, cost-effective and widely used among Big Data companies
        ○ Can be used for:
          ■ Distinct count (via ThetaSketch)
          ■ Simple counts
      ● Words of wisdom:
        ○ Setup is not easy, using online resources (documentation, community) can help
        ○ Ingestion has little effect on query performance (deep storage usage)
        ○ Provides very good visibility
        ○ Improve query performance by carefully designing your data model and building your queries
  32. Want to know more?
      ● Women in Big Data
        ○ A world-wide program that aims:
          ■ To inspire, connect, grow, and champion success of women in Big Data.
          ■ To grow women representation in Big Data field > 25% by 2020
        ○ Visit the website (https://www.womeninbigdata.org/) and join the Women in Big Data Luncheon today (12:30PM, http://tinyurl.com/y2mycox4)!
      ● Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
        ○ Tomorrow, 2:50 PM - 3:30 PM Room 127-128, http://tinyurl.com/y5vfmq5p
      ● NMC Tech Blog - https://medium.com/nmc-techblog
  33. QUESTIONS
  34. THANK YOU
      https://www.linkedin.com/in/yakirbuskilla/
      https://www.linkedin.com/in/itaiy/
  35. Druid vs ES
      ● DRUID: 10TB/day, 4 Hours/day, 15GB/day, 280ms-350ms, $55K/month
      ● ES: 250GB/day, 10 Hours/day, 2.5TB (total), 500ms-6000ms, $80K/month
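
Several figures quoted in the slides (the bit-vector sizes on slide 8, the error table on slide 14, and the cost reduction on slide 28) can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not the authors' own calculation: the function names are hypothetical, decimal units (1 GB = 10^9 bytes) are assumed, and the 1/sqrt(K) formula is an approximation for the relative standard error of a KMV-style sketch that happens to reproduce the slide's numbers to within rounding.

```python
import math

DEVICES = 10_000_000_000   # 10B devices (slide 8)
ATTRIBUTES = 80_000        # 80K attributes (slide 8)


def bit_vector_bytes(devices: int, attributes: int = 1) -> float:
    """Size of a bit vector holding 1 bit per device per attribute."""
    return devices * attributes / 8


def relative_standard_error(k: int) -> float:
    """Approximate relative standard error for a sketch of nominal size k (assumed formula)."""
    return 1 / math.sqrt(k)


def cost_reduction(before: float, after: float) -> float:
    """Fractional saving when cost drops from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    print(f"1 attribute:    {bit_vector_bytes(DEVICES) / 1e9} GB/day")
    print(f"80K attributes: {bit_vector_bytes(DEVICES, ATTRIBUTES) / 1e12} TB/day")

    for k in (16_384, 32_768, 65_536):
        rse = relative_standard_error(k)
        # One and two standard deviations, as in the slide's table.
        print(f"K={k}: 1 std dev = {rse * 100:.4f}%, 2 std dev = {2 * rse * 100:.4f}%")

    # $2.64M/year -> $0.47M/year (slide 28)
    print(f"cost reduction: {cost_reduction(2.64, 0.47):.1%}")
```

Doubling K shrinks the error only by a factor of about 1.41 while doubling the sketch's memory, which is the trade-off to weigh when picking a sketch size.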