Counting Unique Users in Real-Time: Here's a Challenge for You!

Finding the number of unique users out of 10 billion events per day is challenging. In this session, we're going to describe how re-architecting our data infrastructure around Druid and ThetaSketch enables our customers to obtain these insights in real time.

To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) with real-time analytics tools to profile their target audiences. Specifically, we give them the ability to see the number of unique users who meet a given criterion.

Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling and stability issues.

In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid with ThetaSketch, to overcome the limitations we were facing.

We will also provide guidelines and best practices for Druid.

Topics include:
* The need and possible solutions
* Intro to Druid and ThetaSketch
* How we use Druid
* Guidelines and pitfalls

Counting Unique Users in Real-Time: Here's a Challenge for You!

  1. Counting Unique Users in Real-Time: Here’s a Challenge for You! Yakir Buskilla & Itai Yaffe, Nielsen
  2. Introduction ● Yakir Buskilla - VP R&D, focused on big data processing and machine learning solutions ● Itai Yaffe - Tech Lead, Big Data group, dealing with Big Data challenges since 2012
  3. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen in March 2015 ● A data company ● Machine learning models for insights ● Business decisions ● Targeting
  4. Nielsen Marketing Cloud - high-level architecture
  5. Nielsen Marketing Cloud - questions we try to answer 1. How many unique users of a certain profile can we reach? E.g. a campaign for young women who love tech 2. How many impressions did a campaign receive?
  6. Audience Building Example
  7. The need for Count Distinct ● Nielsen Marketing Cloud business question ○ How many unique devices have we encountered: ■ over a given date range ■ for a given set of attributes (segments, regions, etc.) ● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements
  8. Possible solutions for Count Distinct ● Naive - store everything ● Bit vector - store only 1 bit per device ○ 10B devices - 1.25 GB/day ○ 10B devices * 80K attributes - 100 TB/day ● Approximate
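The bit-vector sizes on slide 8 follow from straightforward arithmetic; here is a quick back-of-the-envelope check in Python, using only the figures from the slide:

```python
# One bit per device per day, for 10 billion devices.
devices = 10_000_000_000
bytes_per_day = devices / 8              # 1 bit per device
print(bytes_per_day / 1e9)               # ~1.25 GB/day for a single attribute

# Tracking ~80,000 attributes multiplies that per-attribute cost.
attributes = 80_000
total_bytes = bytes_per_day * attributes
print(total_bytes / 1e12)                # ~100 TB/day
```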
  9. Our journey ● Elasticsearch ○ Indexing data ■ 250 GB of daily data, 10 hours ■ Affects query time ○ Querying ■ Low concurrency ■ Scans on all the shards of the corresponding index
  10. Query performance - Elasticsearch
  11. What we tried ● Preprocessing ● Statistical algorithms (e.g. HyperLogLog)
  12. ThetaSketch ● K Minimum Values (KMV) ● Estimates set cardinality ● Supports set-theoretic operations (unions, intersections) ● The ThetaSketch mathematical framework is a generalization of KMV
  13. KMV intuition
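To make the KMV intuition concrete, here is a minimal, self-contained Python illustration (not the DataSketches implementation): hash every element to a pseudo-random point in [0, 1), keep only the K smallest hashes, and estimate the cardinality from the K-th smallest value.

```python
import hashlib

def kmv_estimate(stream, k=1024):
    """Estimate the number of distinct items in `stream` with K Minimum Values."""
    min_hashes = set()
    for item in stream:
        # Map each item to a pseudo-random point in [0, 1); duplicates map to the same point.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big") / 2**64
        min_hashes.add(h)
        if len(min_hashes) > k:
            min_hashes.discard(max(min_hashes))   # keep only the k smallest hashes
    if len(min_hashes) < k:
        return float(len(min_hashes))             # fewer than k distinct items: exact
    theta = max(min_hashes)                       # the k-th smallest hash value
    return (k - 1) / theta                        # classic KMV estimator

# Duplicates do not inflate the estimate: the same device ID always hashes
# to the same point, so it is only retained once.
events = [f"device-{i % 50_000}" for i in range(1_000_000)]
print(kmv_estimate(events))                       # ~50,000, typically within a few percent
```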
  14. ThetaSketch error - error as a function of K: K = 16,384 → 0.78% (1 std dev, 68.27% confidence) / 1.56% (2 std dev, 95.45% confidence); K = 32,768 → 0.55% / 1.10%; K = 65,536 → 0.39% / 0.78%
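The figures on slide 14 are consistent with the relative standard error of a KMV/theta sketch being roughly 1/sqrt(K); a quick check:

```python
import math

# The relative standard error of a KMV/theta sketch is approximately 1 / sqrt(K);
# the 2-standard-deviation bound is roughly twice that.
for k in (16_384, 32_768, 65_536):
    rse = 1 / math.sqrt(k)
    print(f"K={k}: 1 std dev ~ {rse:.2%}, 2 std dev ~ {2 * rse:.2%}")
# K=16384: 1 std dev ~ 0.78%, 2 std dev ~ 1.56%
# K=32768: 1 std dev ~ 0.55%, 2 std dev ~ 1.10%
# K=65536: 1 std dev ~ 0.39%, 2 std dev ~ 0.78%
```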
  15. Druid - “Very fast, highly scalable columnar data-store”
  16. Powered by Druid
  17. Why is it cool? ● Stores trillions of events, petabytes of data ● Sub-second analytic queries ● Highly scalable ● Cost effective ● Decoupled architecture ○ E.g. ingestion is separated from querying
  18. Roll-up - Simple Count (LongSumAggregator)
      Raw events (Timestamp | Attribute | Device ID):
      2016-11-15 | 11111 | 3a4c1f2d84a5c179435c1fea86e6ae02
      2016-11-15 | 11111 | 3a4c1f2d84a5c179435c1fea86e6ae02
      2016-11-15 | 11111 | 5dd59f9bd068f802a7c6dd832bf60d02
      2016-11-15 | 22222 | 5dd59f9bd068f802a7c6dd832bf60d02
      2016-11-15 | 33333 | 5dd59f9bd068f802a7c6dd832bf60d02
      Rolled up (Timestamp | Attribute | Simple Count):
      2016-11-15 | 11111 | 3
      2016-11-15 | 22222 | 1
      2016-11-15 | 33333 | 1
  19. Roll-up - Count Distinct (ThetaSketchAggregator)
      Raw events (Timestamp | Attribute | Device ID):
      2016-11-15 | 11111 | 3a4c1f2d84a5c179435c1fea86e6ae02
      2016-11-15 | 11111 | 3a4c1f2d84a5c179435c1fea86e6ae02
      2016-11-15 | 11111 | 5dd59f9bd068f802a7c6dd832bf60d02
      2016-11-15 | 22222 | 5dd59f9bd068f802a7c6dd832bf60d02
      2016-11-15 | 33333 | 5dd59f9bd068f802a7c6dd832bf60d02
      Rolled up (Timestamp | Attribute | Count Distinct*):
      2016-11-15 | 11111 | 2*
      2016-11-15 | 22222 | 1*
      2016-11-15 | 33333 | 1*
      * What is actually stored is a ThetaSketch object. The actual result is calculated in real time, which allows us to do UNIONs and INTERSECTs.
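In Druid, this roll-up behavior is configured at ingestion time through the aggregators in the metricsSpec and the granularitySpec. Below is a hedged sketch of the relevant fragments, written as Python literals; the column and metric names are illustrative assumptions, not the actual NMC spec.

```python
# Fragment of a Druid ingestion spec, shown as Python literals purely for illustration.
# Column and metric names are hypothetical, not NMC's actual schema.
metrics_spec = [
    # Simple count (slide 18): sum a numeric column at ingestion time.
    {"type": "longSum", "name": "events", "fieldName": "event_count"},
    # Count distinct (slide 19): each rolled-up row stores a ThetaSketch built from the
    # device ID column. If the input already carries pre-built sketches (see slide 28),
    # "isInputThetaSketch": True would be added as well.
    {"type": "thetaSketch", "name": "unique_devices", "fieldName": "device_id", "size": 16384},
]

granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "DAY",   # roll events up per attribute per day, as in the tables above
    "rollup": True,
}
```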
  20. Druid architecture
  21. How do we use Druid
  22. Query performance benchmark
  23. Guidelines and pitfalls ● Setup is not easy ● Deployment is use-case dependent, e.g.: ○ Deep storage - S3 ○ No. of datasources - <10 (all are ThetaSketch) ○ Data size on cluster - >30TB ○ Broker nodes - 3 x r4.8xlarge (32 cores, 244GB RAM each) ○ Historical nodes - 17 x i3.8xlarge (32 cores, 244GB RAM each, NVMe SSD)
  24. Guidelines and pitfalls ● Monitoring your system
  25. Guidelines and pitfalls ● Data modeling ○ Reduce the number of intersections ○ Different datasources for different use cases
      Old data model (Timestamp | Attribute | Count Distinct): 2016-11-15 | US | XXXXXX; 2016-11-15 | Porsche Intent | XXXXXX; ...
      New data model (Timestamp | Attribute | Region | Count Distinct): 2016-11-15 | Porsche Intent | US | XXXXXX; ...
  26. Guidelines and pitfalls ● Query optimization ○ Combine multiple queries into a single query ○ Use filters ○ Use timeseries rather than groupBy queries (where applicable) ○ Use the groupBy v2 engine (default since 0.10.0)
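To make the query side concrete, here is a minimal, hedged example of a filtered timeseries query with a ThetaSketch aggregator, posted to the Broker's default HTTP endpoint. The datasource, dimension and metric names (and the host name) are assumptions for illustration, not NMC's actual schema.

```python
import requests

# A minimal Druid "timeseries" query with a filter and a thetaSketch aggregator.
query = {
    "queryType": "timeseries",
    "dataSource": "devices_by_attribute",
    "granularity": "all",
    "intervals": ["2016-11-01/2016-12-01"],      # the date range to scan
    "filter": {"type": "selector", "dimension": "attribute", "value": "11111"},
    "aggregations": [
        # Merges the stored per-row sketches and returns the unique-device estimate.
        {"type": "thetaSketch", "name": "unique_devices", "fieldName": "unique_devices"}
    ],
}

# POST to the Broker (8082 is Druid's default Broker port; the host name is an assumption).
resp = requests.post("http://broker-host:8082/druid/v2", json=query)
print(resp.json())   # e.g. [{"timestamp": "...", "result": {"unique_devices": <estimate>}}]
```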
  27. Guidelines and pitfalls ● Batch Ingestion ○ EMR Tuning ■ 140-node cluster ● 85% spot instances => ~80% cost reduction ○ Druid input file format - Parquet vs CSV ■ Reduced indexing time by 4X ■ Reduced used storage by 10X
  28. Guidelines and pitfalls ● Batch Ingestion ○ Action - pre-aggregating the data in the Spark app ■ Aggregating data by key ● groupBy() - for simple counts ● combineByKey() - for distinct counts (using the DataSketches packages) ■ Decreasing execution frequency ● E.g. every 1 hour (rather than every 30 minutes) ○ Result: ■ # of output records is ~2,000X smaller and the total size of output files is less than 1%, compared to the previous version ■ 10X fewer nodes in the EMR cluster running the MapReduce ingestion job ■ Another 80% cost reduction, $2.64M/year -> $0.47M/year!
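To illustrate the distinct-count pre-aggregation, here is a minimal, hedged sketch of the per-key combiner logic, written with the Apache DataSketches Python bindings rather than the actual (JVM/Spark) NMC code; the key and field names are illustrative. In the real pipeline, equivalent functions are what Spark's combineByKey() would run.

```python
# A minimal sketch of the per-key pre-aggregation idea, using the Apache DataSketches
# Python bindings (pip install datasketches). The real pipeline runs equivalent logic
# inside Spark's combineByKey(); key and field names here are illustrative.
from datasketches import theta_union, update_theta_sketch

LG_K = 14   # nominal entries K = 2**14 = 16,384

def create_combiner(device_id):
    """Start a new sketch for a key from its first device ID."""
    sketch = update_theta_sketch(LG_K)
    sketch.update(device_id)
    return sketch

def merge_value(sketch, device_id):
    """Add another device ID to an existing per-key sketch."""
    sketch.update(device_id)
    return sketch

def merge_combiners(sketch_a, sketch_b):
    """Union two per-key sketches (what happens across partitions)."""
    union = theta_union(LG_K)
    union.update(sketch_a)
    union.update(sketch_b)
    return union.get_result()

# Tiny local demonstration: (attribute, device_id) pairs with duplicates.
events = [("11111", "dev-1"), ("11111", "dev-1"), ("11111", "dev-2"), ("22222", "dev-2")]
per_key = {}
for attribute, device_id in events:
    if attribute in per_key:
        merge_value(per_key[attribute], device_id)
    else:
        per_key[attribute] = create_combiner(device_id)

for attribute, sketch in per_key.items():
    print(attribute, round(sketch.get_estimate()))   # 11111 -> 2, 22222 -> 1
```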
  29. Guidelines and pitfalls ● Community
  30. Future work ● Research ways to improve accuracy for small set <-> large set intersections ● Further improve query performance ● Explore the option of tiering query processing nodes ○ Reporting vs interactive queries ○ Hot vs cold data ● Version upgrades
  31. What have we learned? ● Answering Count Distinct queries in real-time is a challenge! ○ Approximation algorithms FTW! ● Druid provides a concrete implementation of the ThetaSketch mathematical framework ○ A columnar, time-series data-store ○ Can store trillions of events and serve analytic queries with sub-second latency ○ Highly scalable, cost-effective and widely used among Big Data companies ○ Can be used for: ■ Distinct counts (via ThetaSketch) ■ Simple counts ● Words of wisdom: ○ Setup is not easy; using online resources (documentation, community) can help ○ Ingestion has little effect on query performance (deep storage usage) ○ Provides very good visibility ○ Improve query performance by carefully designing your data model and building your queries
  32. Want to know more? ● Women in Big Data ○ A worldwide program that aims: ■ To inspire, connect, grow, and champion the success of women in Big Data ■ To grow women's representation in the Big Data field to more than 25% by 2020 ○ Visit the website (https://www.womeninbigdata.org/) and join the Women in Big Data Luncheon today (12:30 PM, http://tinyurl.com/y2mycox4)! ● Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka ○ Tomorrow, 2:50 PM - 3:30 PM, Room 127-128, http://tinyurl.com/y5vfmq5p ● NMC Tech Blog - https://medium.com/nmc-techblog
  33. QUESTIONS
  34. THANK YOU https://www.linkedin.com/in/yakirbuskilla/ https://www.linkedin.com/in/itaiy/
  35. Druid vs ES
      Druid: 10 TB/day ingested | 4 hours/day indexing | 15 GB/day stored | 280 ms - 350 ms query latency | $55K/month
      ES: 250 GB/day ingested | 10 hours/day indexing | 2.5 TB (total) stored | 500 ms - 6,000 ms query latency | $80K/month

Editor's Notes

  • Thank you for coming to hear our story about the challenge of counting unique users in real-time
    We will try to make it interesting and valuable for you
  • Yakir, at NMC since August 2015
    Leading the R&D and managing the site in Israel
    Going to talk about:
    NMC - about us, high-level architecture
    The questions we try to answer
    Put you in the context of our Druid journey
    Itai, at NMC since May 2014
    Tech lead of the big data group
    One of the engineers who led the Druid research and implementation effort
    Will talk about Druid from idea to production, and will give super cool tips for beginners
    Questions - at the end of the session
  • Data company - which means that we get or buy data from our partners in various ways, online and offline
    We enrich the data - which in our case means generating attributes
    Attribute - something we assign to a device based on the data that we have, for example Sports Fan, Eats Organic Food, etc.
    The enriched data that we generate helps support our clients’ business decisions and also allows them to target the relevant audiences
    Nielsen Marketing Cloud, or NMC in short
    A group inside Nielsen
    Born from eXelate, a company that was acquired by Nielsen in March 2015
    Nielsen is a data company and so are we, and we had a strong business relationship until at some point they decided to go for it and acquired eXelate
    Data company meaning:
    Buying and onboarding data into NMC from data providers, customers and Nielsen data
    We have a huge, high-quality dataset
    Enrich the data using machine learning models in order to create more relevant, quality insights
    Categorize and sell according to a need
    Helping brands to make intelligent business decisions
    E.g. targeting in the digital marketing world
    Meaning, help fit ads to viewers
    For example, a street sign can fit a very small % of the people who see it, vs.
    online ads that can fit the profile of the individual who sees them
    More interesting to the user
    Better chance they will click the ad
    Better ROI for the marketer
  • A few words on NMC data pipeline architecture:
    Frontend layer:
    Receives all the online and offline data traffic
    Bare metal in different data centers (3 in US, 2 in EU, 3 in APAC)
    Near real time - high throughput/low latency challenges
    Backend layer
    AWS cloud based
    Processes all the frontend layer outputs
    ETLs - load data into data sources, aggregated and raw
    Applications layer
    Also in the cloud
    Variety of apps on top of all our data sources
    Web - NMC
    Data configurations (segments, audiences, etc.)
    Campaign analysis, campaign management tools, etc.
    visualized profile graphs
    reports
  • What are the questions we try to answer in NMC that help our customers make business decisions?
    There are a lot of questions, but these lead to what Druid is coming to solve
    Translating from human problem to technical problem:
    UU (distinct) count
    Simple count
  • Demo
  • Danny talked about the 2 main use-cases - counting unique users and counting hits (or “simple counts”). The first one is somewhat harder, so it is going to be the focus of my part of the presentation
    Past…
    Mention “cardinality” and “real-time dashboard”
    Explain the need to union and intersect
  • Who’s familiar with the count-distinct problem?
    For the first 2 solutions, we need to store data per device per attribute per day
    Bit vector - Elasticsearch /Redis is an example of such system
    Approximation has a certain error rate
  • Who is familiar with Elasticsearch?
    In ES, we stored the raw data, where every device was a document, and each such document contained all events for that device
    A screen in our SaaS application can generate up to thousands of queries
    We tried to introduce a new cluster dedicated to indexing only, and then use backup and restore to the second cluster
    This method was very expensive and only partially helpful
    Tuning for better performance also didn’t help much
    The story about the demo when we were at the bar (December 20th, 2016)...
  • Preprocessing - Too many combinations - The formula length is not bounded (show some numbers)
    HyperLogLog
    - Implementation in Elasticsearch was too slow (done at query time)
    - Set operations increase the error dramatically
  • ThetaSketch is based on KMV
    Explain what KMV is and the effect of the size of K
    Unions and Intersections increase the error
    The problematic case is intersection of very small set with very big set
  • The larger the K, the smaller the error
    However, a larger K means more memory & storage needed
    Demo - http://content.research.neustar.biz/blog/kmv.html
  • So we talked about statistical algorithms, which is nice, but we needed a practical solution…
    Supports the ThetaSketch algorithm out of the box (OOTB)
    Open source, written in Java (works for us, as we know Java…)
    Who’s familiar with Druid?
  • Just to give you a sense of where Druid is used in production...
    http://druid.io/druid-powered.html
  • I’ll try to cover all these reasons in the next slides
  • Timeseries database - first thing you need to know about Druid
    Column types :
    Timestamp
    Dimensions
    Metrics
    Together they comprise a Datasource
    Aggregation is done at ingestion time (the outcome is much smaller in size)
    At query time, it’s closer to a key-value search
  • We’re just using a different type of aggregator (the ThetaSketch aggregator) to get count distinct, but everything else is essentially the same.
    The one big difference is multiple ingestions of the same data :
    For ThetaSketch - not a problem, due to the fact it “samples” the data (chooses the K minimal values);
    Whereas for Sum - we’re going to get wrong numbers (e.g. 2X as big if we ingest the data twice)
    To mitigate it, we’ve added a simple meta-data store to prevent ingesting the same data twice (you can look at it as some kind of “checkpointing”)
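    To illustrate the point above, a tiny demonstration using the Apache DataSketches Python bindings (purely as an illustration of the property, not NMC's code): feeding the same device IDs into a theta sketch twice leaves the estimate unchanged, while a plain sum doubles.

```python
# Re-ingesting the same data: a theta sketch deduplicates by hash, a plain sum does not.
# Uses the Apache DataSketches Python bindings (pip install datasketches).
from datasketches import update_theta_sketch

device_ids = [f"dev-{i}" for i in range(1_000)]

sketch = update_theta_sketch(12)
simple_count = 0
for _ in range(2):                    # simulate ingesting the same batch twice
    for device_id in device_ids:
        sketch.update(device_id)      # duplicate hashes are simply ignored
        simple_count += 1             # a running sum happily double counts

print(round(sketch.get_estimate()))   # ~1000 - unaffected by the second pass
print(simple_count)                   # 2000 - twice the real number of devices
```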
  • We have 3 types of processes - ingestion, querying, management. All processes are decoupled and scalable
    Ingestion (real time - e.g. from Kafka; batch - talk about deep storage, how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion vs ES)
    Lambda architecture (for those who don’t know - it’s “a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.”)
  • Explain the tuple and what is happening during the aggregation
    Mention it says “ThetaSketchAggregator”, but again - we also use LongSumAggregator for simple counts
    We ingest a lot more data today than we did in ES (10B events/day in TBs of data vs 250GB in ES)
    I mentioned our meta-data store earlier - each component in the flow updates that meta-data store, to prevent ingestion of the same data twice
  • We can see that while ES response time is exponentially increasing, Druid response time is relatively stable
    Benchmark using :
    Druid Cluster : 1x Broker (r3.8xlarge) , 8x Historical (r3.8xlarge)
    Elasticsearch Cluster : 20 nodes (r3.8xlarge)
    This is how we use it, now switching to how we got there and the pains...
  • Setup is not easy
    Separate config/servers/tuning
    Caused the deployment to take a few months
    Use the Druid recommendation for Production configuration
  • Monitoring Your System
    Druid has built-in support for Graphite (it exports many metrics), and so does Spark. We also export metrics to Graphite from our ingestion tasks (written in Python) and from the NMC backend (aesreporter) to provide a complete, end-to-end view of the system.
    In this example, query time is very high due to a high number of pending segments (i.e. segments that are queued to be scanned in order to answer the query)
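    As an aside, pushing a metric to Graphite from a Python task is a one-liner over Graphite's plaintext protocol; a hedged sketch (the host and metric name are assumptions for illustration, not NMC's actual setup; 2003 is Graphite's default plaintext port).

```python
import socket
import time

def send_to_graphite(metric, value, host="graphite-host", port=2003):
    """Send one data point over Graphite's plaintext protocol: '<metric> <value> <timestamp>\\n'."""
    line = f"{metric} {value} {int(time.time())}\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode())

# e.g. report how many rows an ingestion task processed (metric name is illustrative)
send_to_graphite("nmc.ingestion.rows_processed", 123456)
```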
  • Data Modeling
    If using ThetaSketch - reduce the number of intersections (show a slide of the old and new data model). In this example, US is a very large set and Porsche intent is (probably) a small set. It didn’t solve all use-cases, but it gives you an idea of how you can approach the problem
    Different datasources - e.g. lower accuracy (i.e. lower K) for faster queries vs. higher accuracy with slightly slower queries
  • Combine multiple queries over the REST API (explain why?)
    There can be billions of rows, so filter the data as part of the query
    Switching from groupBy to timeseries queries seems to have solved the “io.druid.java.util.common.IAE: Not enough capacity for even one row! Need[1,509,995,528] but have[0].” error we had
    groupBy v2 offers better performance and memory management (e.g. it generates per-segment results using a fully off-heap map)
  • EMR tuning (spot instances - 80% cost reduction, but it comes with the risk of being outbid and losing nodes; Druid MR prod config)
    Use Parquet
    Affinity - use fillCapacityWithAffinity to ingest data from multiple EMR clusters to the same Druid cluster (but different datasources) concurrently, see http://druid.io/docs/latest/configuration/indexing-service.html#affinity
  • Why? Ingestion still takes a lot of time and resources
    There was almost no “penalty” on the Spark Streaming app (with the new version of the app)
    For count distinct, we use the DataSketches packages plus combineByKey(). This requires setting isInputThetaSketch=true on the ingestion task
    Decreasing the execution frequency (e.g. every 1 hour instead of every 30 minutes) allows a more significant aggregation ratio
  • Ingestion has little effect on query + sub-second response for even 100s or 1000s of concurrent queries
    With Druid and ThetaSketch, we’ve improved our ingestion volume, query performance and concurrency by an order of magnitude, at a lower cost, compared to our old solution
    (We’ve achieved a more performant, scalable, cost-effective solution)
    Nice comparison of open-source OLAP systems for Big Data here - https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
  • Ingestion has little effect on query + sub-second response for even 100s or 1000s of concurrent queries
    Cost is for the entire solution (Druid cluster, EMR, etc.)
    With Druid and ThetaSketch, we’ve improved our ingestion volume, query performance and concurrency by an order of magnitude, at a lower cost, compared to our old solution
    (We’ve achieved a more performant, scalable, cost-effective solution)
