Counting Unique Users in Real-Time: Here's a Challenge for You!

DataWorks Summit
Apr. 1, 2019

Editor's Notes

  1. Thank you for coming to hear our story about the challenge of counting unique users in real-time. We will try to make it interesting and valuable for you.
  2. Yakir: at NMC since August 2015, leading the R&D and managing the site in Israel. Going to talk about NMC - who we are, the high-level architecture, and the questions we try to answer - to put you in the context of our path to Druid. Itai: at NMC since May 2014, tech lead of the big data group and one of the engineers who led the Druid research and implementation effort. Will talk about Druid from idea to production and give some useful tips for beginners. Questions - at the end of the session.
  3. Nielsen Marketing Cloud, or NMC for short, is a group inside Nielsen, born from the eXelate company that was acquired by Nielsen in March 2015. Nielsen is a data company and so are we; we had a strong business relationship until, at some point, they decided to go for it and acquired eXelate. Being a data company means we get or buy data from our partners in various ways, online and offline - onboarding data into NMC from data providers, customers, and Nielsen itself, which gives us a huge, high-quality dataset. We enrich that data by generating attributes - an attribute is something we assign to a device based on the data we have, for example "Sports Fan" or "Eats Organic Food" - using machine learning models to create more relevant, higher-quality insights, then categorize and sell according to need. The enriched data supports our clients' business decisions and lets them target the relevant audiences. For example, in digital marketing, a street sign fits only a very small percentage of the people who see it, whereas an online ad can fit the profile of the individual who sees it: more interesting to the user, more likely to be clicked, better ROI for the marketer.
  4. A few words on the NMC data pipeline architecture. Frontend layer: receives all the online and offline data traffic; bare metal in different data centers (3 in the US, 2 in the EU, 3 in APAC); near real-time, with high-throughput/low-latency challenges. Backend layer: AWS cloud-based; processes all the frontend layer outputs; ETLs load the data into data sources, both aggregated and raw. Applications layer: also in the cloud; a variety of apps on top of all our data sources - web-based NMC data configuration (segments, audiences, etc.), campaign analysis and campaign management tools, visualized profile graphs, and reports.
  5. What are the questions we try to answer in NMC that help our customers make business decisions? There are a lot of questions, but these lead to what Druid comes to solve. Translating from a human problem to a technical problem: UU (distinct) count vs. simple count.
  6. Demo
  7. Danny talked about the two main use-cases - counting unique users and counting hits (or "simple counts"). The first one is somewhat harder, so it is going to be the focus of my part of the presentation. Past… Mention "cardinality" and "real-time dashboard". Explain the need to union and intersect.
  8. Who’s familiar with the count-distinct problem? For the 2 first solutions, we need to store data per device per attribute per day Bit vector - Elasticsearch /Redis is an example of such system Approximation has a certain error rate
  9. Who is familiar with Elasticsearch? In ES, we stored the raw data: every device was a document, and each such document contained all the events for that device. A single screen in our SaaS application can generate up to thousands of queries. We tried to introduce a new cluster dedicated to indexing only, and then use backup and restore to the second cluster; this method was very expensive and only partially helpful. Tuning for better performance also didn't help much. The story about the demo when we were at the bar (December 20th, 2016)...
  10. Preprocessing - too many combinations; the formula length is not bounded (show some numbers). HyperLogLog - the implementation in Elasticsearch was too slow (done at query time), and set operations increase the error dramatically.
  11. ThetaSketch is based on KMV. Explain what KMV is and how the choice of K affects it. Unions and intersections increase the error; the problematic case is the intersection of a very small set with a very big set.
  12. The larger the K, the smaller the error; however, a larger K means more memory and storage are needed. Demo - http://content.research.neustar.biz/blog/kmv.html
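     To make KMV concrete, here is a minimal pure-Python sketch of the idea (illustrative only - names and parameters are ours): keep the K smallest normalized hash values seen; the K-th smallest value is roughly K/n, so (K - 1) divided by it estimates the distinct count n.

        import hashlib
        import heapq

        K = 1024  # larger K -> smaller error, but more memory and storage

        def kmv_estimate(stream, k=K):
            """Keep the k smallest hash values (normalized to [0, 1));
            the k-th smallest ~ k/n, so (k - 1) / v_k estimates n."""
            smallest = []  # min-heap of negated values = k smallest hashes
            seen = set()
            for item in stream:
                h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
                x = h / 2**64  # normalize the hash to [0, 1)
                if x in seen:
                    continue  # duplicates must not change the sketch
                if len(smallest) < k:
                    heapq.heappush(smallest, -x)
                    seen.add(x)
                elif x < -smallest[0]:  # smaller than the current k-th smallest
                    seen.discard(-heapq.heappushpop(smallest, -x))
                    seen.add(x)
            if len(smallest) < k:
                return len(smallest)  # fewer than k distinct values: exact
            return (k - 1) / -smallest[0]

        # ~1,000,000 distinct IDs estimated from only 1,024 kept hashes:
        print(round(kmv_estimate(f"device-{i}" for i in range(1_000_000))))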
  13. So we talked about statistical algorithms, which is nice, but we needed a practical solution… Supports the ThetaSketch algorithm out of the box. Open source, written in Java (works for us, as we know Java…). Who's familiar with Druid?
  14. Just to give you a sense of where Druid is used in production... http://druid.io/druid-powered.html
  15. I’ll try to cover all these reasons in the next slides
  16. Time-series database - the first thing you need to know about Druid. Column types: timestamp, dimensions, metrics; together they comprise a datasource. Aggregation is done at ingestion time (the outcome is much smaller in size). At query time, it's closer to a key-value search.
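     A toy illustration of ingestion-time rollup (field names and values are made up): raw rows that share a truncated timestamp and the same dimension values collapse into a single aggregated row, which is why the stored datasource is much smaller than the raw input.

        from collections import defaultdict

        # (timestamp, dimension: country, dimension: attribute, metric: count)
        events = [
            ("2019-03-01T10:02", "US", "sports_fan", 1),
            ("2019-03-01T10:17", "US", "sports_fan", 1),
            ("2019-03-01T10:59", "DE", "organic_food", 1),
        ]

        # Roll up to hourly granularity: key = (truncated timestamp, dimensions)
        rollup = defaultdict(int)
        for ts, country, attribute, count in events:
            hour = ts[:13]  # truncate to the hour, e.g. "2019-03-01T10"
            rollup[(hour, country, attribute)] += count

        for key, total in sorted(rollup.items()):
            print(key, total)
        # ('2019-03-01T10', 'US', 'sports_fan') 2  <- two raw rows became one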
  17. We’re just using a different type of aggregator (the ThetaSketch aggregator) to get count distinct, but everything else is essentially the same. The one big difference is multiple ingestions of the same data : For ThetaSketch - not a problem due to the fact is “samples” the data (chooses the K minimal values); Whereas for Sum - we’re going to get wrong numbers (e.g 2X as big if we ingest the data twice) To mitigate it, we’ve added a simple meta-data store to prevent ingesting the same data twice (you can look at it as some kind of “checkpointing”)
  18. We have three types of processes - ingestion, querying, and management. All processes are decoupled and scalable. Ingestion (real-time, e.g. from Kafka; batch - talk about deep storage and how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion vs. ES). Lambda architecture (for those who don't know, it's "a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods").
  19. Explain the tuple and what happens during the aggregation. Mention it says "ThetaSketchAggregator", but again - we also use LongSumAggregator for simple counts. We ingest a lot more data today than we did with ES (10B events/day, TBs of data, vs. 250GB in ES). I mentioned our meta-data store earlier - each component in the flow updates that meta-data store to prevent ingestion of the same data twice (a sketch of the idea follows).
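     The meta-data "checkpointing" can be as simple as an atomic insert keyed by a batch identifier; a minimal sketch with SQLite (table, column, and batch-id formats are hypothetical - our production store is different):

        import sqlite3

        db = sqlite3.connect("ingestion_meta.db")
        db.execute("""CREATE TABLE IF NOT EXISTS ingested_batches (
                          datasource TEXT NOT NULL,
                          batch_id   TEXT NOT NULL,  -- e.g. input path + interval
                          PRIMARY KEY (datasource, batch_id))""")

        def try_claim_batch(datasource, batch_id):
            """True if this batch was not ingested yet; the PRIMARY KEY makes
            the claim atomic, so the same data is never ingested twice."""
            try:
                with db:
                    db.execute("INSERT INTO ingested_batches VALUES (?, ?)",
                               (datasource, batch_id))
                return True
            except sqlite3.IntegrityError:
                return False  # already ingested; a re-run would double sums

        if try_claim_batch("devices_daily", "s3://bucket/2019-03-01/part-0"):
            print("submit the Druid ingestion task")
        else:
            print("duplicate batch - skipping")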
  20. We can see that while the ES response time increases exponentially, the Druid response time stays relatively stable. Benchmark setup - Druid cluster: 1x broker (r3.8xlarge), 8x historicals (r3.8xlarge); Elasticsearch cluster: 20 nodes (r3.8xlarge). This is how we use it; now switching to how we got there, and the pains...
  21. Setup is not easy: separate configs, servers, and tuning per component caused the deployment to take a few months. Use the Druid recommendations for production configuration.
  22. Monitoring your system: Druid has built-in support for Graphite (it exports many metrics), and so does Spark. We also export metrics to Graphite from our ingestion tasks (written in Python) and from the NMC backend (aesreporter), to provide a complete, end-to-end view of the system. In this example, query time is very high due to a high number of pending segments (i.e. segments that are queued to be scanned in order to answer the query).
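     Exporting a metric to Graphite from a Python task needs nothing more than the plaintext protocol over a socket; a minimal sketch (host, port, and metric names are hypothetical):

        import socket
        import time

        GRAPHITE_HOST, GRAPHITE_PORT = "graphite.internal", 2003

        def send_metric(path, value, timestamp=None):
            """Graphite plaintext protocol: one '<metric.path> <value> <unix_ts>'
            line per metric, newline-terminated."""
            ts = int(timestamp or time.time())
            with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT),
                                          timeout=5) as s:
                s.sendall(f"{path} {value} {ts}\n".encode())

        # e.g. report an ingestion task's duration next to Druid's own metrics
        send_metric("nmc.ingestion.devices_daily.duration_seconds", 123.4)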
  23. Data modeling: if using ThetaSketch, reduce the number of intersections (show a slide of the old and new data model). In this example, US is a very large set and Porsche intent is (probably) a small set. It didn't solve all use-cases, but it gives you an idea of how you can approach the problem. Different datasources - e.g. lower accuracy (i.e. lower K) for faster queries vs. higher accuracy with slightly slower queries.
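     The small-set-intersected-with-large-set pitfall is easy to reproduce with the Apache DataSketches Python bindings (pip install datasketches; we assume its update_theta_sketch/theta_intersection API here - a sketch, not our production code):

        from datasketches import theta_intersection, update_theta_sketch

        LG_K = 12  # k = 4096 retained hash values per sketch

        us = update_theta_sketch(LG_K)       # very large set, e.g. US devices
        porsche = update_theta_sketch(LG_K)  # small set, e.g. Porsche intent
        for i in range(1_000_000):
            us.update(f"device-{i}")
        for i in range(1_000):
            porsche.update(f"device-{i}")    # fully contained in `us`

        inter = theta_intersection()
        inter.update(us)
        inter.update(porsche)
        print(inter.get_result().get_estimate())
        # the true intersection is 1,000, but `us` kept only ~4,096 of its
        # 1,000,000 hashes, so the relative error here can be very large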
  24. Combine multiple queries over the REST API (explain why). There can be billions of rows, so filter the data as part of the query. Switching from a groupBy to a timeseries query seems to have solved the "io.druid.java.util.common.IAE: Not enough capacity for even one row! Need[1,509,995,528] but have[0]." error we had. groupBy v2 offers better performance and memory management (e.g. it generates per-segment results using a fully off-heap map).
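     For reference, a filtered timeseries query with a ThetaSketch aggregator posted to the broker looks roughly like this (broker host, datasource, and field names are hypothetical):

        import requests  # assumes the requests package is installed

        query = {
            "queryType": "timeseries",
            "dataSource": "devices_daily",
            "granularity": "all",
            "intervals": ["2019-03-01/2019-03-08"],
            # filter inside the query - never pull billions of rows out
            "filter": {"type": "selector",
                       "dimension": "attribute", "value": "sports_fan"},
            "aggregations": [
                {"type": "thetaSketch", "name": "unique_users",
                 "fieldName": "user_sketch"},       # count distinct
                {"type": "longSum", "name": "hits",
                 "fieldName": "events"},            # simple count
            ],
        }

        resp = requests.post("http://druid-broker:8082/druid/v2/",
                             json=query, timeout=60)
        for row in resp.json():
            print(row["timestamp"], row["result"])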
  25. EMR tuning (spot instances - an 80% cost reduction, but with the risk of being outbid and losing nodes - and the Druid MR production config). Use Parquet. Affinity - use fillCapacityWithAffinity to ingest data from multiple EMR clusters into the same Druid cluster (but different datasources) concurrently; see http://druid.io/docs/latest/configuration/indexing-service.html#affinity
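     Per the linked indexing-service docs, the worker select strategy is set through the overlord's dynamic config, roughly like this (overlord host, middle-manager hosts, and datasource names are hypothetical):

        import requests  # assumes the requests package is installed

        worker_config = {
            "selectStrategy": {
                "type": "fillCapacityWithAffinity",
                "affinityConfig": {
                    "affinity": {
                        # pin each datasource's tasks to specific middle managers
                        "devices_daily": ["middlemanager-1.internal:8091"],
                        "campaigns_daily": ["middlemanager-2.internal:8091"],
                    }
                },
            }
        }

        resp = requests.post("http://overlord.internal:8090/druid/indexer/v1/worker",
                             json=worker_config, timeout=30)
        resp.raise_for_status()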
  26. Why? Ingestion still takes a lot of time and resources, and there was almost no "penalty" on the Spark Streaming app (with the new version of the app). For count distinct, we use the DataSketches packages plus combineByKey(); this requires setting isInputThetaSketch=true on the ingestion task (see the sketch below). Decreasing the execution frequency (e.g. every 1 hour instead of every 30 minutes) allows a more significant aggregation ratio.
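     The shape of the pre-aggregation is roughly the following (a hedged sketch: it assumes the Apache DataSketches Python bindings are available on the executors, and in practice the sketches may need to cross the shuffle as serialized bytes; names are hypothetical):

        from datasketches import theta_union, update_theta_sketch
        from pyspark import SparkContext

        LG_K = 14  # 2^14 = 16,384 retained hash values per sketch

        def make_sketch(user_id):
            sk = update_theta_sketch(LG_K)
            sk.update(user_id)
            return sk

        def add_user(sk, user_id):
            sk.update(user_id)
            return sk

        def merge_sketches(sk1, sk2):
            u = theta_union(LG_K)
            u.update(sk1)
            u.update(sk2)
            return u.get_result()

        sc = SparkContext(appName="pre-aggregate-sketches")
        # (attribute|day, user_id) pairs, e.g. from a Kafka micro-batch
        events = sc.parallelize([("sports_fan|2019-03-01", f"device-{i % 500}")
                                 for i in range(10_000)])

        # One theta sketch per key; Druid then ingests the serialized
        # sketches with isInputThetaSketch=true on the ingestion task.
        sketches = events.combineByKey(make_sketch, add_user, merge_sketches)
        for key, sk in sketches.collect():
            print(key, round(sk.get_estimate()))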
  27. Ingestion has little effect on queries, and we get sub-second responses even for hundreds or thousands of concurrent queries. With Druid and ThetaSketch, we've improved our ingestion volume, query performance, and concurrency by an order of magnitude, at a lower cost compared to our old solution (we've achieved a more performant, scalable, cost-effective solution). There is a nice comparison of open-source OLAP systems for big data here - https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
  28. Ingestion has little effect on queries, with sub-second responses even for hundreds or thousands of concurrent queries. Note that the cost is for the entire solution (Druid cluster, EMR, etc.). With Druid and ThetaSketch, we've improved our ingestion volume, query performance, and concurrency by an order of magnitude at a lower cost than our old solution - a more performant, scalable, cost-effective solution.