Case Study: Realtime Analytics with Druid

This case study covers ViralGains, a US-based video marketing platform. I (Salil Kalia) presented it at the Great Indian Developer Summit (GIDS) 2016. It describes part of the work we did at TO THE NEW Digital with our customer, ViralGains.

Here, I showcase Druid (http://druid.io) and its supporting technologies (Kafka and ZooKeeper) to demonstrate how they helped us build a stable real-time analytics system that captures hundreds of millions of analytics events per day. In the ad industry, precision matters because money is involved at every step, even for a single ad impression.

The case study includes a demo and a short talk on the journey from Redis to Cassandra and finally to Druid, which delivered outstanding performance.

Published in: Data & Analytics

  1. Case Study: Real-time Analytics With Druid
     Salil Kalia, Tech Lead, TO THE NEW Digital
  2. About Presenter
     • Over 10 years in the software industry
     • Working with TO THE NEW Digital since 2009
     • Mainly uses the Java/Groovy/Grails ecosystems for development
     • Working in the digital marketing domain for the last few years
     • Cassandra certified trainer
     • Loves traveling and exploring new places
  3. Agenda
     Understanding the use-case
     • Ad workflow
     • Our use case
     Experiments with technologies
     • Redis
     • Cassandra
     Introduction to Druid
     • Architecture
     • Druid in production
     • Demo
  4. Understanding the use-case
  5. Understanding The Ad Workflow
     [Diagram: USER, PUBLISHER SERVER, AD EXCHANGE, and AD AGENCY-1/2/3, connected by Web Page Request, Ad Request, and Ad-Content flows]
  6. Examples From Our Use Case
     • How many times has a video been viewed?
     • How many times has a video been viewed in a particular time-span?
     • How many times has a video been viewed in a particular time-span at a particular site?
     • How many times has a video been viewed in a particular time-span at a particular site in a particular country?
     • How many times has a video been viewed in a particular time-span at a particular site in a particular country on a particular device?
  7. Video Events For The Analysis
     • LOAD
     • START
     • PLAYING
     • VIEW
     • STOP / PAUSE
     • FINISH
  8. Event Data (Sample)
     TIMESTAMP             Ad   Site      Advertiser  Event   Action
     2011-01-01T01:01:27Z  123  abc.com   Brand X     Player  Load
     2011-01-01T01:01:33Z  234  abcd.com  Brand Y     Player  Load
     2011-01-01T01:01:40Z  123  abc.com   Brand X     Player  Start
     2011-01-01T01:01:45Z  123  abc.com   Brand X     Player  Playing
     2011-01-01T01:01:50Z  123  abc.com   Brand Y     Player  Playing
     2011-01-01T01:01:51Z  123  abc.com   Brand X     Player  Stop
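
The use-case questions on slide 6 all reduce to counting events of one action inside a time span, narrowed by zero or more dimensions. The following is a minimal in-memory sketch of that idea over the sample rows, not the production system. Reading each row as six fields (timestamp, ad, site, advertiser, event, action) is an interpretation of the slide, and `parse_ts` and `count_events` are hypothetical helper names. The sample has no country or device columns, so only ad, site, and advertiser filters are supported here.

```python
from datetime import datetime, timezone

# Sample rows from the slide: (timestamp, ad, site, advertiser, event, action).
EVENTS = [
    ("2011-01-01T01:01:27Z", "123", "abc.com", "Brand X", "Player", "Load"),
    ("2011-01-01T01:01:33Z", "234", "abcd.com", "Brand Y", "Player", "Load"),
    ("2011-01-01T01:01:40Z", "123", "abc.com", "Brand X", "Player", "Start"),
    ("2011-01-01T01:01:45Z", "123", "abc.com", "Brand X", "Player", "Playing"),
    ("2011-01-01T01:01:50Z", "123", "abc.com", "Brand Y", "Player", "Playing"),
    ("2011-01-01T01:01:51Z", "123", "abc.com", "Brand X", "Player", "Stop"),
]


def parse_ts(text):
    """Parse an ISO-8601 UTC timestamp such as 2011-01-01T01:01:27Z."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def count_events(events, action, start, end, **dims):
    """Count events with `action` in [start, end) that match every given dimension.

    Supported dimension names (hypothetical): ad, site, advertiser.
    """
    total = 0
    for ts, ad, site, advertiser, _event, row_action in events:
        row = {"ad": ad, "site": site, "advertiser": advertiser}
        if row_action != action:
            continue
        if not (start <= parse_ts(ts) < end):
            continue
        if all(row[name] == value for name, value in dims.items()):
            total += 1
    return total


start = parse_ts("2011-01-01T01:00:00Z")
end = parse_ts("2011-01-01T02:00:00Z")
playing_on_site = count_events(EVENTS, "Playing", start, end, site="abc.com")  # 2
playing_brand_x = count_events(
    EVENTS, "Playing", start, end, site="abc.com", advertiser="Brand X"
)  # 1
```

Each extra question on slide 6 adds one more keyword filter; scanning every row per question is exactly what stops working at hundreds of millions of events per day.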
  9. What Is Analytics?
     Processing the HISTORICAL data to:
     • Understand potential trends
     • Analyze the effects of certain decisions or events
     • Evaluate the performance of a system
     • Make better business decisions
  10. What Is Real-time Analytics?
  11. Why (We Need) Real-time Analytics?
     • Understand the real-time performance
     • Control the velocity
     • Avoid over serving
     • Avoid under serving
     • Control the targeting
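
To make "avoid over serving" and "avoid under serving" concrete, here is a small sketch of a pacing check that a real-time counter makes possible. The slides do not describe how pacing was actually decided; the even-delivery plan, the 10% tolerance, and the name `pacing_status` are all assumptions made for illustration.

```python
def pacing_status(target_impressions, served, elapsed_fraction, tolerance=0.10):
    """Classify delivery against an even-pacing plan (an assumed policy).

    target_impressions: impressions the campaign should deliver in total
    served:             impressions delivered so far (the real-time count)
    elapsed_fraction:   share of the campaign's run time already elapsed, in (0, 1]
    tolerance:          allowed relative deviation from the plan
    """
    if not 0 < elapsed_fraction <= 1:
        raise ValueError("elapsed_fraction must be in (0, 1]")
    expected = target_impressions * elapsed_fraction
    if served > expected * (1 + tolerance):
        return "over-serving"
    if served < expected * (1 - tolerance):
        return "under-serving"
    return "on-track"


# Halfway through a 1,000,000-impression campaign:
status_fast = pacing_status(1_000_000, 600_000, 0.5)  # "over-serving"
status_slow = pacing_status(1_000_000, 300_000, 0.5)  # "under-serving"
status_ok = pacing_status(1_000_000, 500_000, 0.5)    # "on-track"
```

The check is only as useful as the freshness of `served`: with counts that lag by hours, a campaign can overshoot its budget before anyone notices.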
  12. Recap – Things We Understood
     • How ad-tech works (in general)
     • Our use-case
     • Different video player events
     • We are expecting a huge amount of data arriving at a very high velocity
  13. Experiments with technologies
  14. Why We Picked Redis
     • Great buzz in the market
     • Highly scalable
     • Easy to set up, configure and use
     • We were not very clear about our use-case
  15. Realizations From Redis
     • Not a good fit for time-series (big) data
     • Persistence is another issue: we can’t afford losing data
     • There was a huge variety of keys all over the place
     • Complexity in the (application side) code started increasing
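
The "huge variety of keys" point can be illustrated with a sketch of pre-aggregated counters in a key-value store. The slides do not show the actual key schema, so the naming scheme below, the dimension list, and the function `counter_keys` are hypothetical. The sketch only builds key names and does not talk to Redis.

```python
from itertools import combinations


def counter_keys(event, dimensions, buckets):
    """Return every counter key one event must increment.

    To answer any combination of filters with a single lookup, a counter is
    kept for each subset of dimensions and each time bucket.
    """
    keys = []
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            suffix = ":".join(f"{name}={event[name]}" for name in combo)
            for bucket in buckets:
                parts = ["views", event["video"], bucket]
                if suffix:
                    parts.append(suffix)
                keys.append(":".join(parts))
    return keys


event = {"video": "123", "site": "abc.com", "country": "US", "device": "mobile"}
keys = counter_keys(
    event,
    dimensions=("site", "country", "device"),
    buckets=("2011-01-01T01", "2011-01-01"),  # hourly and daily buckets
)
key_count = len(keys)  # 2**3 dimension subsets x 2 buckets = 16
```

Under this scheme every new dimension doubles the number of keys written per event, and the application code has to know how to assemble the right key for each question.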
  16. Working With Cassandra
     • Very good support for time-series data
     • Extremely good at writing data at a very high speed
     • Very easy to scale horizontally
     • Supports aggregations through Counters
  17. Writing into Cassandra
     [Diagram: AD PLAYER, ANALYTICS SERVER, and CASSANDRA on the write path]
  18. Reading from Cassandra
     [Diagram: CAMPAIGN MANAGER, ANALYTICS SERVER, and CASSANDRA on the read path]
  19. What didn’t work with Cassandra
     • Inconsistent results
     • Unreliable counters
     • No support for ad-hoc queries
     • Nodes were crashing very frequently
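
The slides do not say why the counters were unreliable. One well-known property of Cassandra counters that can produce this symptom is that an increment is not idempotent: if a write times out, the client cannot tell whether it was applied, and retrying may count the same event twice. The sketch below simulates that behaviour with a plain Python object; it is an illustration, not a diagnosis of the cluster described in the talk, and `CounterStore` and `record_view` are hypothetical names.

```python
class CounterStore:
    """Toy counter whose acknowledgement can be lost after the write is applied."""

    def __init__(self):
        self.value = 0

    def increment(self, lose_ack=False):
        self.value += 1  # the write is applied on the server side
        if lose_ack:
            # The client only sees a timeout and cannot know the write succeeded.
            raise TimeoutError("write applied, acknowledgement lost")


def record_view(store, lose_ack_on_first_try):
    """Record one view, blindly retrying once on timeout."""
    try:
        store.increment(lose_ack=lose_ack_on_first_try)
    except TimeoutError:
        store.increment()  # the retry counts the same view a second time


healthy = CounterStore()
record_view(healthy, lose_ack_on_first_try=False)

flaky = CounterStore()
record_view(flaky, lose_ack_on_first_try=True)

healthy_count = healthy.value  # 1
flaky_count = flaky.value      # 2, although only one view happened
```

Not retrying avoids the double count but risks losing the event instead, which is why exact counts are hard to guarantee with this pattern.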
  20. Crossroads – What next?
     • Third-party tools on top of Cassandra for better consistency
     • DataStax Enterprise edition
     • Taking a deeper dive into Cassandra to reconfigure the whole architecture and setup
     • Switching to a different technology
  21. Understanding Druid
  22. About Druid (http://druid.io)
     • An open-source analytics data store
     • Supports streaming data ingestion
     • Flexible filters for ad-hoc queries
     • Fast aggregations with sub-second queries
     • Distributed, shared-nothing architecture
     • Easily scalable
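
As a sketch of what "flexible filters for ad-hoc queries" looks like in practice, the function below builds a Druid native `timeseries` query as a Python dictionary; native queries are JSON documents sent over HTTP to a broker node. The data source name `video_events`, the dimension names, and the ingestion-time metric `count` are assumptions, since the slides do not show the team's schema.

```python
import json


def views_query(data_source, interval, filters, granularity="hour"):
    """Build a Druid native timeseries query that sums a `count` metric.

    filters: mapping of dimension name to required value; all must match.
    """
    query = {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],
        "aggregations": [
            # Assumes events were ingested with a metric named "count".
            {"type": "longSum", "name": "views", "fieldName": "count"}
        ],
    }
    fields = [
        {"type": "selector", "dimension": name, "value": value}
        for name, value in filters.items()
    ]
    if len(fields) == 1:
        query["filter"] = fields[0]
    elif fields:
        query["filter"] = {"type": "and", "fields": fields}
    return query


query = views_query(
    "video_events",  # hypothetical data source name
    "2016-01-01T00:00:00Z/2016-01-02T00:00:00Z",
    {"action": "View", "site": "abc.com", "country": "US"},
)
query_json = json.dumps(query, indent=2)
```

Adding the device question from slide 6 means adding one more entry to `filters`; no new keys or tables have to be designed up front, which is the contrast with the Redis and Cassandra attempts.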
  23. Setting Up Druid In Production
     [Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, and CASSANDRA]
  24. Druid’s Reliability Check
     [Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, RAW FILE CONSUMER, RAW FILES, and a Job To Test Druid’s Integrity]
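
The slide names a job that tests Druid's integrity against raw files written by a separate consumer, but does not describe how the job works. One plausible form, shown here purely as an assumption, is to count events per hour in the raw files, fetch the same counts from Druid, and report every hour where the two disagree. `find_mismatches` is a hypothetical name and both inputs are plain dictionaries.

```python
def find_mismatches(raw_counts, druid_counts):
    """Return {hour: (raw, druid)} for every hour where the counts differ.

    raw_counts:   events per hour counted from the raw files
    druid_counts: events per hour reported by Druid for the same period
    An hour missing from one side is treated as a count of zero.
    """
    mismatches = {}
    for hour in sorted(set(raw_counts) | set(druid_counts)):
        raw = raw_counts.get(hour, 0)
        druid = druid_counts.get(hour, 0)
        if raw != druid:
            mismatches[hour] = (raw, druid)
    return mismatches


raw = {"2016-01-01T00": 1000, "2016-01-01T01": 1200, "2016-01-01T02": 900}
druid = {"2016-01-01T00": 1000, "2016-01-01T01": 1195}
report = find_mismatches(raw, druid)
# {"2016-01-01T01": (1200, 1195), "2016-01-01T02": (900, 0)}
```

Because money is attached to every impression, a check like this gives an independent number to compare against before trusting the real-time figures.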
  25. A Quick Demo
  26. Druid Architecture
     [Diagram: Druid nodes (REAL TIME, COORDINATOR, HISTORICAL, BROKER) and external dependencies (ZOOKEEPER, MYSQL, DEEP STORAGE); flows: Streaming Data, Client Queries, Queries, MetaData, Data/Segments]
  27. Druid Data Ingestion
     [Diagram: same architecture diagram as slide 26]
  28. Druid Data Ingestion (Our System)
     [Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), and DRUID Real-time Node]
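
In the pipeline on this slide, the analytics server publishes each player event to Kafka and a Druid real-time node consumes it from there. The sketch below shows the publishing side only, under several assumptions: events are encoded as JSON, the topic is called `video-events`, and `InMemoryProducer` is a stand-in that records messages so the example runs without a broker. A real Kafka client would take its place.

```python
import json


class InMemoryProducer:
    """Hypothetical stand-in for a Kafka producer; records messages in a list."""

    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))


def encode_event(timestamp, ad, site, advertiser, action):
    """Serialize one player event as UTF-8 JSON bytes."""
    event = {
        "timestamp": timestamp,
        "ad": ad,
        "site": site,
        "advertiser": advertiser,
        "action": action,
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")


def publish_event(producer, topic, **fields):
    """Encode an event and hand it to the producer."""
    producer.send(topic, encode_event(**fields))


producer = InMemoryProducer()
publish_event(
    producer,
    "video-events",  # assumed topic name
    timestamp="2011-01-01T01:01:27Z",
    ad="123",
    site="abc.com",
    advertiser="Brand X",
    action="Load",
)
```

Putting Kafka between the analytics server and Druid also lets a second consumer read the same stream, which is what the raw file consumer on slide 24 relies on.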
  29. Druid Data Retrieval
     [Diagram: same architecture diagram as slide 26]
  30. Coordinator Nodes
     [Diagram: same architecture diagram as slide 26]
  31. Druid Data Segment Propagation
     [Diagram: the slide 26 architecture without BROKER NODES and Client Queries]
  32. Our Production Stats
     • Over 200 million events per day ingested into the Druid cluster
     • 4 boxes with 8 cores, 64GB RAM, 1TB SSD
     • 2 coordinator nodes (only one master)
     • 2 real-time nodes
     • 4 historical nodes (on each box)
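
For a sense of scale, the daily volume can be converted to an average rate. The slide says "over 200 million", so the figures below are lower bounds, and real traffic is unlikely to be evenly spread across the day. The per-node figure assumes the load is split evenly between the two real-time nodes, which the slides do not state.

```python
EVENTS_PER_DAY = 200_000_000      # "over 200 million" on the slide: a lower bound
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400
REALTIME_NODES = 2

avg_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
avg_per_realtime_node = avg_per_second / REALTIME_NODES  # assumes an even split

print(f"{avg_per_second:.0f} events/s on average")                # 2315
print(f"{avg_per_realtime_node:.0f} events/s per real-time node")  # 1157
```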
  33. Companies Using Druid
  34. Questions?
