Case Study: Real-time Analytics with Druid
Salil Kalia, Tech Lead, TO THE NEW Digital
About Presenter
• Over 10 years in the software industry
• Working with TO THE NEW Digital since 2009
• Works mainly with the Java/Groovy/Grails ecosystem for development
• Working in the digital marketing domain for the last few years
• Cassandra certified trainer
• Loves traveling and exploring new places
Agenda
Understanding the use-case
• Ad workflow
• Our use case
Experiments with technologies
• Redis
• Cassandra
Introduction to Druid
• Architecture
• Druid in production
• Demo
Understanding the use-case
Understanding The Ad Workflow
[Diagram: a USER requests a web page; the PUBLISHER SERVER sends an ad request to an AD EXCHANGE, which returns ad content sourced from competing ad agencies (AGENCY-1, AGENCY-2, AGENCY-3).]
Examples From Our Use Case
• How many times has a video been viewed?
• How many times has a video been viewed in a particular time-span?
• How many times has a video been viewed in a particular time-span at a particular site?
• How many times has a video been viewed in a particular time-span at a particular site in a particular country?
• How many times has a video been viewed in a particular time-span at a particular site in a particular country on a particular device?
Video Events For The Analysis
• LOAD
• START
• PLAYING
• VIEW
• STOP / PAUSE
• FINISH
Event Data (Sample)
TIMESTAMP              Ad   Site      Advertiser  Event   Action
2011-01-01T01:01:27Z   123  abc.com   Brand X     Player  Load
2011-01-01T01:01:33Z   234  abcd.com  Brand Y     Player  Load
2011-01-01T01:01:40Z   123  abc.com   Brand X     Player  Start
2011-01-01T01:01:45Z   123  abc.com   Brand X     Player  Playing
2011-01-01T01:01:50Z   123  abc.com   Brand Y     Player  Playing
2011-01-01T01:01:51Z   123  abc.com   Brand X     Player  Stop
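For illustration only, a minimal sketch of how one such event might be modeled on the analytics server; the class and field names are hypothetical, not taken from the actual system.

```java
import java.time.Instant;

// Hypothetical model of one player event row from the table above (requires a recent JDK for records).
public record VideoEvent(
        Instant timestamp,   // e.g. 2011-01-01T01:01:27Z
        String adId,         // e.g. "123"
        String site,         // e.g. "abc.com"
        String advertiser,   // e.g. "Brand X"
        String event,        // e.g. "Player"
        String action        // e.g. "Load", "Start", "Playing", "Stop"
) {}
```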
What Is Analytics?
Processing the HISTORICAL data to:
•Understand potential trends
•Analyze the effects of certain decisions or events
•Evaluate the performance of a system
•Make better business decisions
What Is Real-time Analytics?
Why (We Need) Real-time Analytics?
• Understand the real-time performance
• Control the velocity
• Avoid over-serving
• Avoid under-serving
• Control the targeting
Recap – Things We Understood
• How ad-tech works (in general)
• Our use-case
• Different video player events
• We are expecting a huge amount of data arriving at a very high velocity
Experiments with technologies
Why We Picked Redis
• Great buzz in the market
• Highly scalable
• Easy to set up, configure and use
• We were not yet very clear about our use-case
Realizations From Redis
• Not a good fit for time-series (big) data
• Persistence is another issue – we can’t afford to lose data
• There was a huge variety of keys all over the place (see the sketch below)
• Complexity in the application-side code kept increasing
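A rough sketch of the pattern we ended up with (the key layout and names are illustrative, not our actual keys): one counter key per combination of dimensions and time bucket, which is where the key explosion and the application-side complexity came from.

```java
import redis.clients.jedis.Jedis;

public class RedisCounterSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Illustrative only: one counter per (metric, ad, site, country, device, hour) combination.
            String key = String.join(":",
                    "views", "ad=123", "site=abc.com", "country=US", "device=mobile", "2011010101");
            jedis.incr(key);
            // Every extra dimension or time granularity multiplies the number of keys,
            // and every roll-up (per site, per country, ...) has to be maintained in application code.
        }
    }
}
```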
Working With Cassandra
• Very good support for time-series data
• Extremely good at writing data at very high speed
• Very easy to scale horizontally
• Supports aggregations through counters (see the sketch below)
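As an illustration of the counter-based approach, a minimal sketch using the DataStax Java driver; the keyspace, table and column names are hypothetical, not our actual schema.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraCounterSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Hypothetical counter table: one row per (ad, site, hour bucket).
            session.execute("CREATE KEYSPACE IF NOT EXISTS analytics "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS analytics.video_views ("
                    + "ad_id text, site text, hour_bucket text, views counter, "
                    + "PRIMARY KEY ((ad_id, site), hour_bucket))");
            // Each incoming view event becomes a counter increment.
            session.execute("UPDATE analytics.video_views SET views = views + 1 "
                    + "WHERE ad_id = ? AND site = ? AND hour_bucket = ?",
                    "123", "abc.com", "2011010101");
        }
    }
}
```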
Writing into Cassandra
[Diagram: AD PLAYER → ANALYTICS SERVER → CASSANDRA]
Reading from Cassandra
[Diagram: CAMPAIGN MANAGER → ANALYTICS SERVER → CASSANDRA]
What didn’t work with Cassandra
• Inconsistent results
• Unreliable counters
• No support for ad-hoc queries
• Nodes were crashing very frequently
Crossroads – What next ?
• Third-party tools on top of Cassandra for better consistency
• DataStax Enterprise edition
• Taking a deeper dive into Cassandra to reconfigure the whole architecture and setup
• Switching to a different technology
Understanding Druid
About Druid (http://druid.io)
• An open-source analytics data store
• Supports streaming data ingestion
• Flexible filters for ad-hoc queries (see the example query below)
• Fast aggregations – sub-second queries
• Distributed, shared-nothing architecture
• Easily scalable
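As a hedged example of such an ad-hoc query, a sketch of a native Druid groupBy query answering one of the earlier questions (views per site and country over an interval), posted to a broker node; the data source, dimension names, filter value and broker URL are assumptions, not our exact setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical data source, dimensions and filter value.
        String query = """
            {
              "queryType": "groupBy",
              "dataSource": "video_events",
              "granularity": "hour",
              "dimensions": ["site", "country"],
              "filter": { "type": "selector", "dimension": "action", "value": "View" },
              "aggregations": [ { "type": "longSum", "name": "views", "fieldName": "count" } ],
              "intervals": ["2011-01-01T00:00:00Z/2011-01-02T00:00:00Z"]
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://broker-host:8082/druid/v2/"))  // assumed broker address
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```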
Setting Up Druid In Production
[Diagram: the production setup with the AD PLAYER, ANALYTICS SERVER, KAFKA (cluster), DRUID cluster and CASSANDRA]
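A minimal sketch of the handoff from the analytics server into Kafka (the topic name and broker address are assumptions); Druid's real-time ingestion then consumes from the same topic.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisherSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");  // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String eventJson = "{\"timestamp\":\"2011-01-01T01:01:27Z\",\"ad\":\"123\","
                + "\"site\":\"abc.com\",\"advertiser\":\"Brand X\",\"event\":\"Player\",\"action\":\"Load\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by ad id so events for the same ad land on the same partition.
            producer.send(new ProducerRecord<>("video-events", "123", eventJson));
        }
    }
}
```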
Druid’s Reliability Check
[Diagram: events from the AD PLAYER flow through the ANALYTICS SERVER into the KAFKA cluster; the DRUID cluster consumes one copy while a separate RAW FILE CONSUMER writes raw files, and a job compares the raw files against Druid to test its integrity.]
A Quick Demo
Druid Architecture
[Diagram: Druid nodes – REAL-TIME, HISTORICAL, BROKER and COORDINATOR – plus external dependencies: DEEP STORAGE, ZOOKEEPER and MySQL (metadata). Arrows show streaming data, client queries, metadata, and data/segments.]
Druid Data Ingestion
[Same architecture diagram, highlighting the ingestion path: streaming data enters the REAL-TIME nodes, which hand finished segments off to DEEP STORAGE and the HISTORICAL nodes.]
Druid Data Ingestion (Our System)
[Diagram: AD PLAYER → ANALYTICS SERVER → KAFKA (cluster) → DRUID real-time node]
Druid Data Retrieval
[Same architecture diagram, highlighting the query path: client queries hit the BROKER nodes, which fan out to the REAL-TIME and HISTORICAL nodes.]
Coordinator Nodes
[Same architecture diagram, highlighting the COORDINATOR nodes, which use ZooKeeper and the MySQL metadata store to manage which segments the HISTORICAL nodes load.]
Druid Data Segment Propagation
[Same architecture diagram, highlighting segment propagation: REAL-TIME nodes persist segments to DEEP STORAGE, the COORDINATOR notices via the metadata store, and HISTORICAL nodes load the segments and start serving them.]
Our Production Stats
• Over 200 million events per day ingested into the Druid cluster
• 4 boxes with 8 cores, 64GB RAM and 1TB SSD
• 2 coordinator nodes (only one master)
• 2 real-time nodes
• 4 historical nodes (on each box)
Companies Using Druid
Questions?
Editor's Notes

  • #4 5m
  • #5 The case study is about one of our potential customers (based in the US), ViralGains. ViralGains is a viral video marketing platform that basically runs campaigns for different ad agencies and brands. Let’s see an overview of the ad world.
  • #6 5m Our role here. Requests arrive at high velocity – usually around 2 million requests per minute.
  • #7 10m In order to do so, what do we need to do now? The orange highlights are filters.
  • #9 12m Now we have data – it’s time to analyze it.
  • #10 Everybody knows this, but just to revisit the fundamentals.
  • #11 14m Before going ahead with the real-time analytics part, let’s see how ad-tech works.
  • #12 16m
  • #13 17m
  • #14 Redis Cassandra Druid
  • #15 18m Not talking about what Redis is – assuming you already know it. Using Redis was a too-early decision.
  • #17 20m
  • #20 Explain (one-liners): why inconsistent results – delays in the sync process and distributed counters. It is very much possible that we simply couldn’t configure it very well.
  • #21 24m
  • #22 25m
  • #23 27m What are filters? Not explaining why aggregations are fast.
  • #24 29m What is Kafka (in between)? Coloring strategy.
  • #25 35m
  • #26 35m
  • #27 45m
  • #28 50m External dependencies. Real-time node vs. historical node (data lies only here). What is deep storage (S3)? MySQL for metadata. What are segments?
  • #32 53m
  • #33 55m Events hit a processing node called the real-time node, which indexes (compacts) them, does the aggregation, and makes the data ready for querying almost immediately. Once data is indexed, the real-time node can offload older indexed data to deep storage, so it can process new data without being overwhelmed. It pushes segments to deep storage (generally S3) and writes to the metadata store that it wants to offload them. The coordinator sees that and asks a historical node to read the segments from S3; the historical node broadcasts once it has finished loading them into memory. Once the historical node announces it is ready to serve queries, the real-time node offloads.
  • #34 55m