Case Study: Real-time Analytics with Druid
Salil Kalia, Tech Lead, TO THE NEW Digital
About Presenter
• Over 10 years in the software industry
• Working with TO THE NEW Digital since 2009
• Works mainly with the Java/Groovy/Grails ecosystem for development
• Working in the digital marketing domain for the last few years
• Cassandra certified trainer
• Loves traveling and exploring new places
Agenda
Understanding the use-case
• Ad workflow
• Our use case
Experiments with technologies
• Redis
• Cassandra
Introduction to Druid
• Architecture
• Druid in production
• Demo
Understanding the use-case
Understanding The Ad Workflow
[Diagram: a USER requests a web page; the PUBLISHER SERVER sends an ad request to an AD EXCHANGE, which returns ad content sourced from competing ad agencies (AGENCY-1, AGENCY-2, AGENCY-3).]
Examples From Our Use Case
• How many times has a video been viewed?
• How many times has a video been viewed in a particular time-span?
• How many times has a video been viewed in a particular time-span at a particular site?
• How many times has a video been viewed in a particular time-span at a particular site in a particular country?
• How many times has a video been viewed in a particular time-span at a particular site in a particular country on a particular device?
Video Events For The Analysis
• LOAD
• START
• PLAYING
• VIEW
• STOP / PAUSE
• FINISH
Event Data (Sample)
TIMESTAMP              Ad   Site      Advertiser  Event   Action
2011-01-01T01:01:27Z   123  abc.com   Brand X     Player  Load
2011-01-01T01:01:33Z   234  abcd.com  Brand Y     Player  Load
2011-01-01T01:01:40Z   123  abc.com   Brand X     Player  Start
2011-01-01T01:01:45Z   123  abc.com   Brand X     Player  Playing
2011-01-01T01:01:50Z   123  abc.com   Brand Y     Player  Playing
2011-01-01T01:01:51Z   123  abc.com   Brand X     Player  Stop
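For illustration only, a minimal sketch of how one such event might be modeled on the analytics server; the class and field names are hypothetical, not taken from the actual system.

```java
import java.time.Instant;

// Hypothetical model of one player event row from the table above (requires a recent JDK for records).
public record VideoEvent(
        Instant timestamp,   // e.g. 2011-01-01T01:01:27Z
        String adId,         // e.g. "123"
        String site,         // e.g. "abc.com"
        String advertiser,   // e.g. "Brand X"
        String event,        // e.g. "Player"
        String action        // e.g. "Load", "Start", "Playing", "Stop"
) {}
```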
What Is Analytics?
Processing the HISTORICAL data to:
•Understand potential trends
•Analyze the effects of certain decisions or events
•Evaluate the performance of a system
•Make better business decisions
What Is Real-time Analytics?
Why (We Need) Real-time Analytics?
• Understand the real-time performance
• Control the velocity
• Avoid over-serving
• Avoid under-serving
• Control the targeting
Recap – Things We Understood
• How ad-tech works (in general)
• Our use-case
• Different video player events
• We are expecting a huge amount of data arriving at a very high velocity
Experiments with technologies
Why We Picked Redis
• Great buzz in the market
• Highly scalable
• Easy to set up, configure and use
• We were not yet very clear about our use-case
Realizations From Redis
• Not a good fit for time-series (big) data
• Persistence is another issue – we can’t afford to lose data
• There was a huge variety of keys all over the place (see the sketch below)
• Complexity in the application-side code kept increasing
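A rough sketch of the pattern we ended up with (the key layout and names are illustrative, not our actual keys): one counter key per combination of dimensions and time bucket, which is where the key explosion and the application-side complexity came from.

```java
import redis.clients.jedis.Jedis;

public class RedisCounterSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Illustrative only: one counter per (metric, ad, site, country, device, hour) combination.
            String key = String.join(":",
                    "views", "ad=123", "site=abc.com", "country=US", "device=mobile", "2011010101");
            jedis.incr(key);
            // Every extra dimension or time granularity multiplies the number of keys,
            // and every roll-up (per site, per country, ...) has to be maintained in application code.
        }
    }
}
```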
Working With Cassandra
• Very good support for time-series data
• Extremely good at writing data at very high speed
• Very easy to scale horizontally
• Supports aggregations through counters (see the sketch below)
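As an illustration of the counter-based approach, a minimal sketch using the DataStax Java driver; the keyspace, table and column names are hypothetical, not our actual schema.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraCounterSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Hypothetical counter table: one row per (ad, site, hour bucket).
            session.execute("CREATE KEYSPACE IF NOT EXISTS analytics "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS analytics.video_views ("
                    + "ad_id text, site text, hour_bucket text, views counter, "
                    + "PRIMARY KEY ((ad_id, site), hour_bucket))");
            // Each incoming view event becomes a counter increment.
            session.execute("UPDATE analytics.video_views SET views = views + 1 "
                    + "WHERE ad_id = ? AND site = ? AND hour_bucket = ?",
                    "123", "abc.com", "2011010101");
        }
    }
}
```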
Writing into Cassandra
[Diagram: AD PLAYER → ANALYTICS SERVER → CASSANDRA]
Reading from Cassandra
[Diagram: CAMPAIGN MANAGER → ANALYTICS SERVER → CASSANDRA]
What didn’t work with Cassandra
• Inconsistent results
• Unreliable counters
• No support for ad-hoc queries
• Nodes were crashing very frequently
Crossroads – What next ?
• Third-party tools on top of Cassandra for better consistency
• DataStax Enterprise edition
• Taking a deeper dive into Cassandra to reconfigure the whole architecture and setup
• Switching to a different technology
Understanding Druid
About Druid (http://druid.io)
• An open-source analytics data store
• Supports streaming data ingestion
• Flexible filters for ad-hoc queries (see the example query below)
• Fast aggregations – sub-second queries
• Distributed, shared-nothing architecture
• Easily scalable
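As a hedged example of such an ad-hoc query, a sketch of a native Druid groupBy query answering one of the earlier questions (views per site and country over an interval), posted to a broker node; the data source, dimension names, filter value and broker URL are assumptions, not our exact setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical data source, dimensions and filter value.
        String query = """
            {
              "queryType": "groupBy",
              "dataSource": "video_events",
              "granularity": "hour",
              "dimensions": ["site", "country"],
              "filter": { "type": "selector", "dimension": "action", "value": "View" },
              "aggregations": [ { "type": "longSum", "name": "views", "fieldName": "count" } ],
              "intervals": ["2011-01-01T00:00:00Z/2011-01-02T00:00:00Z"]
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://broker-host:8082/druid/v2/"))  // assumed broker address
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```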
Setting Up Druid In Production
[Diagram: the production setup with the AD PLAYER, ANALYTICS SERVER, KAFKA (cluster), DRUID cluster and CASSANDRA]
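A minimal sketch of the handoff from the analytics server into Kafka (the topic name and broker address are assumptions); Druid's real-time ingestion then consumes from the same topic.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisherSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");  // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String eventJson = "{\"timestamp\":\"2011-01-01T01:01:27Z\",\"ad\":\"123\","
                + "\"site\":\"abc.com\",\"advertiser\":\"Brand X\",\"event\":\"Player\",\"action\":\"Load\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by ad id so events for the same ad land on the same partition.
            producer.send(new ProducerRecord<>("video-events", "123", eventJson));
        }
    }
}
```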
Druid’s Reliability Check
[Diagram: events from the AD PLAYER flow through the ANALYTICS SERVER into the KAFKA cluster; the DRUID cluster consumes one copy while a separate RAW FILE CONSUMER writes raw files, and a job compares the raw files against Druid to test its integrity.]
A Quick Demo
Druid Architecture
[Diagram: Druid nodes – REAL-TIME, HISTORICAL, BROKER and COORDINATOR – plus external dependencies: DEEP STORAGE, ZOOKEEPER and MySQL (metadata). Arrows show streaming data, client queries, metadata, and data/segments.]
Druid Data Ingestion
[Same architecture diagram, highlighting the ingestion path: streaming data enters the REAL-TIME nodes, which hand finished segments off to DEEP STORAGE and the HISTORICAL nodes.]
Druid Data Ingestion (Our System)
[Diagram: AD PLAYER → ANALYTICS SERVER → KAFKA (cluster) → DRUID real-time node]
Druid Data Retrieval
[Same architecture diagram, highlighting the query path: client queries hit the BROKER nodes, which fan out to the REAL-TIME and HISTORICAL nodes.]
Coordinator Nodes
[Same architecture diagram, highlighting the COORDINATOR nodes, which use ZooKeeper and the MySQL metadata store to manage which segments the HISTORICAL nodes load.]
Druid Data Segment Propagation
[Same architecture diagram, highlighting segment propagation: REAL-TIME nodes persist segments to DEEP STORAGE, the COORDINATOR notices via the metadata store, and HISTORICAL nodes load the segments and start serving them.]
Our Production Stats
• Over 200 million events per day ingested into the Druid cluster
• 4 boxes with 8 cores, 64GB RAM and 1TB SSD
• 2 coordinator nodes (only one master)
• 2 real-time nodes
• 4 historical nodes (on each box)
Companies Using Druid
Questions?
Editor's Notes

  • #4 5m
  • #5 The case study is about one of our potential customers (based in the US), ViralGains. ViralGains is a viral video marketing platform that basically runs campaigns for different ad agencies and brands. Let’s see an overview of the ad world.
  • #6 5m Our role here. Requests arrive at high velocity – usually around 2 million requests per minute.
  • #7 10m In order to do so, what do we need to do now? The orange highlights are filters.
  • #9 12m Now we have data – it’s time to analyze it.
  • #10 Everybody knows this, but just to revisit the fundamentals.
  • #11 14m Before going ahead with the real-time analytics part, let’s see how ad-tech works.
  • #12 16m
  • #13 17m
  • #14 Redis Cassandra Druid
  • #15 18m Not talking about what Redis is – assuming you already know it. Using Redis was a too-early decision.
  • #17 20m
  • #20 Explain (one-liners): why inconsistent results – delays in the sync process and distributed counters. It is very much possible that we simply couldn’t configure it very well.
  • #21 24m
  • #22 25m
  • #23 27m What are filters? Not explaining why aggregations are fast.
  • #24 29m What is Kafka (in between)? Coloring strategy.
  • #25 35m
  • #26 35m
  • #27 45m
  • #28 50m External dependencies. Real-time node vs. historical node (data lies only here). What is deep storage (S3)? MySQL for metadata. What are segments?
  • #32 53m
  • #33 55m Events hit a processing node called the real-time node, which indexes (compacts) them, does the aggregation, and makes the data ready for querying almost immediately. Once data is indexed, the real-time node can offload older indexed data to deep storage, so it can process new data without being overwhelmed. It pushes segments to deep storage (generally S3) and writes to the metadata store that it wants to offload them. The coordinator sees that and asks a historical node to read the segments from S3; the historical node broadcasts once it has finished loading them into memory. Once the historical node announces it is ready to serve queries, the real-time node offloads.
  • #34 55m