The case study is about ViralGains, a US-based video marketing platform. I (Salil Kalia) delivered this presentation at the Great Indian Developer Summit (GIDS) 2016. It is a piece of the great work we have done at TO THE NEW Digital with our customer, ViralGains.
Here, I showcased Druid (http://druid.io) and its supporting technologies (Kafka/ZooKeeper) to demonstrate how they helped us build a stable real-time analytics system capturing hundreds of millions of analytics events per day. In the ad industry it becomes very important to be precise, or at least close to precise, because money is involved at every step (even for a single ad impression).
The case study included a demo and a short talk on the journey of moving from Redis to Cassandra and finally ending up on Druid, with outstanding performance.
1. Case Study: Real-time Analytics
With Druid
Salil Kalia, Tech Lead, TO THE NEW Digital
2. About Presenter
• Over 10 years in the software industry
• Working with TO THE NEW Digital since 2009
• Mainly using the Java/Groovy/Grails ecosystem
for development
• Working in the digital marketing domain for the
last few years
• Cassandra certified trainer
• Loves traveling and exploring new places
3. Agenda
Understanding the use-case
• Ad workflow
• Our use case
Experiments with technologies
• Redis
• Cassandra
Introduction to Druid
• Architecture
• Druid in production
• Demo
5. Understanding The Ad Workflow
[Diagram: the USER requests a web page; the PUBLISHER SERVER sends an ad request to the AD EXCHANGE, which returns ad content sourced from competing ad agencies (AD AGENCY-1, AD AGENCY-2, AD AGENCY-3).]
6. Examples From Our Use Case
• How many times has a video been viewed?
• How many times has a video been viewed in a particular
time span?
• How many times has a video been viewed in a particular
time span at a particular site?
• How many times has a video been viewed in a particular
time span at a particular site in a particular country?
• How many times has a video been viewed in a particular
time span at a particular site in a particular country on a
particular device?
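The progression of these questions is just one count with an ever-longer list of filters. A minimal sketch in Python (the event tuples and field names here are illustrative, not the production schema):

```python
from datetime import datetime

# Hypothetical sample events: (timestamp, video_id, site, country, device, event)
events = [
    (datetime(2016, 4, 26, 10, 0), "vid-1", "abc.com",  "US", "mobile",  "VIEW"),
    (datetime(2016, 4, 26, 11, 0), "vid-1", "abc.com",  "US", "desktop", "VIEW"),
    (datetime(2016, 4, 26, 12, 0), "vid-1", "abcd.com", "IN", "mobile",  "VIEW"),
    (datetime(2016, 4, 27, 10, 0), "vid-1", "abc.com",  "US", "mobile",  "VIEW"),
]

def count_views(events, start=None, end=None, site=None, country=None, device=None):
    """Count VIEW events, narrowing by each optional filter in turn."""
    total = 0
    for ts, vid, s, c, d, ev in events:
        if ev != "VIEW":
            continue
        if start is not None and ts < start:
            continue
        if end is not None and ts >= end:
            continue
        if site is not None and s != site:
            continue
        if country is not None and c != country:
            continue
        if device is not None and d != device:
            continue
        total += 1
    return total

# Each question above adds one more filter dimension:
print(count_views(events))                                     # all views
print(count_views(events, site="abc.com"))                     # ...at a site
print(count_views(events, site="abc.com",
                  country="US", device="mobile"))              # ...country + device
```

The hard part, of course, is doing this over hundreds of millions of events per day rather than a four-element list, which is what the rest of the talk is about.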
7. Video Events For The Analysis
• LOAD
• START
• PLAYING
• VIEW
• STOP / PAUSE
• FINISH
8. Event Data (Sample)
TIMESTAMP            | Ad  | Site     | Advertiser | Event  | Action
2011-01-01T01:01:27Z | 123 | abc.com  | Brand X    | Player | Load
2011-01-01T01:01:33Z | 234 | abcd.com | Brand Y    | Player | Load
2011-01-01T01:01:40Z | 123 | abc.com  | Brand X    | Player | Start
2011-01-01T01:01:45Z | 123 | abc.com  | Brand X    | Player | Playing
2011-01-01T01:01:50Z | 123 | abc.com  | Brand Y    | Player | Playing
2011-01-01T01:01:51Z | 123 | abc.com  | Brand X    | Player | Stop
9. What Is Analytics?
Processing the HISTORICAL data to:
•Understand potential trends
•Analyze the effects of certain decisions or events
•Evaluate the performance of a system
•Make better business decisions
11. Why (We Need) Real-time Analytics?
• Understand the real-time performance
• Control the velocity
• Avoid over serving
• Avoid under serving
• Control the targeting
12. Recap – Things We Understood
• How the ad-tech works (in general)
• Our use-case
• Different video player events
• We are expecting a huge amount of data coming
at a very high velocity.
14. Why We Picked Redis
• Great buzz in the market
• Highly scalable
• Easy to setup, configure and use
• We were not very clear with our use-case
15. Realizations From Redis
• Not a good fit for (big) time-series data
• Persistence is another issue – we can’t afford
to lose data
• There was a huge variety of keys all over the
place
• Complexity in the (application-side) code kept
increasing
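The "huge variety of keys" bullet deserves a number. With per-dimension counters, every combination of dimensions needs its own Redis key, so the key space grows multiplicatively. A back-of-the-envelope sketch (the cardinalities below are made-up, purely for illustration):

```python
# Sketch of why per-dimension counter keys explode in Redis.
# Every (video, day, site, country, device) combination needs its own
# counter key, e.g. "views:<video>:<day>:<site>:<country>:<device>",
# incremented with INCR on each event.
videos, days, sites, countries, devices = 1_000, 30, 500, 50, 3

keys_per_month = videos * days * sites * countries * devices
print(keys_per_month)  # 2,250,000,000 potential counter keys
```

Even if only a fraction of the combinations ever occur, the key space is unmanageable, and answering a new ad-hoc question means inventing yet another key family in application code.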
16. Working With Cassandra
• Very good support for time-series data
• Extremely good at writing data at very high
speed
• Very easy to scale horizontally
• Supports aggregations through counters
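For concreteness, a hedged sketch of the kind of counter-based schema this implies (the table and column names are hypothetical, not our production schema):

```python
# Hypothetical Cassandra counter schema: counts keyed by video and an
# hour-truncated time bucket, with the site as a further clustering key.
CREATE_TABLE = """
CREATE TABLE video_counts (
    video_id  text,
    bucket    timestamp,   -- hour-truncated event time
    site      text,
    views     counter,
    PRIMARY KEY ((video_id), bucket, site)
);
"""

def increment_stmt(video_id, bucket, site):
    # Counter columns can only be modified with UPDATE ... SET c = c + n;
    # every incoming event maps to one such increment.
    return (
        "UPDATE video_counts SET views = views + 1 "
        f"WHERE video_id = '{video_id}' AND bucket = '{bucket}' "
        f"AND site = '{site}';"
    )

print(increment_stmt("vid-1", "2016-04-26 10:00", "abc.com"))
```

Note the catch this design bakes in: every dimension you want to filter by must appear in the primary key up front, which is exactly why ad-hoc queries become a problem (next slide).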
19. What Didn’t Work With Cassandra
• Inconsistent results
• Unreliable counters
• No support for ad-hoc queries
• Nodes were crashing very frequently
20. Crossroads – What Next?
• Third-party tools on top of Cassandra for
better consistency
• DataStax Enterprise edition
• Taking a deeper dive into Cassandra to
reconfigure the whole architecture and setup
• Switching to a different technology
22. About Druid (http://druid.io)
• An open-source analytics data store
• Supports streaming data ingestion
• Flexible filters for ad-hoc queries
• Fast aggregations – sub-second queries
• Distributed, shared-nothing architecture
• Easily scalable
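To make the "flexible filters + fast aggregations" bullets concrete, here is a sketch of a Druid native timeseries query, built as a Python dict (datasource, dimension, and metric names are illustrative; the query shape itself follows Druid's native query format):

```python
import json

# Hourly view counts for one ad on one site: an arbitrary AND of
# selector filters plus a longSum aggregation, answered in sub-second
# time by a Druid broker.
query = {
    "queryType": "timeseries",
    "dataSource": "video_events",
    "granularity": "hour",
    "intervals": ["2016-04-26T00:00:00Z/2016-04-27T00:00:00Z"],
    "filter": {
        "type": "and",
        "fields": [
            {"type": "selector", "dimension": "ad", "value": "123"},
            {"type": "selector", "dimension": "site", "value": "abc.com"},
        ],
    },
    "aggregations": [
        {"type": "longSum", "name": "views", "fieldName": "view_count"}
    ],
}

# This JSON body would be POSTed to the broker's /druid/v2 endpoint.
print(json.dumps(query, indent=2))
```

Adding a country or device constraint is just one more selector in the filter list; no schema change, no new key family.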
23. Setting Up Druid In Production
[Diagram: AD PLAYER events flow to the ANALYTICS SERVER, which publishes them to a KAFKA (CLUSTER); the DRUID CLUSTER ingests from Kafka, with CASSANDRA alongside.]
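The player-to-Kafka leg of that pipeline amounts to serializing each player event as JSON and publishing it to a topic. A minimal sketch (field names mirror the sample event table earlier; the topic name and producer wiring are assumptions, not our production code):

```python
import json
import time

def make_event(ad_id, site, action, ts=None):
    """Serialize one player event as the JSON payload we would publish."""
    return json.dumps({
        "timestamp": ts or time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "ad": ad_id,
        "site": site,
        "event": "Player",
        "action": action,
    })

payload = make_event("123", "abc.com", "View", ts="2016-04-26T10:00:00Z")
# With a real client this would be, e.g.:
#   producer.send("video-events", payload.encode("utf-8"))
print(payload)
```

Kafka's role here is buffering: the players can emit millions of events per minute while Druid's realtime nodes consume at their own pace.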
32. Druid Data Segment Propagation
[Diagram: streaming data flows into REALTIME NODES; data/segments move to DEEP STORAGE; COORDINATOR NODES direct HISTORICAL NODES to load segments; queries are served by the realtime and historical nodes. External dependencies: ZOOKEEPER, MYSQL (metadata), DEEP STORAGE.]
33. Our Production Stats
• Over 200 million events per day ingested into the
Druid cluster
• 4 boxes with 8 cores, 64GB RAM, 1TB SSD
• 2 coordinator nodes (only one master)
• 2 real-time nodes
• 4 historical nodes (on each box)
So the case study is about one of our customers (based out of the US): ViralGains.
ViralGains is a viral video marketing platform. It basically runs campaigns for different ad agencies and brands.
Let’s see an overview of ad world.
5m
Our role here
Requests at high velocity
Usually, 2 million requests per minute
10m
In order to do so, What do we need to do now?
Orange highlights are filters
12m
Now we have data – it’s time to analyze it.
Everybody knows this, but just to revisit the fundamentals
14
Before going ahead on the real-time analytics part – let’s see how an ad-tech works.
16m
17m
Redis
Cassandra
Druid
18m
Not covering “What is Redis?” and so on; assuming you already know about it!
Picking Redis was a too-early decision
20m
Explain (one-liners):
Why inconsistent results: delays in the sync process
Distributed counters
It is quite possible that we simply couldn’t configure it well enough
24m
25m
27m
What are filters
Not explaining – why fast aggregations
29m
What is Kafka (in between) ?
Coloring strategy
35m
35m
45m
50m
External dependencies
Realtime node vs. historical node (the data lives only here)
What is Deep Storage – S3
MySQL for meta-data
What are segments
53m
55m
# Events hit a processing node called the realtime node; it indexes them (i.e. compacts them), does the aggregation, and the data is ready for querying in no time
# Once indexed, it can offload the older indexed data to deep storage, so that it can process new data and we don’t overwhelm it
# It pushes segments to deep storage (generally S3) and writes to the metadata store that it wants to offload them
# The coordinator sees that and asks a historical node to read from S3 and broadcast once it is done loading into memory
# Once the historical node broadcasts that it is ready for queries, the realtime node offloads
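The hand-off steps above can be sketched as a toy walkthrough (deliberately simplified; real Druid coordinates all of this through ZooKeeper announcements and the metadata store):

```python
# Toy model of Druid segment hand-off: one segment moving from a
# realtime node to a historical node via deep storage.
deep_storage, metadata, historical = {}, [], {}
realtime = {"seg-1": "indexed data"}

# 1. Realtime node pushes the indexed segment to deep storage (e.g. S3)
#    and records in the metadata store (MySQL) that it wants to offload.
deep_storage["seg-1"] = realtime["seg-1"]
metadata.append("seg-1")

# 2. The coordinator notices the new metadata entry and asks a
#    historical node to load the segment from deep storage.
for seg in metadata:
    historical[seg] = deep_storage[seg]

# 3. Only once the historical node announces it is serving the segment
#    does the realtime node drop its copy; queries never see a gap.
if "seg-1" in historical:
    realtime.pop("seg-1")

print(sorted(historical), sorted(realtime))
```

The key property this models is that ownership of a segment is transferred, never duplicated or dropped: at every step at least one node can answer queries for it.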