Meet Druid!
Real-time analytics with Druid at
Appsflyer
Appsflyer Flow
[Diagram: Publisher -> Click -> Install -> Advertiser]
Appsflyer as Marketing Platform
Fraud detection
Statistics
Attribution
Lifetime value
Retargeting
Prediction
A/B testing
Appsflyer Technology
● ~8B events / day
● Hundreds of machines in Amazon
● Tens of micro-services
[Diagram: micro-services connected via Apache Kafka, persisting to Amazon S3, MongoDB, Redshift, and Druid]
Realtime
● New buzzword
● Ingestion latency - seconds
● Query latency - seconds
Analytics
● Roll-up
○ Summarizing over a dimension
● Drill-down
○ Focusing (zooming in)
● Slicing and dicing
○ Reducing dimensions (slice)
○ Picking values of specific dimensions (dice)
● Pivoting
○ Rotating multi-dimensional cube
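The four operations above can be illustrated on a tiny in-memory fact table; a minimal Python sketch with toy data (not Druid's API, just the concepts):

```python
from collections import defaultdict

# Toy fact table: (country, campaign, hour, installs)
facts = [
    ("US", "camp_a", 10, 5),
    ("US", "camp_b", 10, 3),
    ("CA", "camp_a", 11, 2),
    ("US", "camp_a", 11, 4),
]

def roll_up(rows, dim_index):
    """Roll-up: summarize over one dimension by dropping it."""
    out = defaultdict(int)
    for row in rows:
        key = tuple(v for i, v in enumerate(row[:-1]) if i != dim_index)
        out[key] += row[-1]
    return dict(out)

def dice(rows, dim_index, values):
    """Dice: keep only rows whose dimension value is in the picked set."""
    return [r for r in rows if r[dim_index] in values]

# Roll up over the hour dimension (index 2): totals per (country, campaign)
print(roll_up(facts, 2))  # {('US', 'camp_a'): 9, ('US', 'camp_b'): 3, ('CA', 'camp_a'): 2}
# Dice on country == "US"
print(dice(facts, 0, {"US"}))
```

Slicing is the one-value special case of dicing; drill-down is the inverse of roll-up (re-introducing a dimension).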
Analytics in 3D
We tried...
● MongoDB
○ Operational issues
○ Performance is not great
● Redshift
○ Concurrency limits
● Aurora (MySQL)
○ Aggregations are not optimized
● MemSQL
○ Insufficient performance
○ Too pricey
● Cassandra
○ Not flexible
Druid
● Storage optimized for analytics
● Lambda architecture inside
● JSON-based query language
● Developed by an analytics SaaS company
● Free and open source
● Scalable to petabytes...
Druid Storage
● Columnar
● Inverted index
● Immutable segments
Columnar Storage
● Original data: 100MB
● Queried columns: 10MB
● Compressed: 3MB
Index
● Values are dictionary encoded
{“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …}
● Bitmap for every dimension value (used by filters)
“USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
● Column values (used by aggregation queries)
[2, 1, 3, 15, 1, 1, 2, 8, 7]
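A toy sketch of these two structures, dictionary encoding and per-value bitmaps (illustrative only, not Druid's actual implementation):

```python
def dictionary_encode(values):
    """Map each distinct dimension value to a small integer id."""
    ids = {}
    encoded = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids) + 1
        encoded.append(ids[v])
    return ids, encoded

def bitmap_index(values):
    """One bitmap per distinct value: 1 where the row holds that value."""
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps.setdefault(v, [0] * len(values))
        bitmaps[v][i] = 1
    return bitmaps

rows = ["Canada", "USA", "Mexico", "USA", "USA"]
ids, col = dictionary_encode(rows)
bm = bitmap_index(rows)
# ids       -> {'Canada': 1, 'USA': 2, 'Mexico': 3}
# col       -> [1, 2, 3, 2, 2]    (column values, scanned by aggregations)
# bm['USA'] -> [0, 1, 0, 1, 1]    (bitmap, ANDed/ORed by filters)
```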
Data Segments
● Per time interval
○ Skip segments when querying
● Immutable
○ Cache friendly
○ No locking
● Versioned (MVCC)
○ No locking
○ Read-write concurrency
Data Ingestion
[Diagram: real-time data arrives via streaming and hand-off, historical data via batch indexing; the Broker serves queries over both]
Real-time Ingestion
● Via Real-Time Node and Firehose
○ No redundancy or HA, thus not recommended
● Via Indexing Service and Tranquility API
○ Core API
○ Integrations with Streaming Frameworks
○ HTTP Server
○ Kafka Consumer
Batch Ingestion
● File based (HDFS, S3, …)
● Indexers
○ Internal Indexer
■ For datasets < 1G
○ External Hadoop Cluster
○ Spark Indexer
■ Work in progress
Ingestion Spec
● Parsing configuration (Flat JSON, *SV)
● Dimensions
● Metrics
● Granularity
○ Segment granularity
○ Query granularity
● I/O configuration
○ Where to read data from
● Tuning configuration
○ Indexer tuning
● Partitioning and replication
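A hedged sketch of how these parts fit together in a spec, in the 0.8.x style; field names follow the Druid docs of that era, and the data source, dimensions, and metrics are illustrative — verify against your version before use:

```python
import json

# Hedged sketch of an ingestion spec (0.8.x-era field names).
spec = {
    "dataSchema": {
        "dataSource": "inappevents",
        "parser": {  # parsing configuration
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {
                    "dimensions": ["app_id", "media_source", "campaign", "country"]
                },
            },
        },
        "metricsSpec": [  # metrics
            {"type": "count", "name": "events_count"},
            {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
        ],
        "granularitySpec": {  # segment vs. query granularity
            "segmentGranularity": "HOUR",
            "queryGranularity": "MINUTE",
        },
    },
    "ioConfig": {"type": "realtime"},      # where to read data from
    "tuningConfig": {"type": "realtime"},  # indexer tuning
}
print(json.dumps(spec, indent=2))
```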
Real-time ingestion
[Diagram: indexing tasks (Task 1, Task 2) overlapping across interval windows over time]
Minimum indexing slots = Data sources x Partitions x Replicas x 2
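The slot formula above as a quick calculation (the x2 accounts for tasks of adjacent interval windows overlapping during hand-off):

```python
def min_indexing_slots(data_sources, partitions, replicas):
    # x2: tasks for the current interval window overlap tasks
    # for the previous one during segment hand-off
    return data_sources * partitions * replicas * 2

# e.g. 3 data sources, 2 partitions, 2 replicas
print(min_indexing_slots(3, 2, 2))  # -> 24
```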
Query Types
● Group by
○ grouping by multiple dimensions
● Top N
○ like grouping by a single dimension
● Timeseries
○ w/o grouping over dimensions
● Search
○ Dimensions lookup
● Time boundary
○ Find available data timeframe
● Metadata queries
Tips for Querying
● Prefer topN over groupBy
● Prefer timeseries over topN and groupBy
● Use limits (and priorities)
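For example, a groupBy on a single dimension can usually be rewritten as a topN; a hedged sketch following the 0.8.x query docs (data source, dimension, and metric names are illustrative):

```python
# topN equivalent of grouping on a single dimension.
topn_query = {
    "queryType": "topN",
    "dataSource": "inappevents",
    "dimension": "media_source",   # single dimension only
    "metric": "events_count",      # ranking metric
    "threshold": 10,               # the N in topN (a built-in limit)
    "granularity": "all",
    "aggregations": [{"type": "count", "name": "events_count"}],
    "intervals": ["2015-12-01/2016-01-01"],
    "context": {"priority": 1},    # query priority
}
```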
Query Spec
● Data source
● Dimensions
● Interval
● Filters
● Aggregations
● Post aggregations
● Granularity
● Context (query configuration)
● Limit
Sample Query
~# curl -X POST -d@query.json -H "Content-Type: application/json" http://druidbroker:8082/druid/v2?pretty
{
  "queryType": "groupBy",
  "dataSource": "inappevents",
  "granularity": "hour",
  "dimensions": ["media_source", "campaign"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "app_id", "value": "com.comuto" },
      { "type": "selector", "dimension": "country", "value": "RU" }
    ]
  },
  "aggregations": [
    { "type": "count", "name": "events_count" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
  ],
  "intervals": [ "2015-12-01T00:00:00.000/2016-01-01T00:00:00.000" ]
}
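The same query can also be POSTed programmatically; a minimal stdlib Python sketch (the broker host/port are illustrative, as in the curl example):

```python
import json
from urllib import request

query = {
    "queryType": "groupBy",
    "dataSource": "inappevents",
    "granularity": "hour",
    "dimensions": ["media_source", "campaign"],
    "filter": {"type": "and", "fields": [
        {"type": "selector", "dimension": "app_id", "value": "com.comuto"},
        {"type": "selector", "dimension": "country", "value": "RU"},
    ]},
    "aggregations": [
        {"type": "count", "name": "events_count"},
        {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
    ],
    "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"],
}

def post_query(broker_url, q):
    """POST a query to the broker and return the decoded JSON result."""
    req = request.Request(
        broker_url,
        data=json.dumps(q).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# post_query("http://druidbroker:8082/druid/v2?pretty", query)
```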
Caching
● Historical node level
○ By segment
● Broker level
○ By segment and query
○ “groupBy” is disabled on purpose!
● By default - local caching
● In production - use memcached
Load Rules
● Can be defined
○ On data source
○ On “tier”
● What can be set
○ Replication factor
○ Load period
○ Drop period
● Can be used to separate “hot” data from “cold” data
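A hedged sketch of a hot/cold rule chain; rule types follow the Druid docs of this era (the `tieredReplicants` field name and tier names should be verified against your version):

```python
# Hedged sketch: load/drop rules separating "hot" from "cold" data.
rules = [
    # keep the last month on the "hot" tier, 2 replicas
    {"type": "loadByPeriod", "period": "P1M", "tieredReplicants": {"hot": 2}},
    # keep the last year on the default tier, 1 replica
    {"type": "loadByPeriod", "period": "P1Y", "tieredReplicants": {"_default_tier": 1}},
    # drop everything older
    {"type": "dropForever"},
]
```

Rules are evaluated top to bottom; the first match wins, so the catch-all drop rule goes last.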
Druid Components
[Diagram: Historical Nodes, Real-time Nodes, Coordinator, Middle Manager, Overlord, Indexing Service, Broker Nodes, Deep Storage, Metadata Storage]
Druid Components
[Diagram: same components as above, plus a Cache and a Load Balancer in front of the Brokers]
Druid Components (Explained)
● Coordinator
○ Manages segments
● Real-time Nodes
○ Pull data in real time and index it
● Historical Nodes
○ Keep historical segments
● Overlord
○ Accepts tasks and distributes them to Middle Managers
● Middle Manager
○ Executes submitted tasks via Peons
● Broker Nodes
○ Route queries to Real-time and Historical nodes, merge results
● Deep Storage
○ Backup of segments (HDFS, S3, …)
Failover
● Coordinator and Overlord
○ HA
● Real-time nodes
○ Tasks are replicated
○ Pool of nodes
● Historical nodes
○ Data is replicated
○ Pool of nodes
○ All segments are backed up in the deep storage
● Brokers
○ Pool of nodes
○ Load balancer at the front
Druid at Appsflyer
[Diagram: multiple Druid Sink services, S3, and the Tranquility API]
● Probably not needed anymore due to native support in the Tranquility package
Druid in Production
● Provisioning using Chef
● r3.8xlarge (sample configuration is OK)
● Redundancy for coordinator and overlord (node per AZ)
● Historical and real-time nodes are spread between AZ
● LB - Consul from HashiCorp
● Service discovery - Consul again
● Memcached
● Monitoring via Graphite Emitter extension
○ https://github.com/druid-io/druid/pull/1978
● Alerting via Sensu
IAP Distribution
● 3 different node types (instead of 6)
● Unpack and run
● Some useful wrappers
● Built-in examples for quick start
● Commercial support
● PlyQL, Pivot inside
http://imply.io
Tips
● ZooKeeper is heavily used
○ Choose appropriate hardware/network for ZK machines
● Use latest version (0.8.3)
○ Restartable tasks
○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960)
○ Data sketches library
● All exceptions are useful
When Not to Choose Druid?
● When data is not time-series
● When data cardinality is high
● When number of output rows is high
● When setup costs must be avoided
Non-time Series Workarounds
● Must have some timestamp still
● Rebuild everything to order by your timestamp
● Or, use single-dimension partitioning
○ Segments partitioned by timestamp first, then by dimension range
○ Find optimal target segment size
Still, please don’t use Druid for non-time-series data!
Tools: Pivot
Tools: Panoramix
Thank you!
