3. AppsFlyer as a Marketing Platform
● Fraud detection
● Statistics
● Attribution
● Lifetime value
● Retargeting
● Prediction
● A/B testing
4. AppsFlyer Technology
● ~8B events / day
● Hundreds of machines in Amazon
● Tens of microservices
[Architecture diagram: tens of services exchanging events over Apache Kafka, writing to Amazon S3, MongoDB, Redshift, and Druid]
8. We tried...
● MongoDB
○ Operational issues
○ Performance is not great
● Redshift
○ Concurrency limits
● Aurora (MySQL)
○ Aggregations are not optimized
● MemSQL
○ Insufficient performance
○ Too pricey
● Cassandra
○ Not flexible
9. Druid
● Storage optimized for analytics
● Lambda architecture inside
● JSON-based query language
● Developed by an analytics SaaS company
● Free and open source
● Scalable to petabytes...
12. Index
● Values are dictionary encoded
{“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …}
● Bitmap for every dimension value (used by filters)
“USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
● Column values (used by aggregation queries)
[2, 1, 3, 15, 1, 1, 2, 8, 7]
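As a sketch of how filters use these structures: a native filter like the one below (dimension names here are hypothetical) is answered by looking up the bitmap for each value's dictionary ID and ANDing the bitmaps together; only the surviving row offsets are then read from the value columns.
{
  "type": "and",
  "fields": [
    { "type": "selector", "dimension": "country", "value": "USA" },
    { "type": "selector", "dimension": "platform", "value": "ios" }
  ]
}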
13. Data Segments
● Per time interval
○ Skip segments when querying
● Immutable
○ Cache friendly
○ No locking
● Versioned (MVCC)
○ No locking
○ Read-write concurrency
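To make the versioning concrete: a segment is identified by data source, time interval, and version (plus a partition number when an interval is split), e.g. this hypothetical ID:
events_2016-03-01T00:00:00.000Z_2016-03-02T00:00:00.000Z_2016-03-05T08:00:00.000Z
Re-indexing the same interval produces a new version; readers keep using the old segment until the coordinator atomically switches them over, which is what buys the lock-free read-write concurrency.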
15. Real-time Ingestion
● Via Real-Time Node and Firehose
○ No redundancy or high availability, hence not recommended
● Via Indexing Service and Tranquility API
○ Core API
○ Integrations with Streaming Frameworks
○ HTTP Server
○ Kafka Consumer
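As an illustration of the HTTP path: Tranquility Server accepts events as plain JSON POSTed to a per-datasource endpoint (/v1/post/<dataSource> in the Tranquility versions of this era; the event fields below are hypothetical):
{
  "timestamp": "2016-03-01T12:00:00Z",
  "app_id": "com.example.app",
  "country": "USA",
  "event_type": "install"
}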
16. Batch Ingestion
● File based (HDFS, S3, …)
● Indexers
○ Internal Indexer
■ For datasets < 1 GB
○ External Hadoop Cluster
○ Spark Indexer
■ Work in progress
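The indexer is selected by the task type. A hedged fragment, assuming the 0.8/0.9 task API (bucket and paths are hypothetical; the ingestion spec from the next slide completes the task):
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "s3n://my-bucket/events/2016-03-01/" }
    }
  }
}
For datasets under ~1 GB, the internal indexer ("type": "index") is the simpler choice.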
17. Ingestion Spec
● Parsing configuration (flat JSON, CSV/TSV, …)
● Dimensions
● Metrics
● Granularity
○ Segment granularity
○ Query granularity
● I/O configuration
○ Where to read data from
● Tuning configuration
○ Indexer tuning
● Partitioning and replication
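Putting those pieces together, a hedged sketch of an ingestion spec (field values are hypothetical; layout follows the 0.8/0.9 docs):
{
  "dataSchema": {
    "dataSource": "events",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "iso" },
        "dimensionsSpec": { "dimensions": ["app_id", "country", "platform"] }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "doubleSum", "name": "revenue", "fieldName": "revenue" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "HOUR",
      "intervals": ["2016-03-01/2016-03-02"]
    }
  },
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": { "type": "static", "paths": "s3n://my-bucket/events/2016-03-01/" }
  },
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": { "type": "hashed", "targetPartitionSize": 5000000 }
  }
}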
19. Query Types
● Group by
○ grouping by multiple dimensions
● Top N
○ Like grouping by a single dimension, ordered by a metric and limited to N rows
● Timeseries
○ Aggregations over time, without grouping by dimensions
● Search
○ Dimensions lookup
● Time boundary
○ Find available data timeframe
● Metadata queries
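For concreteness, two of these in Druid's JSON query language (data source, metric, and intervals are hypothetical). A timeseries query:
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "day",
  "intervals": ["2016-03-01/2016-03-08"],
  "aggregations": [ { "type": "longSum", "name": "installs", "fieldName": "count" } ]
}
And the topN equivalent over a single dimension:
{
  "queryType": "topN",
  "dataSource": "events",
  "dimension": "country",
  "metric": "installs",
  "threshold": 10,
  "granularity": "all",
  "intervals": ["2016-03-01/2016-03-08"],
  "aggregations": [ { "type": "longSum", "name": "installs", "fieldName": "count" } ]
}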
20. Tips for Querying
● Prefer topN over groupBy
● Prefer timeseries over topN and groupBy
● Use limits (and priorities)
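Both limits and priorities live in the query body. A hedged groupBy fragment (values are hypothetical; "priority" and "timeout" are standard query-context keys):
"limitSpec": {
  "type": "default",
  "limit": 1000,
  "columns": [ { "dimension": "installs", "direction": "descending" } ]
},
"context": { "priority": 100, "timeout": 60000 }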
23. Caching
● Historical node level
○ By segment
● Broker level
○ By segment and query
○ Caching of “groupBy” results is disabled on purpose!
● By default - local caching
● In production - use memcached
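A hedged sketch of the memcached setup on the broker (exact property names vary a bit across Druid versions; hosts are hypothetical):
# broker runtime.properties
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=memcached
druid.cache.hosts=memcached1:11211,memcached2:11211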
24. Load Rules
● Can be defined
○ On data source
○ On “tier”
● What can be set
○ Replication factor
○ Load period
○ Drop period
● Can be used to separate “hot” data from “cold” data
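For example, a hot/cold rule chain might look like this (tier names and periods are hypothetical; the coordinator evaluates rules top-down, first match wins):
[
  { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "hot": 2 } },
  { "type": "loadByPeriod", "period": "P12M", "tieredReplicants": { "_default_tier": 2 } },
  { "type": "dropForever" }
]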
27. Druid Components (Explained)
● Coordinator
○ Manages segments
● Real-time Nodes
○ Pull data in real time and index it
● Historical Nodes
○ Serve historical segments
● Overlord
○ Accepts tasks and distributes them to Middle Managers
● Middle Manager
○ Executes submitted tasks via Peons
● Broker Nodes
○ Route queries to real-time and historical nodes, merge the results
● Deep Storage
○ Segment backup (HDFS, S3, …)
28. Failover
● Coordinator and Overlord
○ HA
● Real-time nodes
○ Tasks are replicated
○ Pool of nodes
● Historical nodes
○ Data is replicated
○ Pool of nodes
○ All segments are backed up in the deep storage
● Brokers
○ Pool of nodes
○ Load balancer at the front
31. Druid in Production
● Provisioning using Chef
● r3.8xlarge instances (the bundled sample configuration is OK)
● Redundancy for coordinator and overlord (node per AZ)
● Historical and real-time nodes are spread between AZ
● LB - Consul from HashiCorp
● Service discovery - Consul again
● Memcached
● Monitoring via Graphite Emitter extension
○ https://github.com/druid-io/druid/pull/1978
● Alerting via Sensu
32. IAP (Imply Analytics Platform) Distribution
● 3 different node types (instead of 6)
● Unpack and run
● Some useful wrappers
● Built-in examples for quick start
● Commercial support
● PlyQL, Pivot inside
http://imply.io
33. Tips
● ZooKeeper is heavily used
○ Choose appropriate hardware/network for ZK machines
● Use latest version (0.8.3)
○ Restartable tasks
○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960)
○ Data sketches library
● All exceptions are informative - read them
34. When Not to Choose Druid?
● When data is not time-series
● When data cardinality is high
● When number of output rows is high
● When setup costs must be avoided
35. Non-time Series Workarounds
● A timestamp column is still required
● Either rebuild everything, ordered by your surrogate timestamp
● Or use single-dimension partitioning (sketched below)
○ Segments partitioned by timestamp first, then by dimension range
○ Find optimal target segment size
Still, please don’t use Druid for non-time series!
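If you ignore that plea anyway, a hedged sketch of single-dimension partitioning in the Hadoop tuningConfig (per the 0.8/0.9 docs; dimension and target size are hypothetical):
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "dimension",
    "partitionDimension": "app_id",
    "targetPartitionSize": 5000000
  }
}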