Meet Druid!
Real-time analytics with Druid at Appsflyer
Appsflyer Flow
[Diagram: Publisher, Click, Install, Advertiser]
Appsflyer as Marketing Platform
Fraud detection
Statistics
Attribution
Lifetime value
Retargeting
Prediction
A/B testing
Appsflyer Technology
● ~8B events / day
● Hundreds of machines in Amazon
● Tens of micro-services
[Diagram: micro-services around Apache Kafka, with data stores DB, Amazon S3, MongoDB, Redshift and Druid]
Realtime
● New buzzword
● Ingestion latency - seconds
● Query latency - seconds
Analytics
● Roll-up
○ Summarizing over a dimension
● Drill-down
○ Focusing (zooming in)
● Slicing and dicing
○ Reducing dimensions (slice)
○ Picking values of specific dimensions (dice)
● Pivoting
○ Rotating multi-dimensional cube
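The operations above can be sketched on a tiny in-memory data set. This is plain Python, not Druid code; the sample rows and the helper names (`roll_up`, `dice`, `pivot`) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical events: two dimensions (country, media_source) and one metric.
EVENTS = [
    {"country": "US", "media_source": "facebook", "installs": 3},
    {"country": "US", "media_source": "google", "installs": 2},
    {"country": "CA", "media_source": "facebook", "installs": 1},
    {"country": "CA", "media_source": "google", "installs": 4},
]


def roll_up(rows, dims, metric):
    """Sum `metric` per combination of `dims`; all other dimensions are summarized away."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[metric]
    return dict(totals)


def dice(rows, **wanted):
    """Keep only the rows whose dimensions have the requested values."""
    return [r for r in rows if all(r[d] == v for d, v in wanted.items())]


def pivot(rows, row_dim, col_dim, metric):
    """Rotate the cube: one dimension becomes rows, the other columns.

    Assumes each (row_dim, col_dim) pair appears at most once.
    """
    table = defaultdict(dict)
    for r in rows:
        table[r[row_dim]][r[col_dim]] = r[metric]
    return dict(table)


by_country = roll_up(EVENTS, ["country"], "installs")      # roll-up
us_detail = roll_up(dice(EVENTS, country="US"),            # drill-down into US
                    ["media_source"], "installs")
by_source = pivot(EVENTS, "media_source", "country", "installs")  # pivot
```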
Analytics in 3D
We tried...
● MongoDB
○ Operational issues
○ Performance is not great
● Redshift
○ Concurrency limits
● Aurora (MySQL)
○ Aggregations are not optimized
● MemSQL
○ Insufficient performance
○ Too pricy
● Cassandra
○ Not flexible
Druid
● Storage optimized for analytics
● Lambda architecture inside
● JSON-based query language
● Developed by an analytics SaaS company
● Free and open source
● Scalable to petabytes...
Druid Storage
● Columnar
● Inverted index
● Immutable segments
Columnar Storage
● Original data: 100MB
● Queried columns: 10MB
● Compressed: 3MB
Index
● Values are dictionary encoded
{“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …}
● Bitmap for every dimension value (used by filters)
“USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0]
● Column values (used by aggregation queries)
[2, 1, 3, 15, 1, 1, 2, 8, 7]
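A minimal sketch of how these three structures work together, using the dictionary and the encoded column from the slide. The `installs` metric column and the function names are assumptions for illustration, and plain Python lists stand in for the compressed bitmaps a real system would use.

```python
def build_bitmaps(encoded_column):
    """Return {dictionary_id: bitmap}; bitmap[i] is 1 when row i holds that id."""
    bitmaps = {}
    for row, value_id in enumerate(encoded_column):
        bitmap = bitmaps.setdefault(value_id, [0] * len(encoded_column))
        bitmap[row] = 1
    return bitmaps


def filter_sum(bitmap, metric_column):
    """Sum a metric over the rows selected by a bitmap."""
    return sum(m for bit, m in zip(bitmap, metric_column) if bit)


# Dictionary and encoded column values as shown on the slide.
dictionary = {"USA": 1, "Canada": 2, "Mexico": 3}
country_ids = [2, 1, 3, 15, 1, 1, 2, 8, 7]

# Hypothetical metric column, one value per row.
installs = [10, 20, 30, 40, 50, 60, 70, 80, 90]

bitmaps = build_bitmaps(country_ids)
usa_bitmap = bitmaps[dictionary["USA"]]
usa_installs = filter_sum(usa_bitmap, installs)

# A filter such as "USA OR Canada" becomes a bitwise OR of two bitmaps.
usa_or_canada = [a | b for a, b in zip(usa_bitmap, bitmaps[dictionary["Canada"]])]
```

The bitmap computed for "USA" matches the one on the slide, since id 1 sits at rows 1, 4 and 5 of the encoded column.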
Data Segments
● Per time interval
○ Skip segments when querying
● Immutable
○ Cache friendly
○ No locking
● Versioned (MVCC)
○ No locking
○ Read-write concurrency
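A rough sketch of why per-interval, versioned segments help at query time: segments outside the queried interval are skipped, and only the newest version of an interval is read. The segment tuples, the version labels and `segments_to_scan` are hypothetical and simplified (versions are compared as strings and replacement segments are assumed to cover exactly the same interval).

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap when each starts before the other ends."""
    return a_start < b_end and b_start < a_end


# Hypothetical daily segments: (start, end, version).
SEGMENTS = [
    (datetime(2015, 12, 1), datetime(2015, 12, 2), "v1"),
    (datetime(2015, 12, 2), datetime(2015, 12, 3), "v1"),
    (datetime(2015, 12, 2), datetime(2015, 12, 3), "v2"),  # same day, re-indexed
    (datetime(2015, 12, 3), datetime(2015, 12, 4), "v1"),
]


def segments_to_scan(segments, q_start, q_end):
    """Pick the newest version of every segment interval that overlaps the query."""
    newest = {}
    for start, end, version in segments:
        if not overlaps(start, end, q_start, q_end):
            continue  # skipped without reading any data
        key = (start, end)
        if key not in newest or version > newest[key]:
            newest[key] = version
    return sorted((s, e, v) for (s, e), v in newest.items())


selected = segments_to_scan(SEGMENTS, datetime(2015, 12, 2), datetime(2015, 12, 3))
```

Because segments are immutable, a reader that picked "v1" before "v2" was published can finish its scan without locks.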
Data Ingestion
[Diagram labels: Real-time Data, Historical Data, Broker, Streaming, Hand-off, Batch indexing, Query]
Real-time Ingestion
● Via Real-Time Node and Firehose
○ No redundancy or HA, thus not recommended
● Via Indexing Service and Tranquility API
○ Core API
○ Integrations with Streaming Frameworks
○ HTTP Server
○ Kafka Consumer
Batch Ingestion
● File based (HDFS, S3, …)
● Indexers
○ Internal Indexer
■ For datasets < 1 GB
○ External Hadoop Cluster
○ Spark Indexer
■ Work in progress
Ingestion Spec
● Parsing configuration (Flat JSON, *SV)
● Dimensions
● Metrics
● Granularity
○ Segment granularity
○ Query granularity
● I/O configuration
○ Where to read data from
● Tuning configuration
○ Indexer tuning
● Partitioning and replication
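As a rough illustration, the parts listed above map onto the top-level sections of a spec roughly as follows. This is a sketch written as a Python dict. The data source, dimensions and metrics are borrowed from the sample query in this deck, while the timestamp column name and both granularities are assumptions. `ioConfig` and `tuningConfig` are left empty on purpose because their contents depend on the indexing task type; check exact field names against the documentation for your Druid version.

```python
import json

ingestion_spec = {
    "dataSchema": {
        "dataSource": "inappevents",
        # Parsing configuration: flat JSON input.
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                # Assumed column name; use whatever your events carry.
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {
                    "dimensions": ["app_id", "country", "media_source", "campaign"]
                },
            },
        },
        # Metrics computed at ingestion time.
        "metricsSpec": [
            {"type": "count", "name": "events_count"},
            {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
        ],
        # Segment granularity vs. query granularity (values are assumptions).
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "MINUTE",
        },
    },
    "ioConfig": {},      # where to read data from (task-type specific)
    "tuningConfig": {},  # indexer tuning (task-type specific)
}

spec_json = json.dumps(ingestion_spec, indent=2)
```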
Real-time ingestion
[Diagram: Task 1 and Task 2 along a time axis, with the interval window marked]
Minimum indexing slots = Data sources x Partitions x Replicas x 2
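The formula is simple enough to check by hand, but a small helper makes capacity planning explicit. The function name and the example numbers are hypothetical; the factor of 2 presumably covers the period when the task for the previous interval is still running while the next one has already started.

```python
def min_indexing_slots(data_sources, partitions, replicas):
    """Minimum indexing slots = data sources x partitions x replicas x 2."""
    return data_sources * partitions * replicas * 2


# Example: 3 data sources, each with 2 partitions and 2 replicas.
slots = min_indexing_slots(data_sources=3, partitions=2, replicas=2)
```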
Query Types
● Group by
○ grouping by multiple dimensions
● Top N
○ like grouping by a single dimension
● Timeseries
○ Without grouping over dimensions
● Search
○ Dimensions lookup
● Time boundary
○ Find available data timeframe
● Metadata queries
Tips for Querying
● Prefer topN over groupBy
● Prefer timeseries over topN and groupBy
● Use limits (and priorities)
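To illustrate the first tip, a groupBy over a single dimension can usually be rewritten as a topN query. The helper below is a sketch: `group_by_to_top_n` and `SHARED_FIELDS` are hypothetical names, and only fields common to both query types are copied, so groupBy-only options are dropped. Keep in mind that topN results are approximate.

```python
# Fields that groupBy and topN queries have in common.
SHARED_FIELDS = ("dataSource", "granularity", "filter", "aggregations",
                 "postAggregations", "intervals", "context")


def group_by_to_top_n(query, metric, threshold):
    """Rewrite a single-dimension groupBy query as a topN query."""
    dims = query.get("dimensions", [])
    if len(dims) != 1:
        raise ValueError("topN needs exactly one dimension")
    top_n = {k: query[k] for k in SHARED_FIELDS if k in query}
    top_n.update(queryType="topN", dimension=dims[0],
                 metric=metric, threshold=threshold)
    return top_n


group_by = {
    "queryType": "groupBy",
    "dataSource": "inappevents",
    "granularity": "all",
    "dimensions": ["media_source"],
    "aggregations": [{"type": "count", "name": "events_count"}],
    "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"],
}

# Top 10 media sources ordered by events_count.
top_n = group_by_to_top_n(group_by, metric="events_count", threshold=10)
```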
Query Spec
● Data source
● Dimensions
● Interval
● Filters
● Aggregations
● Post aggregations
● Granularity
● Context (query configuration)
● Limit
Sample Query
~# curl -X POST -d@query.json -H "Content-Type: application/json" "http://druidbroker:8082/druid/v2?pretty"
{
  "queryType": "groupBy",
  "dataSource": "inappevents",
  "granularity": "hour",
  "dimensions": ["media_source", "campaign"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "app_id", "value": "com.comuto" },
      { "type": "selector", "dimension": "country", "value": "RU" }
    ]
  },
  "aggregations": [
    { "type": "count", "name": "events_count" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
  ],
  "intervals": [ "2015-12-01T00:00:00.000/2016-01-01T00:00:00.000" ]
}
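The same request can be sent from code. Below is a minimal sketch using only the Python standard library; the broker URL is the one from the curl command above, while `build_request`, `run_query` and the 30-second timeout are illustrative choices.

```python
import json
import urllib.request

BROKER_URL = "http://druidbroker:8082/druid/v2?pretty"  # host taken from the slide


def build_request(query, url=BROKER_URL):
    """Wrap a query dict into an HTTP POST request with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=BROKER_URL, timeout=30):
    """Send the query to the broker and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(query, url), timeout=timeout) as resp:
        return json.load(resp)


# Usage (needs a reachable broker):
#   with open("query.json") as f:
#       rows = run_query(json.load(f))
```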
Caching
● Historical node level
○ By segment
● Broker level
○ By segment and query
○ “groupBy” is disabled on purpose!
● By default - local caching
● In production - use memcached
Load Rules
● Can be defined
○ On data source
○ On “tier”
● What can be set
○ Replication factor
○ Load period
○ Drop period
● Can be used to separate “hot” data from “cold” data
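A simplified model of how such rules could separate hot from cold data. The rule dicts below are a hypothetical structure, not Druid's actual rule JSON, and the tier names, periods and replica counts are made up. The sketch assumes rules are evaluated in order and that the first match wins.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified rules: load period, tier and replication factor.
RULES = [
    {"action": "load", "max_age": timedelta(days=30), "tier": "hot", "replicas": 2},
    {"action": "load", "max_age": timedelta(days=365), "tier": "cold", "replicas": 1},
    {"action": "drop", "max_age": None},  # anything older is dropped
]


def match_rule(segment_end, now, rules=RULES):
    """Return the first rule whose period covers the segment, or None."""
    age = now - segment_end
    for rule in rules:
        if rule["max_age"] is None or age <= rule["max_age"]:
            return rule
    return None


NOW = datetime(2016, 1, 1)
recent = match_rule(datetime(2015, 12, 20), NOW)   # 12 days old
older = match_rule(datetime(2015, 6, 1), NOW)      # about 7 months old
ancient = match_rule(datetime(2014, 1, 1), NOW)    # 2 years old
```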
Druid Components
Historical Nodes
Real-time Nodes
Coordinator
Middle Manager
Overlord
Indexing Service
Broker Nodes
Deep Storage
Metadata Storage
Cache
Load Balancer
Druid Components (Explained)
● Coordinator
○ Manages segments
● Real-time Nodes
○ Pull data in real time and index it
● Historical Nodes
○ Keep historical segments
● Overlord
○ Accepts tasks and distributes them to Middle Managers
● Middle Manager
○ Executes submitted tasks via Peons
● Broker Nodes
○ Route queries to Real-time and Historical nodes and merge the results
● Deep Storage
○ Segments backup (HDFS, S3, …)
Failover
● Coordinator and Overlord
○ HA
● Real-time nodes
○ Tasks are replicated
○ Pool of nodes
● Historical nodes
○ Data is replicated
○ Pool of nodes
○ All segments are backed up in the deep storage
● Brokers
○ Pool of nodes
○ Load balancer at the front
Druid at Appsflyer
[Diagram: Druid Sink (three instances), S3, Tranquility API]
Probably not needed anymore due to native support in Tranquility package
Druid in Production
● Provisioning using Chef
● r3.8xlarge (sample configuration is OK)
● Redundancy for coordinator and overlord (node per AZ)
● Historical and real-time nodes are spread across AZs
● LB - Consul from HashiCorp
● Service discovery - Consul again
● Memcached
● Monitoring via Graphite Emitter extension
○ https://github.com/druid-io/druid/pull/1978
● Alerting via Sensu
IAP (Imply Analytics Platform) Distribution
● 3 different node types (instead of 6)
● Unpack and run
● Some useful wrappers
● Built-in examples for quick start
● Commercial support
● PlyQL, Pivot inside
http://imply.io
Tips
● ZooKeeper is heavily used
○ Choose appropriate hardware/network for ZK machines
● Use latest version (0.8.3)
○ Restartable tasks
○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960)
○ Data sketches library
● All exceptions are useful
When Not to Choose Druid?
● When data is not time-series
● When data cardinality is high
● When number of output rows is high
● When setup costs must be avoided
Non-time Series Workarounds
● Data must still have some timestamp
● Rebuild everything to order by your timestamp
● Or, use single-dimension partitioning
○ Segments partitioned by timestamp first, then by dimension range
○ Find optimal target segment size
Still, please don’t use Druid for non-time series!
Tools: Pivot
Tools: Panoramix
Thank you!

Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 

Real-time analytics with Druid at Appsflyer

  • 1. Meet Druid! Real-time analytics with Druid at Appsflyer
  • 3. Appsflyer as Marketing Platform: fraud detection, statistics, attribution, lifetime value, retargeting, prediction, A/B testing
  • 4. Appsflyer Technology ● ~8B events / day ● Hundreds of machines in Amazon ● Tens of micro-services [Diagram labels: Apache Kafka, services, DB, Amazon S3, MongoDB, Redshift, Druid]
  • 5. Realtime ● New buzzword ● Ingestion latency - seconds ● Query latency - seconds
  • 6. Analytics ● Roll-up ○ Summarizing over a dimension ● Drill-down ○ Focusing (zooming in) ● Slicing and dicing ○ Reducing dimensions (slice) ○ Picking values of specific dimensions (dice) ● Pivoting ○ Rotating multi-dimensional cube
  • 8. We tried... ● MongoDB ○ Operational issues ○ Performance is not great ● Redshift ○ Concurrency limits ● Aurora (MySQL) ○ Aggregations are not optimized ● MemSQL ○ Insufficient performance ○ Too pricey ● Cassandra ○ Not flexible
  • 9. Druid ● Storage optimized for analytics ● Lambda architecture inside ● JSON-based query language ● Developed by an analytics SaaS company ● Free and open source ● Scalable to petabytes...
  • 10. Druid Storage ● Columnar ● Inverted index ● Immutable segments
  • 11. Columnar Storage Original data: 100MB Queried columns: 10MB Compressed: 3MB
  • 12. Index ● Values are dictionary encoded {“USA” -> 1, “Canada” -> 2, “Mexico” -> 3, …} ● Bitmap for every dimension value (used by filters) “USA” -> [0, 1, 0, 0, 1, 1, 0, 0, 0] ● Column values (used by aggregation queries) [2, 1, 3, 15, 1, 1, 2, 8, 7]
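The three structures on slide 12 can be illustrated with a small sketch. This is not Druid's implementation: the row values are made up so that the "USA" bitmap matches the slide, ids are assigned in order of first appearance (an assumption), and the numeric column is treated as a metric to be summed.

```python
from typing import Dict, List, Tuple


def build_index(values: List[str]) -> Tuple[Dict[str, int], List[int], Dict[str, List[int]]]:
    """Dictionary-encode a string column and build one bitmap per distinct value."""
    dictionary: Dict[str, int] = {}
    encoded: List[int] = []
    for value in values:
        if value not in dictionary:
            # Assumption: ids start at 1 and follow first appearance.
            dictionary[value] = len(dictionary) + 1
        encoded.append(dictionary[value])
    # One bit per row: 1 where the row holds this value, else 0.
    bitmaps = {v: [1 if x == v else 0 for x in values] for v in dictionary}
    return dictionary, encoded, bitmaps


def filtered_sum(bitmap: List[int], metric: List[int]) -> int:
    """Sum the metric only over rows selected by the bitmap (a filter)."""
    return sum(m for bit, m in zip(bitmap, metric) if bit)


# Hypothetical rows, chosen so the "USA" bitmap equals the one on the slide.
country = ["Canada", "USA", "Mexico", "Mexico", "USA", "USA", "Canada", "Mexico", "Canada"]
metric = [2, 1, 3, 15, 1, 1, 2, 8, 7]  # column values from the slide

dictionary, encoded, bitmaps = build_index(country)
usa_total = filtered_sum(bitmaps["USA"], metric)
```

A filter on two values can then be answered by OR-ing their bitmaps before touching the metric column, which is why the bitmaps are kept per dimension value.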
  • 13. Data Segments ● Per time interval ○ Skip segments when querying ● Immutable ○ Cache friendly ○ No locking ● Versioned (MVCC) ○ No locking ○ Read-write concurrency
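A rough sketch of why per-interval, versioned segments help: a query only needs segments whose interval overlaps the queried interval, and readers simply pick the newest version for each interval. The `Segment` type and the plain string comparison of versions are assumptions made for this illustration, not Druid's data model.

```python
from datetime import datetime
from typing import Dict, List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: datetime  # inclusive
    end: datetime    # exclusive
    version: str     # assumed to sort lexicographically, newest last


def prune(segments: List[Segment], q_start: datetime, q_end: datetime) -> List[Segment]:
    """Keep only segments whose interval overlaps the query interval."""
    return [s for s in segments if s.start < q_end and q_start < s.end]


def newest_versions(segments: List[Segment]) -> List[Segment]:
    """For each interval, keep the segment with the highest version."""
    newest: Dict[Tuple[datetime, datetime], Segment] = {}
    for s in segments:
        key = (s.start, s.end)
        if key not in newest or s.version > newest[key].version:
            newest[key] = s
    return list(newest.values())


day1 = (datetime(2015, 12, 1), datetime(2015, 12, 2))
day2 = (datetime(2015, 12, 2), datetime(2015, 12, 3))
segments = [Segment(*day1, "v1"), Segment(*day1, "v2"), Segment(*day2, "v1")]

# A query for day 2 never touches the day 1 segments.
hits = prune(newest_versions(segments), *day2)
```

Because segments are immutable, a re-indexed interval is published as a new version next to the old one, so no reader has to be locked out while it is written.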
  • 14. Data Ingestion Real-time Data Historical Data Broker Streaming Hand-off Batch indexing Query
  • 15. Real-time Ingestion ● Via Real-Time Node and Firehose ○ No redundancy or HA, thus not recommended ● Via Indexing Service and Tranquility API ○ Core API ○ Integrations with Streaming Frameworks ○ HTTP Server ○ Kafka Consumer
  • 16. Batch Ingestion ● File based (HDFS, S3, …) ● Indexers ○ Internal Indexer ■ For datasets < 1 GB ○ External Hadoop Cluster ○ Spark Indexer ■ Work in progress
  • 17. Ingestion Spec ● Parsing configuration (Flat JSON, *SV) ● Dimensions ● Metrics ● Granularity ○ Segment granularity ○ Query granularity ● I/O configuration ○ Where to read data from ● Tuning configuration ○ Indexer tuning ● Partitioning and replication
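To show how the parts of slide 17 fit together, here is a skeleton spec written as a Python dictionary. The field names follow the layout of Druid's batch ingestion spec as commonly documented for the 0.8.x line and should be checked against the documentation for the version in use. The timestamp column, the `monetary` field usage, the directory, and the partition size are hypothetical; the data source and dimension names are borrowed from the sample query on slide 22.

```python
import json

ingestion_spec = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "inappevents",
            # Parsing configuration: flat JSON with a timestamp column.
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},  # hypothetical column
                    "dimensionsSpec": {
                        "dimensions": ["app_id", "country", "media_source", "campaign"]
                    },
                },
            },
            # Metrics computed at ingestion time.
            "metricsSpec": [
                {"type": "count", "name": "events_count"},
                {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
            ],
            # Segment granularity vs. query granularity.
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
                "intervals": ["2015-12-01/2016-01-01"],
            },
        },
        # I/O configuration: where to read data from (hypothetical path).
        "ioConfig": {
            "type": "index",
            "firehose": {"type": "local", "baseDir": "/data/events", "filter": "*.json"},
        },
        # Tuning configuration for the indexer (hypothetical value).
        "tuningConfig": {"type": "index", "targetPartitionSize": 5000000},
    },
}

spec_json = json.dumps(ingestion_spec, indent=2)
```

Query granularity controls how finely timestamps are rolled up inside a segment, while segment granularity controls how the data is cut into files, so the first should never be coarser than the second.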
  • 18. Real-time ingestion [Diagram: Task 1 and Task 2 on a time axis, with interval and window marked] Minimum indexing slots = Data sources x Partitions x Replicas x 2
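The sizing formula from slide 18 as a small helper, with a worked example. The factor of two presumably covers the overlap shown in the diagram, where the task for the previous interval is still running while the task for the next interval has already started; the input numbers below are invented.

```python
def minimum_indexing_slots(data_sources: int, partitions: int, replicas: int) -> int:
    """Minimum indexing slots = data sources x partitions x replicas x 2."""
    if min(data_sources, partitions, replicas) < 1:
        raise ValueError("all inputs must be at least 1")
    return data_sources * partitions * replicas * 2


# Hypothetical cluster: 3 data sources, 2 partitions each, 2 replicas.
slots = minimum_indexing_slots(data_sources=3, partitions=2, replicas=2)
```

For this example the cluster needs at least 24 slots before any batch indexing tasks are taken into account.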
  • 19. Query Types ● Group by ○ grouping by multiple dimensions ● Top N ○ like grouping by a single dimension ● Timeseries ○ w/o grouping over dimensions ● Search ○ Dimensions lookup ● Time boundary ○ Find available data timeframe ● Metadata queries
  • 20. Tips for Querying ● Prefer topN over groupBy ● Prefer timeseries over topN and groupBy ● Use limits (and priorities)
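As an illustration of the first tip, a ranking over a single dimension can be written as a topN query instead of a groupBy. The sketch below only builds the query body; the data source, dimension, and filter are borrowed from the sample query on slide 22, and the threshold is an arbitrary choice.

```python
import json
from typing import Any, Dict


def top_media_sources(app_id: str, threshold: int = 10) -> Dict[str, Any]:
    """Build a topN query: the top `threshold` media sources by event count."""
    return {
        "queryType": "topN",
        "dataSource": "inappevents",
        "granularity": "all",
        "dimension": "media_source",   # topN takes exactly one dimension
        "metric": "events_count",      # rank by this aggregation
        "threshold": threshold,        # acts as the limit
        "filter": {"type": "selector", "dimension": "app_id", "value": app_id},
        "aggregations": [{"type": "count", "name": "events_count"}],
        "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"],
    }


query_body = json.dumps(top_media_sources("com.comuto"), indent=2)
```

Note that topN results are approximate when the dimension has many values, which is the price of the better performance.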
  • 21. Query Spec ● Data source ● Dimensions ● Interval ● Filters ● Aggregations ● Post aggregations ● Granularity ● Context (query configuration) ● Limit
  • 22. Sample Query
      ~# curl -X POST -d@query.json -H "Content-Type: application/json" http://druidbroker:8082/druid/v2?pretty

      {
        "queryType": "groupBy",
        "dataSource": "inappevents",
        "granularity": "hour",
        "dimensions": ["media_source", "campaign"],
        "filter": {
          "type": "and",
          "fields": [
            { "type": "selector", "dimension": "app_id", "value": "com.comuto" },
            { "type": "selector", "dimension": "country", "value": "RU" }
          ]
        },
        "aggregations": [
          { "type": "count", "name": "events_count" },
          { "type": "doubleSum", "name": "revenue", "fieldName": "monetary" }
        ],
        "intervals": [ "2015-12-01T00:00:00.000/2016-01-01T00:00:00.000" ]
      }
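The same request can be issued from code rather than curl. The sketch below uses only the Python standard library and the broker URL from the slide; `build_request` and `run_query` are hypothetical helper names, and the timeout is an arbitrary choice. Nothing is sent until `run_query` is called.

```python
import json
import urllib.request
from typing import Any, Dict, List

BROKER_URL = "http://druidbroker:8082/druid/v2?pretty"  # URL from the slide

query: Dict[str, Any] = {
    "queryType": "groupBy",
    "dataSource": "inappevents",
    "granularity": "hour",
    "dimensions": ["media_source", "campaign"],
    "filter": {
        "type": "and",
        "fields": [
            {"type": "selector", "dimension": "app_id", "value": "com.comuto"},
            {"type": "selector", "dimension": "country", "value": "RU"},
        ],
    },
    "aggregations": [
        {"type": "count", "name": "events_count"},
        {"type": "doubleSum", "name": "revenue", "fieldName": "monetary"},
    ],
    "intervals": ["2015-12-01T00:00:00.000/2016-01-01T00:00:00.000"],
}


def build_request(q: Dict[str, Any], url: str = BROKER_URL) -> urllib.request.Request:
    """Wrap the query in a JSON POST request, like `curl -X POST -d@query.json`."""
    body = json.dumps(q).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def run_query(q: Dict[str, Any], url: str = BROKER_URL, timeout: float = 30.0) -> List[Any]:
    """Send the query to the broker and decode the JSON result rows."""
    with urllib.request.urlopen(build_request(q, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_request(query)
```

Calling `run_query(query)` against a reachable broker returns the decoded result list, one entry per hour and group.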
  • 23. Caching ● Historical node level ○ By segment ● Broker level ○ By segment and query ○ “groupBy” is disabled on purpose! ● By default - local caching ● In production - use memcached
  • 24. Load Rules ● Can be defined ○ On data source ○ On “tier” ● What can be set ○ Replication factor ○ Load period ○ Drop period ● Can be used to separate “hot” data from “cold” data
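A hot/cold split of the kind slide 24 describes might look like the rule list below. The rule types (`loadByPeriod`, `dropForever`) and the `tieredReplicants` field are Druid coordinator rule concepts; the tier name `hot`, the periods, and the replica counts are hypothetical. Rules are evaluated in order and the first matching rule applies.

```python
import json

load_rules = [
    # Last month: 2 replicas on the hypothetical "hot" tier, 1 on the default tier.
    {
        "type": "loadByPeriod",
        "period": "P1M",
        "tieredReplicants": {"hot": 2, "_default_tier": 1},
    },
    # Last year: 1 replica on the default ("cold") tier only.
    {
        "type": "loadByPeriod",
        "period": "P1Y",
        "tieredReplicants": {"_default_tier": 1},
    },
    # Anything older is dropped from historical nodes.
    {"type": "dropForever"},
]

rules_json = json.dumps(load_rules, indent=2)
```

Dropped segments stay in deep storage, so widening a load period later brings them back without re-indexing.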
  • 25. Druid Components Historical Nodes Real-time Nodes Coordinator Middle Manager Overlord Indexing Service Broker Nodes Deep Storage Metadata Storage
  • 26. Druid Components Historical Nodes Real-time Nodes Coordinator Middle Manager Overlord Indexing Service Broker Nodes Deep Storage Metadata Storage Cache Load Balancer
  • 27. Druid Components (Explained) ● Coordinator ○ Manages segments ● Real-time Nodes ○ Pull data in real time and index it ● Historical Nodes ○ Keep historical segments ● Overlord ○ Accepts tasks and distributes them to Middle Managers ● Middle Manager ○ Executes submitted tasks via Peons ● Broker Nodes ○ Route queries to Real-time and Historical nodes, merge results ● Deep Storage ○ Segments backup (HDFS, S3, …)
  • 28. Failover ● Coordinator and Overlord ○ HA ● Real-time nodes ○ Tasks are replicated ○ Pool of nodes ● Historical nodes ○ Data is replicated ○ Pool of nodes ○ All segments are backed up in the deep storage ● Brokers ○ Pool of nodes ○ Load balancer at the front
  • 30. Druid Sink Druid Sink Tranquility API Probably not needed anymore due to native support in Tranquility package
  • 31. Druid in Production ● Provisioning using Chef ● r3.8xlarge (sample configuration is OK) ● Redundancy for coordinator and overlord (node per AZ) ● Historical and real-time nodes are spread between AZ ● LB - Consul from Hashicorp ● Service discovery - Consul again ● Memcached ● Monitoring via Graphite Emitter extension ○ https://github.com/druid-io/druid/pull/1978 ● Alerting via Sensu
  • 32. IAP Distribution ● 3 different node types (instead of 6) ● Unpack and run ● Some useful wrappers ● Built-in examples for quick start ● Commercial support ● PyQL, Pivot inside http://imply.io
  • 33. Tips ● ZooKeeper is heavily used ○ Choose appropriate hardware/network for ZK machines ● Use latest version (0.8.3) ○ Restartable tasks ○ Indexing time improvement! (https://github.com/druid-io/druid/pull/1960) ○ Data sketches library ● All exceptions are useful
  • 34. When Not to Choose Druid? ● When data is not time-series ● When data cardinality is high ● When number of output rows is high ● When setup costs must be avoided
  • 35. Non-time Series Workarounds ● Must have some timestamp still ● Rebuild everything to order by your timestamp ● Or, use single-dimension partitioning ○ Segments partitioned by timestamp first, then by dimension range ○ Find optimal target segment size Still, please don’t use Druid for non-time series!