The document summarizes Druid, an open-source data analytics platform, and how it has enhanced the data platform at Branch to enable better business decisions. Key features of Druid include sub-second aggregate queries, real-time analytics dashboards, and live queries for unique users. Druid has helped the platform scale to several hundred terabytes of data and thousands of queries per second while supporting new analytics applications, ad hoc reporting, and exploratory analysis. Future plans include improving the query service and migrating components to technologies such as Spark, Flink, Mesos, and Docker.
1. Druid @ Branch
Enhancing the Data Platform for Better Business Decisions
● Sub-second aggregate queries
● Real-time analytics dashboard
● Live queries for uniques
● Instant exploratory analytics
Technology powering the Data Platform
Performance & Scale Considerations
Opportunity for new Apps
Monitoring
Provisioning & deployment
Future Plan
Demo
Biswajit Das
Data Team
@biswajit @branch.io
Muwon Lum
Infra Team
@muwon @branch.io
2. Agenda
The Business Problem
Technology Gap
Data Platform Features
Performance and Scale
Opportunity for new Apps
Monitoring
Provisioning & deployment
Future Plan
3. The Business Problem
● Cannot perform live complex queries
● Lack of instant access to aggregate data
● Gathering unique impressions is time-consuming
● No single pane of glass to view all data
● Ad hoc queries require pre-aggregation
Instant access to information at scale was a problem
4. Technology Gap
Key/Value Store (Aerospike)
● Pre-compute all permutations of possible user queries.
● Range scans on event data.
● Pre-computing all permutations of ad hoc queries can lead to result sets that grow exponentially with the number of columns in a data set and can require hours of pre-processing time.
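The growth described above can be made concrete with a short calculation. The sketch below assumes that each ad hoc query groups by some subset of the available dimensions. That is an illustrative simplification, not a description of the actual Aerospike schema, and the dimension names are hypothetical.

```python
from itertools import combinations


def count_groupings(num_dimensions: int) -> int:
    """Number of distinct dimension subsets a query could group by."""
    return 2 ** num_dimensions


def enumerate_groupings(dimensions):
    """List every subset explicitly (only practical for a handful of dimensions)."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


# Hypothetical dimension names, for illustration only.
dims = ["country", "platform", "campaign"]
groupings = enumerate_groupings(dims)

# The explicit enumeration agrees with the closed-form count.
assert len(groupings) == count_groupings(len(dims))

for n in (3, 10, 20, 30):
    print(f"{n:>2} dimensions -> {count_groupings(n):,} possible groupings")
```

Each added dimension doubles the number of groupings, and this count does not yet include filter values or time ranges, which would multiply the pre-computed result set further.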
9. Performance And Scale
● 25-node production cluster (Druid only)
● Several hundred terabytes of raw data indexed
● Typical complex datasource with 30 dimensions and 2 metrics
● Real-time indexer handling ~30k events per second, peaking at 50k
● Hourly bucketed data to support different time zones
● Sustained 2B+ events per day
● Thousands of queries per second for online dashboard applications
● Serving 11 million queries every day
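As a rough cross-check of the figures above, the daily totals can be converted to average per-second rates. This is a back-of-the-envelope sketch only: it treats "2B+" as exactly 2 billion (a lower bound) and assumes load is spread evenly across the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(daily_total: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


events_per_day = 2_000_000_000  # lower bound taken from "2B+ events per day"
queries_per_day = 11_000_000    # "11 million queries every day"

avg_events = average_per_second(events_per_day)
avg_queries = average_per_second(queries_per_day)

print(f"Average ingest rate: {avg_events:,.0f} events/s")
print(f"Average query rate:  {avg_queries:,.0f} queries/s")
```

The ingest average comes out somewhat below the ~30k events per second quoted for the real-time indexer, as expected when starting from a lower bound on daily volume. The query figure is a daily mean and does not describe peak dashboard load.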
10. Opportunity for new Apps
● Druid helped us support new analytics easily.
● Ad hoc reporting.
● Visualizing data.
● Exploratory analytics.
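An ad hoc report of the kind listed above might be expressed as a Druid native timeseries query. The sketch below only builds the JSON body. The datasource name, metric column names, and dates are hypothetical, and the slides do not show the queries Branch actually runs.

```python
import json


def build_hourly_uniques_query(datasource, start, end, time_zone="UTC"):
    """Build a Druid native timeseries query with a sum and a unique count.

    All names passed in or used for columns are illustrative assumptions.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        # Period granularity buckets results by hour in the given time zone.
        "granularity": {"type": "period", "period": "PT1H", "timeZone": time_zone},
        "intervals": [f"{start}/{end}"],
        "aggregations": [
            # Assumes a rolled-up "count" metric exists in the datasource.
            {"type": "longSum", "name": "events", "fieldName": "count"},
            # Assumes "user_unique" was ingested as a hyperUnique metric.
            {"type": "hyperUnique", "name": "unique_users", "fieldName": "user_unique"},
        ],
    }


query = build_hourly_uniques_query(
    "click_events",            # hypothetical datasource
    "2016-01-01T00:00:00Z",    # hypothetical interval start
    "2016-01-02T00:00:00Z",    # hypothetical interval end
    time_zone="America/Los_Angeles",
)
print(json.dumps(query, indent=2))
```

The resulting JSON would be sent to a Druid Broker for execution. The `hyperUnique` aggregator returns an approximate distinct count, which is the usual trade-off that makes live queries for uniques fast.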
14. Future Plan
● More robust query service.
● Migrate the Hadoop indexer to Spark.
● Actively working to migrate the streaming pipeline to Flink.
● Evaluating a move of the whole Druid stack to Mesos/Docker.