Why data warehouses cannot support hot analytics

© Imply
Why data warehouses cannot support hot analytics
Gian Merlino
CTO and cofounder, Imply
June 24, 2020
1

© Imply© Imply
● Define “temperature-tiered” analytics
● Discuss workflows surrounding “hot analytics”
● Introduce real-time data platforms as a solution
● Discuss recent performance benchmark results
● Discuss how this differentiates from popular cloud data warehouses
What we’ll discuss
2

© Imply© Imply
What is hot analytics?
3
Fast analytics using fresh data for all
● Fast - sub-second query response
● Fresh - streaming, real-time (hot) data
● For all - self-service UX for business people (beyond analysts)

© Imply
Warm
● Most data is available
● Moderate cost
● Moderate performance
Cold
● All data is available
● Low cost
● Not performance sensitive
Not all workloads are equal
Hot
● Business-critical datasets
● Always online
● Latency extremely important
4

© Imply
Hot data powers monitoring and exploration
Drag-and-drop ad-hoc explorationMonitor trends in dashboards
5

© Imply
Where hot analytics is required
Real-time data
warehousing
Metrics
& APM
IoT/timeseriesDigital ads
OLAP queries Network
performance
ClickstreamsRisk/fraud
6

© Imply
Data warehouses cannot support hot analytics
7
Ingestion
Batch processing
Stale
data
Minutes to
hours
Compute
Data lake
Data warehouse
Slow
queries
Seconds to
minutes
Use
Reports
Analytics tools
Specialists
static reporting
10s of seconds

© Imply
Phase I
Lay the groundwork
Phase II
Build the case
Phase III
Make it
defensible
How to add hot analytics
8
Integration Computing Visualization
Stream processing Real-time analytics
database
Self-service analytics
Batch processing
Data lake Reports
Data warehouse Analytics tools

© Imply© Imply
A real-time data platform is required for hot analytics
9
Defining characteristics of a real-time data platform
● Native streaming ingestion and instant data visibility
● Vertically integrated storage, compute and visualization
● Separately scaling ingestion and querying
● Server tiering
● Query prioritization
+ Plus the standard items you expect in a modern analytics platform
+ Cloud-native
+ Elastic
+ Secure
+ Self-healing
+ Zero downtime for software upgrades

© Imply© Imply
Introducing Druid
10
● “high performance”: bread-and-butter fast scan rates + ‘tricks’
● “real-time”: streaming ingestion, interactive query speeds
● “analytics”: counting, ranking, groupBy, time trend
● “database”: the cluster stores a copy of your data and helps you manage it

© Imply© Imply
Introducing Druid
11
● Column oriented
● High concurrency
● Scalable to 1000+ servers
● Continuous, real-time ingest
● Query through SQL
● Target query latency sub-second to a few seconds

© Imply© Imply
Druid vs BigQuery: 3x performance advantage
13
Source: Apache Druid and Google BigQuery Performance Evaluation, 2020

© Imply© Imply
Druid vs BigQuery: 12x price-performance advantage
14
Source: Apache Druid and Google BigQuery Performance Evaluation, 2020
$194,600 monthly
savings at high
concurrency.

© Imply
Comparing Druid and cloud data warehouses
15
Secondary indexes
Cloud-native w/o
speed compromise
● CDWs do not offer indexes beyond the partition key.
● Druid offers space-efficient, compressed secondary indexes.
● CDWs like Snowflake & BigQuery need to retrieve data from remote storage during query execution,
which slows them down.
● Druid preloads data before queries happen.
● Popular CDWs limit latency or throughput of real-time ingestion, if they offer it at all.
● Pull-based ingestion in Druid enables tens of millions of inserts/sec in true real-time.
Pull-based ingestion

© Imply
Comparing Druid and cloud data warehouses
16
Approximate
algorithms
Dynamic lanes
● CDWs offer some approximate algorithms, like count distinct and quantiles.
● Druid offers a wider array of approximate algorithms than any other popular database.
When approximate algorithms are acceptable, they improve performance dramatically.
● CDWs are not designed for running interactive applications.
● Druid uses a ‘fast lane’ to prioritize interactive queries over reporting queries.
Server tiers
● CDWs give you one size that must fit all when it comes to performance and cost.
● Druid lets you control which data gets ‘hot’ vs. ‘warm’ vs. ‘cold’ performance.

© Imply
DWs + BI tools
Our customers have unlocked many new capabilities
Clickstream analytics
Digital marketing
Networking
Manufacturing
Universal
Speed to understanding
(real-time queries)
Speed to visibility
(real-time ingest)
Reporting
Root cause analysis
DoS protection
Feature roll-out monitoring
Real time monitoring
Overspend protection
Interactive dashboards
Yield fraud analysis
Sales enablement (inventory
discovery)
Traditional BI
Commonality analysis
Dynamic A/B testing
Capacity planning / peering
17

Why data warehouses cannot support hot analytics

More Related Content

What's hot

Similar to Why data warehouses cannot support hot analytics

More from Imply

Recently uploaded

Why data warehouses cannot support hot analytics