© Imply
Why data warehouses cannot support hot analytics
Gian Merlino
CTO and cofounder, Imply
June 24, 2020
1
© Imply© Imply
● Define “temperature-tiered” analytics
● Discuss workflows surrounding “hot analytics”
● Introduce real-time data platforms as a solution
● Discuss recent performance benchmark results
● Discuss how this differentiates from popular cloud data warehouses
What we’ll discuss
2
© Imply© Imply
What is hot analytics?
3
Fast analytics using fresh data for all
● Fast - sub-second query response
● Fresh - streaming, real-time (hot) data
● For all - self-service UX for business people (beyond analysts)
© Imply
Warm
● Most data is available
● Moderate cost
● Moderate performance
Cold
● All data is available
● Low cost
● Not performance sensitive
Not all workloads are equal
Hot
● Business-critical datasets
● Always online
● Latency extremely important
4
© Imply
Hot data powers monitoring and exploration
Drag-and-drop ad-hoc explorationMonitor trends in dashboards
5
© Imply
Where hot analytics is required
Real-time data
warehousing
Metrics
& APM
IoT/timeseriesDigital ads
OLAP queries Network
performance
ClickstreamsRisk/fraud
6
© Imply
Data warehouses cannot support hot analytics
7
Ingestion
Batch processing
Stale
data
Minutes to
hours
Compute
Data lake
Data warehouse
Slow
queries
Seconds to
minutes
Use
Reports
Analytics tools
Specialists
static reporting
10s of seconds
© Imply
Phase I
Lay the groundwork
Phase II
Build the case
Phase III
Make it
defensible
How to add hot analytics
8
Integration Computing Visualization
Stream processing Real-time analytics
database
Self-service analytics
Batch processing
Data lake Reports
Data warehouse Analytics tools
© Imply© Imply
A real-time data platform is required for hot analytics
9
Defining characteristics of a real-time data platform
● Native streaming ingestion and instant data visibility
● Vertically integrated storage, compute and visualization
● Separately scaling ingestion and querying
● Server tiering
● Query prioritization
+ Plus the standard items you expect in a modern analytics platform
+ Cloud-native
+ Elastic
+ Secure
+ Self-healing
+ Zero downtime for software upgrades
© Imply© Imply
Introducing Druid
10
● “high performance”: bread-and-butter fast scan rates + ‘tricks’
● “real-time”: streaming ingestion, interactive query speeds
● “analytics”: counting, ranking, groupBy, time trend
● “database”: the cluster stores a copy of your data and helps you manage it
© Imply© Imply
Introducing Druid
11
● Column oriented
● High concurrency
● Scalable to 1000+ servers
● Continuous, real-time ingest
● Query through SQL
● Target query latency sub-second to a few seconds
© Imply© Imply
Not convinced? We ran a benchmark.
12
© Imply© Imply
Druid vs BigQuery: 3x performance advantage
13
Source: Apache Druid and Google BigQuery Performance Evaluation, 2020
© Imply© Imply
Druid vs BigQuery: 12x price-performance advantage
14
Source: Apache Druid and Google BigQuery Performance Evaluation, 2020
$194,600 monthly
savings at high
concurrency.
© Imply
Comparing Druid and cloud data warehouses
15
Secondary indexes
Cloud-native w/o
speed compromise
● CDWs do not offer indexes beyond the partition key.
● Druid offers space-efficient, compressed secondary indexes.
● CDWs like Snowflake & BigQuery need to retrieve data from remote storage during query execution,
which slows them down.
● Druid preloads data before queries happen.
● Popular CDWs limit latency or throughput of real-time ingestion, if they offer it at all.
● Pull-based ingestion in Druid enables tens of millions of inserts/sec in true real-time.
Pull-based ingestion
© Imply
Comparing Druid and cloud data warehouses
16
Approximate
algorithms
Dynamic lanes
● CDWs offer some approximate algorithms, like count distinct and quantiles.
● Druid offers a wider array of approximate algorithms than any other popular database.
When approximate algorithms are acceptable, they improve performance dramatically.
● CDWs are not designed for running interactive applications.
● Druid uses a ‘fast lane’ to prioritize interactive queries over reporting queries.
Server tiers
● CDWs give you one size that must fit all when it comes to performance and cost.
● Druid lets you control which data gets ‘hot’ vs. ‘warm’ vs. ‘cold’ performance.
© Imply
DWs + BI tools
Our customers have unlocked many new capabilities
Clickstream analytics
Digital marketing
Networking
Manufacturing
Universal
Speed to understanding
(real-time queries)
Speed to visibility
(real-time ingest)
Reporting
Root cause analysis
DoS protection
Feature roll-out monitoring
Real time monitoring
Overspend protection
Interactive dashboards
Yield fraud analysis
Sales enablement (inventory
discovery)
Traditional BI
Commonality analysis
Dynamic A/B testing
Capacity planning / peering
17
© Imply
Questions?
18
© Imply
Thank you!
19

Why data warehouses cannot support hot analytics

  • 1.
    © Imply Why datawarehouses cannot support hot analytics Gian Merlino CTO and cofounder, Imply June 24, 2020 1
  • 2.
    © Imply© Imply ●Define “temperature-tiered” analytics ● Discuss workflows surrounding “hot analytics” ● Introduce real-time data platforms as a solution ● Discuss recent performance benchmark results ● Discuss how this differentiates from popular cloud data warehouses What we’ll discuss 2
  • 3.
    © Imply© Imply Whatis hot analytics? 3 Fast analytics using fresh data for all ● Fast - sub-second query response ● Fresh - streaming, real-time (hot) data ● For all - self-service UX for business people (beyond analysts)
  • 4.
    © Imply Warm ● Mostdata is available ● Moderate cost ● Moderate performance Cold ● All data is available ● Low cost ● Not performance sensitive Not all workloads are equal Hot ● Business-critical datasets ● Always online ● Latency extremely important 4
  • 5.
    © Imply Hot datapowers monitoring and exploration Drag-and-drop ad-hoc explorationMonitor trends in dashboards 5
  • 6.
    © Imply Where hotanalytics is required Real-time data warehousing Metrics & APM IoT/timeseriesDigital ads OLAP queries Network performance ClickstreamsRisk/fraud 6
  • 7.
    © Imply Data warehousescannot support hot analytics 7 Ingestion Batch processing Stale data Minutes to hours Compute Data lake Data warehouse Slow queries Seconds to minutes Use Reports Analytics tools Specialists static reporting 10s of seconds
  • 8.
    © Imply Phase I Laythe groundwork Phase II Build the case Phase III Make it defensible How to add hot analytics 8 Integration Computing Visualization Stream processing Real-time analytics database Self-service analytics Batch processing Data lake Reports Data warehouse Analytics tools
  • 9.
    © Imply© Imply Areal-time data platform is required for hot analytics 9 Defining characteristics of a real-time data platform ● Native streaming ingestion and instant data visibility ● Vertically integrated storage, compute and visualization ● Separately scaling ingestion and querying ● Server tiering ● Query prioritization + Plus the standard items you expect in a modern analytics platform + Cloud-native + Elastic + Secure + Self-healing + Zero downtime for software upgrades
  • 10.
    © Imply© Imply IntroducingDruid 10 ● “high performance”: bread-and-butter fast scan rates + ‘tricks’ ● “real-time”: streaming ingestion, interactive query speeds ● “analytics”: counting, ranking, groupBy, time trend ● “database”: the cluster stores a copy of your data and helps you manage it
  • 11.
    © Imply© Imply IntroducingDruid 11 ● Column oriented ● High concurrency ● Scalable to 1000+ servers ● Continuous, real-time ingest ● Query through SQL ● Target query latency sub-second to a few seconds
  • 12.
    © Imply© Imply Notconvinced? We ran a benchmark. 12
  • 13.
    © Imply© Imply Druidvs BigQuery: 3x performance advantage 13 Source: Apache Druid and Google BigQuery Performance Evaluation, 2020
  • 14.
    © Imply© Imply Druidvs BigQuery: 12x price-performance advantage 14 Source: Apache Druid and Google BigQuery Performance Evaluation, 2020 $194,600 monthly savings at high concurrency.
  • 15.
    © Imply Comparing Druidand cloud data warehouses 15 Secondary indexes Cloud-native w/o speed compromise ● CDWs do not offer indexes beyond the partition key. ● Druid offers space-efficient, compressed secondary indexes. ● CDWs like Snowflake & BigQuery need to retrieve data from remote storage during query execution, which slows them down. ● Druid preloads data before queries happen. ● Popular CDWs limit latency or throughput of real-time ingestion, if they offer it at all. ● Pull-based ingestion in Druid enables tens of millions of inserts/sec in true real-time. Pull-based ingestion
  • 16.
    © Imply Comparing Druidand cloud data warehouses 16 Approximate algorithms Dynamic lanes ● CDWs offer some approximate algorithms, like count distinct and quantiles. ● Druid offers a wider array of approximate algorithms than any other popular database. When approximate algorithms are acceptable, they improve performance dramatically. ● CDWs are not designed for running interactive applications. ● Druid uses a ‘fast lane’ to prioritize interactive queries over reporting queries. Server tiers ● CDWs give you one size that must fit all when it comes to performance and cost. ● Druid lets you control which data gets ‘hot’ vs. ‘warm’ vs. ‘cold’ performance.
  • 17.
    © Imply DWs +BI tools Our customers have unlocked many new capabilities Clickstream analytics Digital marketing Networking Manufacturing Universal Speed to understanding (real-time queries) Speed to visibility (real-time ingest) Reporting Root cause analysis DoS protection Feature roll-out monitoring Real time monitoring Overspend protection Interactive dashboards Yield fraud analysis Sales enablement (inventory discovery) Traditional BI Commonality analysis Dynamic A/B testing Capacity planning / peering 17
  • 18.
  • 19.