SlideShare a Scribd company logo
ClickHouse for
Experimentation
Gleb Kanterov
@kanterov
2018-07-03
170M Monthly Active Users
75M Subscribers
35M Tracks
65 Markets
[1] https://investors.spotify.com
Quick Facts
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
AUTONOMY
Hadoop@Spotify
● On-Premise
● 2,500 nodes
● 100 PB Disk
● 100 TB RAM
● 100B+ events per day
● 20K+ jobs per day
Hadoop@Spotify
● Migration from On-Premise to GCP
● Moved 100 PB of data
● Our Hadoop cluster is dead
Hadoop@Spotify
What are
experiments,
and why
ClickHouse?
Randomized
Controlled
Experiment
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
An experiment where all subjects
involved in the experiment are treated
the same except for one deviation.
One variable is changed in order to
isolate the results.
All Khan Academy content is available for free at www.khanacademy.org
A/B Testing
A/B Testing is a randomized controlled experiment where one variable is tested.
E.g., hypothesis Our new recommendation algorithm increases content consumption.
How to verify?
1. Formulate hypothesis
2. Run A/B test
3. See if there is a statistically significant increase in consumption.
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
1. Event Delivery
Developers instrument
applications and services
using SDK.
Events are collected and
published to Pub/Sub.
Batch jobs read data from
Pub/Sub, deduplicate and
anonymize, and then store in
hourly partitions on GCS.
Exposing users to
experiments, and configuring
A/B variations on clients is
done by dedicates services.
Product
Owners
Data
Scientists
Granular Data
BigQuery
1
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
2. Data Pipelines and
Storage
Data gets transformed and
aggregated using Dataflow
batch jobs, and stored in
Bigtable, GCS and BigQuery.
Bigtable contains
pre-computed aggregated
experiment results.
BigQuery has granular data
used in ad-hoc analysis.
Product
Owners
Data
Scientists
Granular Data
BigQuery
2
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
3. Presentation
Users of Experimentation
platform see their experiment
results in web application.
Statistical tests and health
checks are performed
automatically.
Metrics for Experimentation Platform v1
Product
Owners
Data
Scientists
Granular Data
BigQuery
3
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
4. Ad-hoc Analytics
Data scientists do ad-hoc
exploration in Jupyter
notebooks using BigQuery.
Here they answer experiment
specific-questions, not
automatically supported by
experimentation system.
Product
Owners
Data
Scientists
Granular Data
BigQuery
4
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
What works well
Centralized team owning
100-s of core metrics.
Automatic experiment
analysis and planning.
Allows to conclude
experiments without manual
analysis. Autonomous feature
teams can move fast and
iterate on their product.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Problems
Not every metric worths
centralization.
Centralized team became a
bottleneck for Feature
features.
As a result, too much
repetitive work goes into
notebooks and ad-hoc
queries.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Granular Data
OLAP Database
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Reasons
1. Experimentation isn’t only
about hypothesis testing, but
learning from experiments.
Aggregated data in Bigtable
wasn’t granular enough, and
didn't have enough
dimensions.
2. Can’t add a new metric
without involving a central
team.
What we want
Provide teams more granular
data out of the box, and give
a way to define a new metric.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Requirements
● Serve 100-s of QPS with sub-second latency
● We know in advance what are queries and data
● Maintain 10x metrics with the same cost
● Thousands of metrics
● Billions of rows per day in each of 100-s of tables
● Ready to be used out of the box
● Leverage existing infrastructure as much as feasible
● Hide unnecessary complexity from internal users
What about BigQuery?
● Supports Standard SQL
● Don’t have to optimize datasets in advance
● Works great for heavy queries with joins among multiple datasets
● Doesn’t need operations and machines running
● Good for interactive ad-hoc queries (~ minutes)
● Isn’t best for a high amount of low-latency queries you are aware in advance
Why ClickHouse?
● Build proof of concept using various OLAP storages (ClickHouse, Druid, Pinot, ...)
● ClickHouse has the most simple architecture
● Powerful SQL dialect close to Standard SQL
● A comprehensive set of built-in functions and aggregators
● Was ready to be used out of the box
● Superset integration is great
● Easy to query using clickhouse-jdbc and jooq
Event Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Product TeamsStorage
Granular Data
ClickHouse
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Metrics for Experimentation Platform v2
5. ClickHouse
Interactive queries on
granular data.
Reduce demand in notebooks
and BigQuery with
dashboards and exploration
in Superset.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Superset
5
Event Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Product TeamsStorage
Granular Data
ClickHouse
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Metrics for Experimentation Platform v2
6. Metrics Catalog
Centralized place for teams to
define their own metrics.
20 minutes to define a metric. Product
Owners
Data
Scientists
Granular Data
BigQuery
Superset
Metrics Catalog
Metrics API
Metric
definitions
6
What we have built
● Own DSL to define metrics, and centralized metrics catalog
● Expressive and simple model that we can efficiently scale to 1000-s of metrics
● Generalize existing components to work with Metrics DSL
○ data preparation and ingestion into ClickHouse
○ denormalization with conformed dimensions
○ create dashboards, tables and charts in Superset
○ do statistical tests, and expose results through API
○ define ownership, tiering, and other attributes
○ integrates with the rest of infrastructure for alerting, monitoring,
data quality, anomaly detection, access control & etc
● Users don’t work with ClickHouse SQL, or need to know how it works
● API to query metrics and metadata
Ingestion to ClickHouse
● Move data from GCS to ClickHouse
● Use clickhouse-jdbc, custom code and RowBinary format
● Use daily partitioning, and ingest once a day
● 1 hour to ingest 5 TiB on test cluster using 9 n1-standard-32 with 8 NVMe SSD RAID0
● Don’t use materialized views in ClickHouse
● Offload most of computations to batch data pipelines due to scalability, experience and
tooling
● TODO try ClickHouse-Native-JDBC
● TODO pre-sort in data pipelines before ingesting
What is next
● Do lambda-style ingestion for subset of metrics with low-latency requirements
● Add more aggregations to DSL (e.g. 5 statistical moments)
● Add custom chart types to Superset
● Try ClickHouse for similar use cases within Spotify
Using ClickHouse for Experimentation

More Related Content

What's hot

10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse
rpolat
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
Altinity Ltd
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
Altinity Ltd
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
Vianney FOUCAULT
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
Alkin Tezuysal
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and how
Altinity Ltd
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
StreamNative
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
Altinity Ltd
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Altinity Ltd
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Altinity Ltd
 
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesWebinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Altinity Ltd
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
HostedbyConfluent
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
Altinity Ltd
 

What's hot (20)

10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and how
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
 
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesWebinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
 

Similar to Using ClickHouse for Experimentation

Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google Cloud Platform - Japan
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
Márton Kodok
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
Márton Kodok
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
MariaDB plc
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
Tatvic Analytics
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
Ido Green
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptx
Harissh16
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018
Romit Mehta
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Jaroslav Gergic
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery Webinar
Rasel Rana
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentation
Michael Young
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Alluxio, Inc.
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
Márton Kodok
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
Stratebi
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
DataWorks Summit/Hadoop Summit
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
Biswajit Das
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
Big query
Big queryBig query
Big query
Tanvi Parikh
 
Big Trends in Big Data
Big Trends in Big DataBig Trends in Big Data
Big Trends in Big Data
Naresh Chintalcheru
 

Similar to Using ClickHouse for Experimentation (20)

Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptx
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery Webinar
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentation
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Big query
Big queryBig query
Big query
 
Big Trends in Big Data
Big Trends in Big DataBig Trends in Big Data
Big Trends in Big Data
 

Recently uploaded

GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
316895207-SAP-Oil-and-Gas-Downstream-Training.pptx
316895207-SAP-Oil-and-Gas-Downstream-Training.pptx316895207-SAP-Oil-and-Gas-Downstream-Training.pptx
316895207-SAP-Oil-and-Gas-Downstream-Training.pptx
ssuserad3af4
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Envertis Software Solutions
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
Bert Jan Schrijver
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
Ayan Halder
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
Top 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptxTop 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptx
devvsandy
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
Yara Milbes
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
Mobile app Development Services | Drona Infotech
Mobile app Development Services  | Drona InfotechMobile app Development Services  | Drona Infotech
Mobile app Development Services | Drona Infotech
Drona Infotech
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
ToXSL Technologies
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
Marcin Chrost
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 

Recently uploaded (20)

GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
316895207-SAP-Oil-and-Gas-Downstream-Training.pptx
316895207-SAP-Oil-and-Gas-Downstream-Training.pptx316895207-SAP-Oil-and-Gas-Downstream-Training.pptx
316895207-SAP-Oil-and-Gas-Downstream-Training.pptx
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
 
Top 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptxTop 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptx
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
Mobile app Development Services | Drona Infotech
Mobile app Development Services  | Drona InfotechMobile app Development Services  | Drona Infotech
Mobile app Development Services | Drona Infotech
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 

Using ClickHouse for Experimentation

  • 2. 170M Monthly Active Users 75M Subscribers 35M Tracks 65 Markets [1] https://investors.spotify.com Quick Facts
  • 3. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads
  • 4. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things
  • 5. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things Ask for forgiveness, not for permission
  • 6. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things Ask for forgiveness, not for permission AUTONOMY
  • 8. ● On-Premise ● 2,500 nodes ● 100 PB Disk ● 100 TB RAM ● 100B+ events per day ● 20K+ jobs per day Hadoop@Spotify
  • 9. ● Migration from On-Premise to GCP ● Moved 100 PB of data ● Our Hadoop cluster is dead Hadoop@Spotify
  • 12. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 13. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 14. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 15. Randomized Controlled Experiment An experiment where all subjects involved in the experiment are treated the same except for one deviation. One variable is changed in order to isolate the results. All Khan Academy content is available for free at www.khanacademy.org
  • 16. A/B Testing A/B Testing is a randomized controlled experiment where one variable is tested. E.g., hypothesis Our new recommendation algorithm increases content consumption. How to verify? 1. Formulate hypothesis 2. Run A/B test 3. See if there is a statistically significant increase in consumption.
  • 17. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Product Owners Data Scientists Granular Data BigQuery
  • 18. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 1. Event Delivery Developers instrument applications and services using SDK. Events are collected and published to Pub/Sub. Batch jobs read data from Pub/Sub, deduplicate and anonymize, and then store in hourly partitions on GCS. Exposing users to experiments, and configuring A/B variations on clients is done by dedicates services. Product Owners Data Scientists Granular Data BigQuery 1
  • 19. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 2. Data Pipelines and Storage Data gets transformed and aggregated using Dataflow batch jobs, and stored in Bigtable, GCS and BigQuery. Bigtable contains pre-computed aggregated experiment results. BigQuery has granular data used in ad-hoc analysis. Product Owners Data Scientists Granular Data BigQuery 2
  • 20. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine 3. Presentation Users of Experimentation platform see their experiment results in web application. Statistical tests and health checks are performed automatically. Metrics for Experimentation Platform v1 Product Owners Data Scientists Granular Data BigQuery 3
  • 21. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 4. Ad-hoc Analytics Data scientists do ad-hoc exploration in Jupyter notebooks using BigQuery. Here they answer experiment specific-questions, not automatically supported by experimentation system. Product Owners Data Scientists Granular Data BigQuery 4
  • 22. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 What works well Centralized team owning 100-s of core metrics. Automatic experiment analysis and planning. Allows to conclude experiments without manual analysis. Autonomous feature teams can move fast and iterate on their product. Product Owners Data Scientists Granular Data BigQuery
  • 23. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Problems Not every metric worths centralization. Centralized team became a bottleneck for Feature features. As a result, too much repetitive work goes into notebooks and ad-hoc queries. Product Owners Data Scientists Granular Data BigQuery
  • 24. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Granular Data OLAP Database Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Reasons 1. Experimentation isn’t only about hypothesis testing, but learning from experiments. Aggregated data in Bigtable wasn’t granular enough, and didn't have enough dimensions. 2. Can’t add a new metric without involving a central team. What we want Provide teams more granular data out of the box, and give a way to define a new metric. Product Owners Data Scientists Granular Data BigQuery
  • 25. Requirements ● Serve 100-s of QPS with sub-second latency ● We know in advance what are queries and data ● Maintain 10x metrics with the same cost ● Thousands of metrics ● Billions of rows per day in each of 100-s of tables ● Ready to be used out of the box ● Leverage existing infrastructure as much as feasible ● Hide unnecessary complexity from internal users
  • 26. What about BigQuery? ● Supports Standard SQL ● Don’t have to optimize datasets in advance ● Works great for heavy queries with joins among multiple datasets ● Doesn’t need operations and machines running ● Good for interactive ad-hoc queries (~ minutes) ● Isn’t best for a high amount of low-latency queries you are aware in advance
  • 27. Why ClickHouse? ● Build proof of concept using various OLAP storages (ClickHouse, Druid, Pinot, ...) ● ClickHouse has the most simple architecture ● Powerful SQL dialect close to Standard SQL ● A comprehensive set of built-in functions and aggregators ● Was ready to be used out of the box ● Superset integration is great ● Easy to query using clickhouse-jdbc and jooq
  • 28. Event Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Product TeamsStorage Granular Data ClickHouse Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Metrics for Experimentation Platform v2 5. ClickHouse Interactive queries on granular data. Reduce demand in notebooks and BigQuery with dashboards and exploration in Superset. Product Owners Data Scientists Granular Data BigQuery Superset 5
  • 29. Event Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Product TeamsStorage Granular Data ClickHouse Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Metrics for Experimentation Platform v2 6. Metrics Catalog Centralized place for teams to define their own metrics. 20 minutes to define a metric. Product Owners Data Scientists Granular Data BigQuery Superset Metrics Catalog Metrics API Metric definitions 6
  • 30. What we have built ● Own DSL to define metrics, and centralized metrics catalog ● Expressive and simple model that we can efficiently scale to 1000-s of metrics ● Generalize existing components to work with Metrics DSL ○ data preparation and ingestion into ClickHouse ○ denormalization with conformed dimensions ○ create dashboards, tables and charts in Superset ○ do statistical tests, and expose results through API ○ define ownership, tiering, and other attributes ○ integrates with the rest of infrastructure for alerting, monitoring, data quality, anomaly detection, access control & etc ● Users don’t work with ClickHouse SQL, or need to know how it works ● API to query metrics and metadata
  • 31. Ingestion to ClickHouse ● Move data from GCS to ClickHouse ● Use clickhouse-jdbc, custom code and RowBinary format ● Use daily partitioning, and ingest once a day ● 1 hour to ingest 5 TiB on test cluster using 9 n1-standard-32 with 8 NVMe SSD RAID0 ● Don’t use materialized views in ClickHouse ● Offload most of computations to batch data pipelines due to scalability, experience and tooling ● TODO try ClickHouse-Native-JDBC ● TODO pre-sort in data pipelines before ingesting
  • 32. What is next ● Do lambda-style ingestion for subset of metrics with low-latency requirements ● Add more aggregations to DSL (e.g. 5 statistical moments) ● Add custom chart types to Superset ● Try ClickHouse for similar use cases within Spotify