SlideShare a Scribd company logo
1 of 33
Download to read offline
ClickHouse for
Experimentation
Gleb Kanterov
@kanterov
2018-07-03
170M Monthly Active Users
75M Subscribers
35M Tracks
65 Markets
[1] https://investors.spotify.com
Quick Facts
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
AUTONOMY
Hadoop@Spotify
● On-Premise
● 2,500 nodes
● 100 PB Disk
● 100 TB RAM
● 100B+ events per day
● 20K+ jobs per day
Hadoop@Spotify
● Migration from On-Premise to GCP
● Moved 100 PB of data
● Our Hadoop cluster is dead
Hadoop@Spotify
What are
experiments,
and why
ClickHouse?
Randomized
Controlled
Experiment
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
An experiment where all subjects
involved in the experiment are treated
the same except for one deviation.
One variable is changed in order to
isolate the results.
All Khan Academy content is available for free at www.khanacademy.org
A/B Testing
A/B Testing is a randomized controlled experiment where one variable is tested.
E.g., hypothesis Our new recommendation algorithm increases content consumption.
How to verify?
1. Formulate hypothesis
2. Run A/B test
3. See if there is a statistically significant increase in consumption.
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
1. Event Delivery
Developers instrument
applications and services
using SDK.
Events are collected and
published to Pub/Sub.
Batch jobs read data from
Pub/Sub, deduplicate and
anonymize, and then store in
hourly partitions on GCS.
Exposing users to
experiments, and configuring
A/B variations on clients is
done by dedicates services.
Product
Owners
Data
Scientists
Granular Data
BigQuery
1
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
2. Data Pipelines and
Storage
Data gets transformed and
aggregated using Dataflow
batch jobs, and stored in
Bigtable, GCS and BigQuery.
Bigtable contains
pre-computed aggregated
experiment results.
BigQuery has granular data
used in ad-hoc analysis.
Product
Owners
Data
Scientists
Granular Data
BigQuery
2
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
3. Presentation
Users of Experimentation
platform see their experiment
results in web application.
Statistical tests and health
checks are performed
automatically.
Metrics for Experimentation Platform v1
Product
Owners
Data
Scientists
Granular Data
BigQuery
3
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
4. Ad-hoc Analytics
Data scientists do ad-hoc
exploration in Jupyter
notebooks using BigQuery.
Here they answer experiment
specific-questions, not
automatically supported by
experimentation system.
Product
Owners
Data
Scientists
Granular Data
BigQuery
4
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
What works well
Centralized team owning
100-s of core metrics.
Automatic experiment
analysis and planning.
Allows to conclude
experiments without manual
analysis. Autonomous feature
teams can move fast and
iterate on their product.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Problems
Not every metric worths
centralization.
Centralized team became a
bottleneck for Feature
features.
As a result, too much
repetitive work goes into
notebooks and ad-hoc
queries.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Granular Data
OLAP Database
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Reasons
1. Experimentation isn’t only
about hypothesis testing, but
learning from experiments.
Aggregated data in Bigtable
wasn’t granular enough, and
didn't have enough
dimensions.
2. Can’t add a new metric
without involving a central
team.
What we want
Provide teams more granular
data out of the box, and give
a way to define a new metric.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Requirements
● Serve 100-s of QPS with sub-second latency
● We know in advance what are queries and data
● Maintain 10x metrics with the same cost
● Thousands of metrics
● Billions of rows per day in each of 100-s of tables
● Ready to be used out of the box
● Leverage existing infrastructure as much as feasible
● Hide unnecessary complexity from internal users
What about BigQuery?
● Supports Standard SQL
● Don’t have to optimize datasets in advance
● Works great for heavy queries with joins among multiple datasets
● Doesn’t need operations and machines running
● Good for interactive ad-hoc queries (~ minutes)
● Isn’t best for a high amount of low-latency queries you are aware in advance
Why ClickHouse?
● Build proof of concept using various OLAP storages (ClickHouse, Druid, Pinot, ...)
● ClickHouse has the most simple architecture
● Powerful SQL dialect close to Standard SQL
● A comprehensive set of built-in functions and aggregators
● Was ready to be used out of the box
● Superset integration is great
● Easy to query using clickhouse-jdbc and jooq
Event Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Product TeamsStorage
Granular Data
ClickHouse
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Metrics for Experimentation Platform v2
5. ClickHouse
Interactive queries on
granular data.
Reduce demand in notebooks
and BigQuery with
dashboards and exploration
in Superset.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Superset
5
Event Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Product TeamsStorage
Granular Data
ClickHouse
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Metrics for Experimentation Platform v2
6. Metrics Catalog
Centralized place for teams to
define their own metrics.
20 minutes to define a metric. Product
Owners
Data
Scientists
Granular Data
BigQuery
Superset
Metrics Catalog
Metrics API
Metric
definitions
6
What we have built
● Own DSL to define metrics, and centralized metrics catalog
● Expressive and simple model that we can efficiently scale to 1000-s of metrics
● Generalize existing components to work with Metrics DSL
○ data preparation and ingestion into ClickHouse
○ denormalization with conformed dimensions
○ create dashboards, tables and charts in Superset
○ do statistical tests, and expose results through API
○ define ownership, tiering, and other attributes
○ integrates with the rest of infrastructure for alerting, monitoring,
data quality, anomaly detection, access control & etc
● Users don’t work with ClickHouse SQL, or need to know how it works
● API to query metrics and metadata
Ingestion to ClickHouse
● Move data from GCS to ClickHouse
● Use clickhouse-jdbc, custom code and RowBinary format
● Use daily partitioning, and ingest once a day
● 1 hour to ingest 5 TiB on test cluster using 9 n1-standard-32 with 8 NVMe SSD RAID0
● Don’t use materialized views in ClickHouse
● Offload most of computations to batch data pipelines due to scalability, experience and
tooling
● TODO try ClickHouse-Native-JDBC
● TODO pre-sort in data pipelines before ingesting
What is next
● Do lambda-style ingestion for subset of metrics with low-latency requirements
● Add more aggregations to DSL (e.g. 5 statistical moments)
● Add custom chart types to Superset
● Try ClickHouse for similar use cases within Spotify
Using ClickHouse for Experimentation

More Related Content

What's hot

Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouseAltinity Ltd
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howAltinity Ltd
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOAltinity Ltd
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovAltinity Ltd
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Altinity Ltd
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analyticsmason_s
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Altinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouseAltinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouseAltinity Ltd
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Redpanda and ClickHouse
Redpanda and ClickHouseRedpanda and ClickHouse
Redpanda and ClickHouseAltinity Ltd
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOAltinity Ltd
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovAltinity Ltd
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesFlink Forward
 

What's hot (20)

Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and how
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Altinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouseAltinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouse
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Redpanda and ClickHouse
Redpanda and ClickHouseRedpanda and ClickHouse
Redpanda and ClickHouse
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei Milovidov
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 

Similar to Using ClickHouse for Experimentation

Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle Cloud Platform - Japan
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...Márton Kodok
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperMárton Kodok
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analyticsMariaDB plc
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use CasesTatvic Analytics
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query BasicsIdo Green
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptxHarissh16
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Romit Mehta
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarRasel Rana
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentationMichael Young
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)Stratebi
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 

Similar to Using ClickHouse for Experimentation (20)

Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptx
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery Webinar
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentation
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Big query
Big queryBig query
Big query
 
Big Trends in Big Data
Big Trends in Big DataBig Trends in Big Data
Big Trends in Big Data
 

Recently uploaded

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Lecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptLecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptesrabilgic2
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standard1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standardraffietividad53
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 

Recently uploaded (20)

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Lecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptLecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).ppt
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standard1C_PNS.pdf Philippines National standard
1C_PNS.pdf Philippines National standard
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 

Using ClickHouse for Experimentation

  • 2. 170M Monthly Active Users 75M Subscribers 35M Tracks 65 Markets [1] https://investors.spotify.com Quick Facts
  • 3. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads
  • 4. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things
  • 5. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things Ask for forgiveness, not for permission
  • 6. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things Ask for forgiveness, not for permission AUTONOMY
  • 8. ● On-Premise ● 2,500 nodes ● 100 PB Disk ● 100 TB RAM ● 100B+ events per day ● 20K+ jobs per day Hadoop@Spotify
  • 9. ● Migration from On-Premise to GCP ● Moved 100 PB of data ● Our Hadoop cluster is dead Hadoop@Spotify
  • 12. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 13. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 14. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 15. Randomized Controlled Experiment An experiment where all subjects involved in the experiment are treated the same except for one deviation. One variable is changed in order to isolate the results. All Khan Academy content is available for free at www.khanacademy.org
  • 16. A/B Testing A/B Testing is a randomized controlled experiment where one variable is tested. E.g., hypothesis Our new recommendation algorithm increases content consumption. How to verify? 1. Formulate hypothesis 2. Run A/B test 3. See if there is a statistically significant increase in consumption.
  • 17. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Product Owners Data Scientists Granular Data BigQuery
  • 18. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 1. Event Delivery Developers instrument applications and services using SDK. Events are collected and published to Pub/Sub. Batch jobs read data from Pub/Sub, deduplicate and anonymize, and then store in hourly partitions on GCS. Exposing users to experiments, and configuring A/B variations on clients is done by dedicates services. Product Owners Data Scientists Granular Data BigQuery 1
  • 19. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 2. Data Pipelines and Storage Data gets transformed and aggregated using Dataflow batch jobs, and stored in Bigtable, GCS and BigQuery. Bigtable contains pre-computed aggregated experiment results. BigQuery has granular data used in ad-hoc analysis. Product Owners Data Scientists Granular Data BigQuery 2
  • 20. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine 3. Presentation Users of Experimentation platform see their experiment results in web application. Statistical tests and health checks are performed automatically. Metrics for Experimentation Platform v1 Product Owners Data Scientists Granular Data BigQuery 3
  • 21. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 4. Ad-hoc Analytics Data scientists do ad-hoc exploration in Jupyter notebooks using BigQuery. Here they answer experiment specific-questions, not automatically supported by experimentation system. Product Owners Data Scientists Granular Data BigQuery 4
  • 22. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 What works well Centralized team owning 100-s of core metrics. Automatic experiment analysis and planning. Allows to conclude experiments without manual analysis. Autonomous feature teams can move fast and iterate on their product. Product Owners Data Scientists Granular Data BigQuery
  • 23. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Problems Not every metric worths centralization. Centralized team became a bottleneck for Feature features. As a result, too much repetitive work goes into notebooks and ad-hoc queries. Product Owners Data Scientists Granular Data BigQuery
  • 24. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Granular Data OLAP Database Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Reasons 1. Experimentation isn’t only about hypothesis testing, but learning from experiments. Aggregated data in Bigtable wasn’t granular enough, and didn't have enough dimensions. 2. Can’t add a new metric without involving a central team. What we want Provide teams more granular data out of the box, and give a way to define a new metric. Product Owners Data Scientists Granular Data BigQuery
  • 25. Requirements ● Serve 100-s of QPS with sub-second latency ● We know in advance what are queries and data ● Maintain 10x metrics with the same cost ● Thousands of metrics ● Billions of rows per day in each of 100-s of tables ● Ready to be used out of the box ● Leverage existing infrastructure as much as feasible ● Hide unnecessary complexity from internal users
  • 26. What about BigQuery? ● Supports Standard SQL ● Don’t have to optimize datasets in advance ● Works great for heavy queries with joins among multiple datasets ● Doesn’t need operations and machines running ● Good for interactive ad-hoc queries (~ minutes) ● Isn’t best for a high amount of low-latency queries you are aware in advance
  • 27. Why ClickHouse? ● Build proof of concept using various OLAP storages (ClickHouse, Druid, Pinot, ...) ● ClickHouse has the most simple architecture ● Powerful SQL dialect close to Standard SQL ● A comprehensive set of built-in functions and aggregators ● Was ready to be used out of the box ● Superset integration is great ● Easy to query using clickhouse-jdbc and jooq
  • 28. Event Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Product TeamsStorage Granular Data ClickHouse Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Metrics for Experimentation Platform v2 5. ClickHouse Interactive queries on granular data. Reduce demand in notebooks and BigQuery with dashboards and exploration in Superset. Product Owners Data Scientists Granular Data BigQuery Superset 5
  • 29. Event Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Product TeamsStorage Granular Data ClickHouse Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Metrics for Experimentation Platform v2 6. Metrics Catalog Centralized place for teams to define their own metrics. 20 minutes to define a metric. Product Owners Data Scientists Granular Data BigQuery Superset Metrics Catalog Metrics API Metric definitions 6
  • 30. What we have built ● Own DSL to define metrics, and centralized metrics catalog ● Expressive and simple model that we can efficiently scale to 1000-s of metrics ● Generalize existing components to work with Metrics DSL ○ data preparation and ingestion into ClickHouse ○ denormalization with conformed dimensions ○ create dashboards, tables and charts in Superset ○ do statistical tests, and expose results through API ○ define ownership, tiering, and other attributes ○ integrates with the rest of infrastructure for alerting, monitoring, data quality, anomaly detection, access control & etc ● Users don’t work with ClickHouse SQL, or need to know how it works ● API to query metrics and metadata
  • 31. Ingestion to ClickHouse ● Move data from GCS to ClickHouse ● Use clickhouse-jdbc, custom code and RowBinary format ● Use daily partitioning, and ingest once a day ● 1 hour to ingest 5 TiB on test cluster using 9 n1-standard-32 with 8 NVMe SSD RAID0 ● Don’t use materialized views in ClickHouse ● Offload most of computations to batch data pipelines due to scalability, experience and tooling ● TODO try ClickHouse-Native-JDBC ● TODO pre-sort in data pipelines before ingesting
  • 32. What is next ● Do lambda-style ingestion for subset of metrics with low-latency requirements ● Add more aggregations to DSL (e.g. 5 statistical moments) ● Add custom chart types to Superset ● Try ClickHouse for similar use cases within Spotify