SlideShare a Scribd company logo
1 of 36
Download to read offline
Scaling Data Analytics
Workloads on Databricks
Spark + AI Summit, Amsterdam
1
Chris Stevens and Bogdan Ghit
October 17, 2019
2
Bogdan Ghit
• Software Engineer @ Databricks - BI Team
• PhD in datacenter scheduling @ TU Delft
Chris Stevens
• Software Engineer @ Databricks - Serverless Team
• Spent ~10 years doing kernel development
3
AliceBob
A day in the life ...
● Minimize costs
● Simplify maintenance
● Stable system
● Diverse array of tools
● Interactive queries
Bob
Alice
4
Reality ...
Alice
Alice
Bob
5
This Talk
Control Plane
Connectors
Resource management
Provisioning
6
BI Connectors
Control Plane
7
Tableau on Databricks
Client machine
BI tools SparkODBC/JDBC
driver
Webapp
Proxy layer
to/from Spark
clusters
Thrift
ODBC/JDBC
server
Shard Databricks Runtime
8
Event Flow
BI tools SparkODBC/JDBC
driver
ODBC/JDBC
server
Tableau session
Read
metadata
Query execution
construct
protocol
destruct
protocol
Read
metadata
Get
schema
Query
execution
Query
execution
Fetch
data
Get
schema
Webapp
9
Behind the Scenes
Login to Tableau
First cold run of
a TPC-DS query
Multiple warm runs of
a TPC-DS query
The Databricks BI Stack
10
Single-tenant shard
sql/protocolv1/o/0/clusterI
d
Control Plane
Webapp
LB
RDS
CM
Data Plane
W
W
W
Catalyst
Metastore
Scheduler
Spark Driver
SQLProxy
MySQL
Thrift
11
Tableau Connector SDK
Plugin Connector
Manifest File
Custom Dialog File
connectionBuilder
connectionProperties
connectionMatcher
connectionRequired
Connection Resolver
SQL Dialect
Connection String
12
Simplified Connection Dialog
13
Controlling the Dialect
<function group='string' name='SPLIT' return-type='str'>
<formula> CASE
WHEN (%1 IS NULL) THEN
CAST(NULL AS STRING)
WHEN NOT (%3 IS NULL) THEN
COALESCE(
(CASE WHEN %3 &gt; 0 THEN SPLIT(%1, '%2')[%3-1]
ELSE SPLIT(
REVERSE(%1),
REVERSE('%2'))[ABS(%3)-1] END), '')
ELSE NULL END
</formula>
<argument type='str' />
<argument type='localstr' />
<argument type='localint' />
</function>
Achieved 100% compliance with TDVT standard testing
Missing operators: IN_SET, ATAN, CHAR, RSPLIT
Hive dialect wraps DATENAME in COALESCE operators
Strategy to determine if two values are distinct
The datasource supports booleans natively
CASE-WHEN statements should be of boolean type
Polling for Query Results
First poll after 100 ms causing high-latency
for short-lived metadata queries
Blocks for 5 sec if
query is not finished
Thrift
Sends async
polls for result
Async Poll
Cuts in half latency by lowering the polling interval
Metadata Queries are Expensive
Each query triggers 6 round-trips from Tableau to Thrift
We optimize the sequence of metadata queries by retrieving
all needed metadata in one go
SHOWSCHEMAS
SHOWTABLES
DESCtable
DESCtable
Thrift
16
Data Source Connection Latency
New connector delivers 1.7-5x lower latency
17
Resource Management
Control Plane
Execution Time Optimized
18
executors
TPCDS q80-v2.4 every 20 minutes
Fixed, 20 Worker Cluster
Execution Time: 315 seconds
Cost: $14.95 per hour
Cluster Start
Cost Optimized
19
executors
TPCDS q80-v2.4 every 20 minutes
Auto-terminating, 20 Worker Cluster
Cluster Starts
Execution Time: 613 seconds
Cost: $7.66 per hour
Cost vs Execution Time
20
21
Cloud
Provider
Cluster Start Path
1. Cloud Instance
Allocation Requests
2. Instance allocation and setup (55 seconds)
3. Launch Containers (30 secs)
Cluster
Manager 4. Init Spark (15 secs)
Node Start Time
22
● Up to 55 seconds to acquire an
instance from the cloud provider and
initialize it.
● Up to 30 seconds to launch a
container (includes downloading
DBR)
● ~15 seconds to start master/worker
● Median for cluster starts is 2 mins
and 22 seconds.
23
Cloud
Provider
Up-scaling
1. Scheduler Info
2. Cloud Instance
Allocation Requests
3. Cluster Expansion (55 seconds)
4. Launch Containers (30 secs)5. Init Spark (15 secs)
Cluster
Manager
24
Cloud
Provider
Down-scaling
Cluster
Manager
1. Scheduler Info
2. Remove Instance
From Cluster
3. Idle Instance Reclaimed
Basic Autoscaling
25
executors
TPCDS q80-v2.4 every 20 minutes
Standard Autoscaling, 20 Worker Cluster
Execution Time: 318 seconds
Cost: $14.24 per hour
Cluster Start
executors
Optimized Autoscaling
26
executors
TPCDS q80-v2.4 every 20 minutes
Optimized Autoscaling, 20 Worker Cluster
“spark.databricks.aggressiveWindowDownS” -> “40”
Execution Time: 703 seconds
Cost: $7.27 per hour
Cluster Start
Cost vs Execution Time
27
28
Cloud
Provider
Up-scaling with Instance Pools
Cluster
Manager
1. Scheduler Info
Idle Instances
Pool Manager
Cloud Instance
Allocation Requests
(background)
Pool Expansion (55 seconds)
(background)
2. Pool Instance
Allocation Requests
4. Cluster Expansion (~7ms)
3. Pool Instance Allocation
5. Launch Containers and
start Spark (25 secs)
29
Cloud
Provider
Down-scaling with Instance Pools
Cluster
Manager
1. Scheduler Info
Idle Instances
Pool Manager
Cloud Instance
Termination Requests
(background)
Pool Retraction
(background)
3. Idle instance to pool
2. Remove Instance
From Cluster
Node Start Time with Pools
30
● Instance allocation is mostly gone
(milliseconds).
● ~10 second container start (DBR
already downloaded by the pool).
● ~15 seconds to start master/worker
● Median cluster starts is 50 seconds.
2.84x faster.
Optimized Autoscaling with Fixed Pool
31
executors
TPCDS q80-v2.4 every 20 minutes
Optimized Autoscaling w/ Fixed Pool, 20 Worker Cluster
“spark.databricks.aggressiveWindowDownS” -> “40”
Execution Time: 427 seconds
Cost: $10.05 per hour
Cluster Start
Optimized Autoscaling with Dynamic Pool
32
executors
TPCDS q80-v2.4 every 20 minutes
Optimized Autoscaling w/ Dynamic Pool, 20 Worker Cluster
“spark.databricks.aggressiveWindowDownS” -> “40”
Pool has 5 min idle with 10 minute idle timeout.
Execution Time: 557 seconds
Cost: $10.05 per hour
Cluster Start
Cost vs Execution Time
33
Conclusions
Spark is at the core but there’s a lot more
around it to bring it into production
1. Latency-aware integration
2. Efficient resource usage
3. Fast provisioning of resources
34
Control Plane
35
Thanks!
Chris Stevens - linkedin.com/in/chriscstevens
Bogdan Ghit - linkedin.com/in/bogdanghit
Comparison
36
Fixed
Cluster
Ephemeral
Cluster
Standard
Autoscaling
Optimized
Autoscaling
Fixed Pool Dynamic
Pool
Start Latency 218s -4s +499s +434s -27s +280s
Repeat Latency 15s +200s +164s +411s +71s +251s
Start Duration 407s -8s +323s +278s +75s +283s
Repeat Duration 315s +84s +3s +352s +123s +186s
EC2 Cost $6.55 -$3.19 -$1.04 -$2.44 -$0 -$0.38
DBU Cost $8.40 -$4.10 -$1.33 -$3.13 $-3.83 -$3.35

More Related Content

What's hot

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaSpringPeople
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseDatabricks
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetHostedbyConfluent
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Amazon Web Services
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorDatabricks
 

What's hot (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
 
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetStreaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 

Similar to Scaling Data Analytics Workloads on Databricks

Mininet: Moving Forward
Mininet: Moving ForwardMininet: Moving Forward
Mininet: Moving ForwardON.Lab
 
Sergey Dzyuban "To Build My Own Cloud with Blackjack…"
Sergey Dzyuban "To Build My Own Cloud with Blackjack…"Sergey Dzyuban "To Build My Own Cloud with Blackjack…"
Sergey Dzyuban "To Build My Own Cloud with Blackjack…"Fwdays
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureDataStax Academy
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffTimescale
 
Druid at naver.com - part 1
Druid at naver.com - part 1Druid at naver.com - part 1
Druid at naver.com - part 1Jungsu Heo
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Ankit Gupta
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Big Data Spain
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksMatthias Niehoff
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoEnkitec
 
Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28Xavier Lucas
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKel Cecil
 
Windows Azure Acid Test
Windows Azure Acid TestWindows Azure Acid Test
Windows Azure Acid Testexpanz
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkDemi Ben-Ari
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…Sergey Dzyuban
 
Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...
Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...
Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...VMware Tanzu
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Spark Summit
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 

Similar to Scaling Data Analytics Workloads on Databricks (20)

Mininet: Moving Forward
Mininet: Moving ForwardMininet: Moving Forward
Mininet: Moving Forward
 
Sergey Dzyuban "To Build My Own Cloud with Blackjack…"
Sergey Dzyuban "To Build My Own Cloud with Blackjack…"Sergey Dzyuban "To Build My Own Cloud with Blackjack…"
Sergey Dzyuban "To Build My Own Cloud with Blackjack…"
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & Azure
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
 
Druid at naver.com - part 1
Druid at naver.com - part 1Druid at naver.com - part 1
Druid at naver.com - part 1
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
 
Data Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and FrameworksData Stream Processing - Concepts and Frameworks
Data Stream Processing - Concepts and Frameworks
 
Vinetalk: The missing piece for cluster managers to enable accelerator sharing
Vinetalk: The missing piece for cluster managers to enable accelerator sharingVinetalk: The missing piece for cluster managers to enable accelerator sharing
Vinetalk: The missing piece for cluster managers to enable accelerator sharing
 
A Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl AraoA Consolidation Success Story by Karl Arao
A Consolidation Success Story by Karl Arao
 
Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of Containers
 
Windows Azure Acid Test
Windows Azure Acid TestWindows Azure Acid Test
Windows Azure Acid Test
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using spark
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…
 
Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...
Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...
Greenplum Kontained: Coordinating Many PostgreSQL Instances on Kubernetes: Cl...
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...Suhani Kapoor
 
VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...
VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...
VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...Suhani Kapoor
 
VIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service Cuttack
VIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service CuttackVIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service Cuttack
VIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service CuttackSuhani Kapoor
 
Storytelling, Ethics and Workflow in Documentary Photography
Storytelling, Ethics and Workflow in Documentary PhotographyStorytelling, Ethics and Workflow in Documentary Photography
Storytelling, Ethics and Workflow in Documentary PhotographyOrtega Alikwe
 
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...Suhani Kapoor
 
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
Call Girl in Low Price Delhi Punjabi Bagh  9711199012Call Girl in Low Price Delhi Punjabi Bagh  9711199012
Call Girl in Low Price Delhi Punjabi Bagh 9711199012sapnasaifi408
 
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...Suhani Kapoor
 
加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdf
加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdf加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdf
加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdfobuhobo
 
办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一
办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一
办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一A SSS
 
do's and don'ts in Telephone Interview of Job
do's and don'ts in Telephone Interview of Jobdo's and don'ts in Telephone Interview of Job
do's and don'ts in Telephone Interview of JobRemote DBA Services
 
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...Suhani Kapoor
 
How to Find the Best NEET Coaching in Indore (2).pdf
How to Find the Best NEET Coaching in Indore (2).pdfHow to Find the Best NEET Coaching in Indore (2).pdf
How to Find the Best NEET Coaching in Indore (2).pdfmayank158542
 
VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...
VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...
VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...Suhani Kapoor
 
Gray Gold Clean CV Resume2024tod (1).pdf
Gray Gold Clean CV Resume2024tod (1).pdfGray Gold Clean CV Resume2024tod (1).pdf
Gray Gold Clean CV Resume2024tod (1).pdfpadillaangelina0023
 
定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一
定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一
定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一Fs
 
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...Suhani Kapoor
 
定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一
 定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一 定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一
定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一Fs sss
 
PM Job Search Council Info Session - PMI Silver Spring Chapter
PM Job Search Council Info Session - PMI Silver Spring ChapterPM Job Search Council Info Session - PMI Silver Spring Chapter
PM Job Search Council Info Session - PMI Silver Spring ChapterHector Del Castillo, CPM, CPMM
 

Recently uploaded (20)

VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
 
Call Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCe
Call Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCeCall Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCe
Call Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCe
 
VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...
VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...
VIP Call Girls in Jamshedpur Aarohi 8250192130 Independent Escort Service Jam...
 
VIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service Cuttack
VIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service CuttackVIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service Cuttack
VIP Call Girls in Cuttack Aarohi 8250192130 Independent Escort Service Cuttack
 
Storytelling, Ethics and Workflow in Documentary Photography
Storytelling, Ethics and Workflow in Documentary PhotographyStorytelling, Ethics and Workflow in Documentary Photography
Storytelling, Ethics and Workflow in Documentary Photography
 
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
 
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
Call Girl in Low Price Delhi Punjabi Bagh  9711199012Call Girl in Low Price Delhi Punjabi Bagh  9711199012
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
 
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
 
加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdf
加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdf加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdf
加利福尼亚大学伯克利分校硕士毕业证成绩单(价格咨询)学位证书pdf
 
办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一
办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一
办理学位证(UoM证书)北安普顿大学毕业证成绩单原版一比一
 
do's and don'ts in Telephone Interview of Job
do's and don'ts in Telephone Interview of Jobdo's and don'ts in Telephone Interview of Job
do's and don'ts in Telephone Interview of Job
 
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
 
FULL ENJOY Call Girls In Gautam Nagar (Delhi) Call Us 9953056974
FULL ENJOY Call Girls In Gautam Nagar (Delhi) Call Us 9953056974FULL ENJOY Call Girls In Gautam Nagar (Delhi) Call Us 9953056974
FULL ENJOY Call Girls In Gautam Nagar (Delhi) Call Us 9953056974
 
How to Find the Best NEET Coaching in Indore (2).pdf
How to Find the Best NEET Coaching in Indore (2).pdfHow to Find the Best NEET Coaching in Indore (2).pdf
How to Find the Best NEET Coaching in Indore (2).pdf
 
VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...
VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...
VIP Russian Call Girls in Bhilai Deepika 8250192130 Independent Escort Servic...
 
Gray Gold Clean CV Resume2024tod (1).pdf
Gray Gold Clean CV Resume2024tod (1).pdfGray Gold Clean CV Resume2024tod (1).pdf
Gray Gold Clean CV Resume2024tod (1).pdf
 
定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一
定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一
定制(Waikato毕业证书)新西兰怀卡托大学毕业证成绩单原版一比一
 
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
 
定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一
 定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一 定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一
定制(UOIT学位证)加拿大安大略理工大学毕业证成绩单原版一比一
 
PM Job Search Council Info Session - PMI Silver Spring Chapter
PM Job Search Council Info Session - PMI Silver Spring ChapterPM Job Search Council Info Session - PMI Silver Spring Chapter
PM Job Search Council Info Session - PMI Silver Spring Chapter
 

Scaling Data Analytics Workloads on Databricks

  • 1. Scaling Data Analytics Workloads on Databricks Spark + AI Summit, Amsterdam 1 Chris Stevens and Bogdan Ghit October 17, 2019
  • 2. 2 Bogdan Ghit • Software Engineer @ Databricks - BI Team • PhD in datacenter scheduling @ TU Delft Chris Stevens • Software Engineer @ Databricks - Serverless Team • Spent ~10 years doing kernel development
  • 3. 3 AliceBob A day in the life ... ● Minimize costs ● Simplify maintenance ● Stable system ● Diverse array of tools ● Interactive queries Bob Alice
  • 7. 7 Tableau on Databricks Client machine BI tools SparkODBC/JDBC driver Webapp Proxy layer to/from Spark clusters Thrift ODBC/JDBC server Shard Databricks Runtime
  • 8. 8 Event Flow BI tools SparkODBC/JDBC driver ODBC/JDBC server Tableau session Read metadata Query execution construct protocol destruct protocol Read metadata Get schema Query execution Query execution Fetch data Get schema Webapp
  • 9. 9 Behind the Scenes Login to Tableau First cold run of a TPC-DS query Multiple warm runs of a TPC-DS query
  • 10. The Databricks BI Stack 10 Single-tenant shard sql/protocolv1/o/0/clusterI d Control Plane Webapp LB RDS CM Data Plane W W W Catalyst Metastore Scheduler Spark Driver SQLProxy MySQL Thrift
  • 11. 11 Tableau Connector SDK Plugin Connector Manifest File Custom Dialog File connectionBuilder connectionProperties connectionMatcher connectionRequired Connection Resolver SQL Dialect Connection String
  • 13. 13 Controlling the Dialect <function group='string' name='SPLIT' return-type='str'> <formula> CASE WHEN (%1 IS NULL) THEN CAST(NULL AS STRING) WHEN NOT (%3 IS NULL) THEN COALESCE( (CASE WHEN %3 &gt; 0 THEN SPLIT(%1, '%2')[%3-1] ELSE SPLIT( REVERSE(%1), REVERSE('%2'))[ABS(%3)-1] END), '') ELSE NULL END </formula> <argument type='str' /> <argument type='localstr' /> <argument type='localint' /> </function> Achieved 100% compliance with TDVT standard testing Missing operators: IN_SET, ATAN, CHAR, RSPLIT Hive dialect wraps DATENAME in COALESCE operators Strategy to determine if two values are distinct The datasource supports booleans natively CASE-WHEN statements should be of boolean type
  • 14. Polling for Query Results First poll after 100 ms causing high-latency for short-lived metadata queries Blocks for 5 sec if query is not finished Thrift Sends async polls for result Async Poll Cuts in half latency by lowering the polling interval
  • 15. Metadata Queries are Expensive Each query triggers 6 round-trips from Tableau to Thrift We optimize the sequence of metadata queries by retrieving all needed metadata in one go SHOWSCHEMAS SHOWTABLES DESCtable DESCtable Thrift
  • 16. 16 Data Source Connection Latency New connector delivers 1.7-5x lower latency
  • 18. Execution Time Optimized 18 executors TPCDS q80-v2.4 every 20 minutes Fixed, 20 Worker Cluster Execution Time: 315 seconds Cost: $14.95 per hour Cluster Start
  • 19. Cost Optimized 19 executors TPCDS q80-v2.4 every 20 minutes Auto-terminating, 20 Worker Cluster Cluster Starts Execution Time: 613 seconds Cost: $7.66 per hour
  • 20. Cost vs Execution Time 20
  • 21. 21 Cloud Provider Cluster Start Path 1. Cloud Instance Allocation Requests 2. Instance allocation and setup (55 seconds) 3. Launch Containers (30 secs) Cluster Manager 4. Init Spark (15 secs)
  • 22. Node Start Time 22 ● Up to 55 seconds to acquire an instance from the cloud provider and initialize it. ● Up to 30 seconds to launch a container (includes downloading DBR) ● ~15 seconds to start master/worker ● Median for cluster starts is 2 mins and 22 seconds.
  • 23. 23 Cloud Provider Up-scaling 1. Scheduler Info 2. Cloud Instance Allocation Requests 3. Cluster Expansion (55 seconds) 4. Launch Containers (30 secs)5. Init Spark (15 secs) Cluster Manager
  • 24. 24 Cloud Provider Down-scaling Cluster Manager 1. Scheduler Info 2. Remove Instance From Cluster 3. Idle Instance Reclaimed
  • 25. Basic Autoscaling 25 executors TPCDS q80-v2.4 every 20 minutes Standard Autoscaling, 20 Worker Cluster Execution Time: 318 seconds Cost: $14.24 per hour Cluster Start executors
  • 26. Optimized Autoscaling 26 executors TPCDS q80-v2.4 every 20 minutes Optimized Autoscaling, 20 Worker Cluster “spark.databricks.aggressiveWindowDownS” -> “40” Execution Time: 703 seconds Cost: $7.27 per hour Cluster Start
  • 27. Cost vs Execution Time 27
  • 28. 28 Cloud Provider Up-scaling with Instance Pools Cluster Manager 1. Scheduler Info Idle Instances Pool Manager Cloud Instance Allocation Requests (background) Pool Expansion (55 seconds) (background) 2. Pool Instance Allocation Requests 4. Cluster Expansion (~7ms) 3. Pool Instance Allocation 5. Launch Containers and start Spark (25 secs)
  • 29. 29 Cloud Provider Down-scaling with Instance Pools Cluster Manager 1. Scheduler Info Idle Instances Pool Manager Cloud Instance Termination Requests (background) Pool Retraction (background) 3. Idle instance to pool 2. Remove Instance From Cluster
  • 30. Node Start Time with Pools 30 ● Instance allocation is mostly gone (milliseconds). ● ~10 second container start (DBR already downloaded by the pool). ● ~15 seconds to start master/worker ● Median cluster starts is 50 seconds. 2.84x faster.
  • 31. Optimized Autoscaling with Fixed Pool 31 executors TPCDS q80-v2.4 every 20 minutes Optimized Autoscaling w/ Fixed Pool, 20 Worker Cluster “spark.databricks.aggressiveWindowDownS” -> “40” Execution Time: 427 seconds Cost: $10.05 per hour Cluster Start
  • 32. Optimized Autoscaling with Dynamic Pool 32 executors TPCDS q80-v2.4 every 20 minutes Optimized Autoscaling w/ Dynamic Pool, 20 Worker Cluster “spark.databricks.aggressiveWindowDownS” -> “40” Pool has 5 min idle with 10 minute idle timeout. Execution Time: 557 seconds Cost: $10.05 per hour Cluster Start
  • 33. Cost vs Execution Time 33
  • 34. Conclusions Spark is at the core but there’s a lot more around it to bring it into production 1. Latency-aware integration 2. Efficient resource usage 3. Fast provisioning of resources 34 Control Plane
  • 35. 35 Thanks! Chris Stevens - linkedin.com/in/chriscstevens Bogdan Ghit - linkedin.com/in/bogdanghit
  • 36. Comparison 36 Fixed Cluster Ephemeral Cluster Standard Autoscaling Optimized Autoscaling Fixed Pool Dynamic Pool Start Latency 218s -4s +499s +434s -27s +280s Repeat Latency 15s +200s +164s +411s +71s +251s Start Duration 407s -8s +323s +278s +75s +283s Repeat Duration 315s +84s +3s +352s +123s +186s EC2 Cost $6.55 -$3.19 -$1.04 -$2.44 -$0 -$0.38 DBU Cost $8.40 -$4.10 -$1.33 -$3.13 $-3.83 -$3.35