SlideShare a Scribd company logo
1 of 26
Download to read offline
Presto on Spark
A Tale of Two Computation Engines
Andrii Rosa
Software Engineer
Wenlei Xie
Research Scientist
Agenda
Introduction
Design & Implementation
Introduction
SQL Use Cases @ Facebook
▪ Reporting and Dashboarding
▪ Low latency (<1s)
▪ High QPS
▪ Presto
▪ Adhoc Analysis
▪ Moderate latency (seconds to minutes)
▪ Mainly Presto
▪ Batch Processing
▪ High latency (up to tens of hours)
▪ Both Presto and Spark
Towards an Unified SQL Experience
▪ Batch Processing Uses Both Presto and Spark
▪ Presto doesn’t scale for large batch pipelines
▪ Inconsistent SQL Experience
▪ SQL Dialect
▪ Subtle Semantic Difference
▪ Null vs. Exception
▪ UDF/UDAF
▪ Best Practice
Presto and Spark Architecture
▪ Designed for latency
▪ MPP Architecture
▪ In-memory shuffle
▪ Shared executor
▪ Designed for Scalability
▪ MapReduce Architecture
▪ Disaggregated shuffle
▪ Isolated executor
SparkPresto
Why Presto (or Other MPPs) Doesn’t Scale?
A Decade-Old Question
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
Scan
Scan
Scan
Aggr
Aggr
Aggr
In-memory shuffle
on custkey
Aggr
Execute everything concurrently
- inflexible schedule
- fault-tolerant is difficult
- might exceed memory limit
Presto Unlimited
Brings MapReduce-style execution to MPP architectured runtime
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
Scan
Scan
Scan
Write
Write
Write
In-memory shuffle
on custkey
Write
Independent partition execution on
“reducer” side:
- partition-level retry
- schedule a few partitions
concurrently to reduce memory
Aggr
Aggr
Aggr
Aggr
Presto-on-Spark
Executes Presto Evaluation Library on Spark Runtime
SELECT custkey, SUM(totalprice)
FROM orders
GROUP BY custkey
Scan
Scan
Scan
Read
Read
Read
Disagg shuffle
on custkey
Read
Aggr
Aggr
Aggr
Aggr
Stage 1
Stage 2
Why Presto-on-Spark
▪ What are Missing?
▪ Full Disaggregated Shuffle
▪ Isolated Executor
▪ Different Scheduler, Speculative Execution, etc, ...
▪ Embed a “mini-Spark Runtime” inside Presto!
Instead of Making Presto Unlimited More Scalable?
Design & Implementation
Presto-on-Spark Design Principles
▪ Presto is run as a library
▪ Presto cluster is not needed to run
Presto-on-Spark
▪ Presto on Spark is just a Spark application
▪ Query is passed as a parameter
▪ Implemented on RDD level
▪ Operations done by Presto are opaque to Spark
engine
spark-submit
# spark-submit 
--master spark://spark-master:7077 
presto-spark-launcher-*.jar 
--package presto-spark-package-*.tar.gz 
--config ./config.properties 
--catalogs ./catalogs 
--catalog hive 
--schema default 
--file /tmp/query.sql
Planning
Logical PlanQuery Distributed Plan
SELECT *
FROM lineitem l
JOIN orders o
ON l.orderkey = o.orderkey
WHERE o.orderstatus = 'O'
TABLE SCAN
[lineitem]
JOIN
[on orderkey]
TABLE SCAN
[orders]
FILTER
[o.orderstatus = 'O']
Fragment 1 Fragment 2
Fragment 0
PARTITION BY
[orderkey]
PARTITION BY
[orderkey]
FILTER
[o.orderstatus = 'O']
TABLE SCAN
[orders]
TABLE
SCAN
[lineitem]
JOIN
[on orderkey]
Translating to RDD
Fragment 1 Fragment 2
Fragment 0
PARTITION BY
[orderkey]
PARTITION BY
[orderkey]
FILTER
[o.orderstatus = 'O']
TABLE SCAN
[orders]
TABLE
SCAN
[lineitem]
JOIN
[on orderkey]
sparkContext
.parallelize(lineitemSplits)
PairRDD<Integer, Row> = rdd
.mapPartitionsToPair(
fragment1Processor)
sparkContext
.parallelize(ordersSplits)
PairRDD<Integer, Row> = rdd
.mapPartitionsToPair(
fragment2Processor)
pairRdd.partitionBy() pairRdd.partitionBy()
lineitemRdd.zipPartitions(ordersRdd,
fragment0Processor)
Spark DAG
Execution
Fragment 2
FILTER
[o.orderstatus = 'O']
TABLE SCAN
[orders]
Fragment 0
JOIN
[on orderkey]
Leaf Fragment
Intermediate Fragment
Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits)
Iterator<Tuple2<Integer, PrestoSparkRow>> process(
List<Iterator<Tuple2<Integer, PrestoSparkRow>>> inputs)
Columnar Format to Row Format Conversion
STAGE 1
INPUT OUTPUTPROJECT FILTERPAGE PAGE PAGEROW ROW
STAGE 2
INPUT OUTPUT
GROUP
BY
FILTERPAGE PAGE PAGEROW ROW
COL 1 VAL 1
COL 1 VAL 2
COL 1 VAL 3
COL 1 VAL 4
COL 1 VAL 5
COL 2 VAL 1
COL 2 VAL 2
COL 2 VAL 3
COL 2 VAL 4
COL 2 VAL 5
COL 3 VAL 1
COL 3 VAL 2
COL 3 VAL 3
COL 3 VAL 4
COL 3 VAL 5
[COL 1 VAL 1], [COL 2 VAL 1], [COL 3 VAL 1]
SHUFFLE
Broadcast Join
Distributed Plan
TABLE SCAN
[lineitem]
JOIN
[on orderkey]
TABLE SCAN
[orders]
FILTER
[o.orderstatus = 'O']
Logical Plan
Fragment 1
Fragment 0
BROADCAST
FILTER
[o.orderstatus = 'O']
TABLE SCAN
[orders]
TABLE
SCAN
[lineitem]
JOIN
[on orderkey]
Job 1
Job 0
Translating to RDD
Fragment 1
Fragment 0
BROADCAST
FILTER
[o.orderstatus = 'O']
TABLE SCAN
[orders]
TABLE
SCAN
[lineitem]
JOIN
[on orderkey]
sparkContext
.parallelize(lineitemSplits)
RDD<Row> = rdd
.mapPartitions(fragment0Processor)
sparkContext
.parallelize(ordersSplits)
RDD<Row> = rdd
.mapPartitions(fragment1Processor)
sc.broadcast(ordersRdd.collect())
Spark DAG
Execution
Broadcast Fragment
Join Fragment
Fragment 1
FILTER
[o.orderstatus = 'O']
TABLE SCAN
[orders]
Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits)
Fragment 0
TABLE
SCAN
[lineitem]
JOIN
[on orderkey]
Iterator<Tuple2<Integer, PrestoSparkRow>> process(
List<Split> splits,
List<Broadcast<List<PrestoSparkRow>>> broadcasts)
Threading Model
Presto Task
Spark Task
INPUT OUTPUTPROJECT UNNEST FILTER
INPUT
OUTPUTLOCAL
SHUFFLE
PROJECT UNNEST FILTERPROJECT
PROJECT UNNEST FILTERPROJECT
PROJECT UNNEST FILTERPROJECT
PROJECT UNNEST FILTERPROJECT
PROJECT
Classloader Isolation
Spark Classloader (presto-spark-launcher.jar)
Presto Classloader (presto-spark-package.tar.gz)
int main() {
...
sparkContext.addFile(
“presto-spark-package.tar.gz”
)
...
...
IPrestoSparkService service =
createService(
“presto-spark-package.tar.gz”
)
...
}
Hive Plugin
Classloader
Pinot Plugin
Classloader
MySQL Plugin
Classloader
IPrestoSparkService {
getQueryExecutionFactory();
getTaskExecutorFactory();
}
Current Status
▪ Under Active Development on GitHub: #13856
▪ Most query shapes supported
▪ Working on supporting remaining query shapes (some flavors of UNION ALL)
▪ Preparing the feature to become GA
▪ Initial Scalability Tests
▪ Scale to 10,000 Mappers / Reducers
▪ Supports Queries Require 50TB+ Distributed Memory in Presto
▪ Up to 3x Wall Time Reduction for Presto Large Batch Queries (6h in Presto vs 2h in Presto on Spark)
Q&A

More Related Content

What's hot

A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInDatabricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 

What's hot (20)

A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedIn
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 

Similar to Presto on Apache Spark: A Tale of Two Computation Engines

Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
 
How Spark Does It Internally?
How Spark Does It Internally?How Spark Does It Internally?
How Spark Does It Internally?Knoldus Inc.
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkWill Du
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to SparkKyle Burke
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0datamantra
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...Databricks
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersDatabricks
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkEren Avşaroğulları
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkDongwon Kim
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 

Similar to Presto on Apache Spark: A Tale of Two Computation Engines (20)

Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
How Spark Does It Internally?
How Spark Does It Internally?How Spark Does It Internally?
How Spark Does It Internally?
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 

Recently uploaded (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 

Presto on Apache Spark: A Tale of Two Computation Engines

  • 1.
  • 2. Presto on Spark A Tale of Two Computation Engines Andrii Rosa Software Engineer Wenlei Xie Research Scientist
  • 5. SQL Use Cases @ Facebook ▪ Reporting and Dashboarding ▪ Low latency (<1s) ▪ High QPS ▪ Presto ▪ Adhoc Analysis ▪ Moderate latency (seconds to minutes) ▪ Mainly Presto ▪ Batch Processing ▪ High latency (up to tens of hours) ▪ Both Presto and Spark
  • 6. Towards an Unified SQL Experience ▪ Batch Processing Uses Both Presto and Spark ▪ Presto doesn’t scale for large batch pipelines ▪ Inconsistent SQL Experience ▪ SQL Dialect ▪ Subtle Semantic Difference ▪ Null vs. Exception ▪ UDF/UDAF ▪ Best Practice
  • 7. Presto and Spark Architecture ▪ Designed for latency ▪ MPP Architecture ▪ In-memory shuffle ▪ Shared executor ▪ Designed for Scalability ▪ MapReduce Architecture ▪ Disaggregated shuffle ▪ Isolated executor SparkPresto
  • 8. Why Presto (or Other MPPs) Doesn’t Scale? A Decade-Old Question SELECT custkey, SUM(totalprice) FROM orders GROUP BY custkey Scan Scan Scan Aggr Aggr Aggr In-memory shuffle on custkey Aggr Execute everything concurrently - inflexible schedule - fault-tolerant is difficult - might exceed memory limit
  • 9. Presto Unlimited Brings MapReduce-style execution to MPP architectured runtime SELECT custkey, SUM(totalprice) FROM orders GROUP BY custkey Scan Scan Scan Write Write Write In-memory shuffle on custkey Write Independent partition execution on “reducer” side: - partition-level retry - schedule a few partitions concurrently to reduce memory Aggr Aggr Aggr Aggr
  • 10. Presto-on-Spark Executes Presto Evaluation Library on Spark Runtime SELECT custkey, SUM(totalprice) FROM orders GROUP BY custkey Scan Scan Scan Read Read Read Disagg shuffle on custkey Read Aggr Aggr Aggr Aggr Stage 1 Stage 2
  • 11. Why Presto-on-Spark ▪ What are Missing? ▪ Full Disaggregated Shuffle ▪ Isolated Executor ▪ Different Scheduler, Speculative Execution, etc, ... ▪ Embed a “mini-Spark Runtime” inside Presto! Instead of Making Presto Unlimited More Scalable?
  • 13. Presto-on-Spark Design Principles ▪ Presto is run as a library ▪ Presto cluster is not needed to run Presto-on-Spark ▪ Presto on Spark is just a Spark application ▪ Query is passed as a parameter ▪ Implemented on RDD level ▪ Operations done by Presto are opaque to Spark engine spark-submit # spark-submit --master spark://spark-master:7077 presto-spark-launcher-*.jar --package presto-spark-package-*.tar.gz --config ./config.properties --catalogs ./catalogs --catalog hive --schema default --file /tmp/query.sql
  • 14. Planning Logical PlanQuery Distributed Plan SELECT * FROM lineitem l JOIN orders o ON l.orderkey = o.orderkey WHERE o.orderstatus = 'O' TABLE SCAN [lineitem] JOIN [on orderkey] TABLE SCAN [orders] FILTER [o.orderstatus = 'O'] Fragment 1 Fragment 2 Fragment 0 PARTITION BY [orderkey] PARTITION BY [orderkey] FILTER [o.orderstatus = 'O'] TABLE SCAN [orders] TABLE SCAN [lineitem] JOIN [on orderkey]
  • 15. Translating to RDD Fragment 1 Fragment 2 Fragment 0 PARTITION BY [orderkey] PARTITION BY [orderkey] FILTER [o.orderstatus = 'O'] TABLE SCAN [orders] TABLE SCAN [lineitem] JOIN [on orderkey] sparkContext .parallelize(lineitemSplits) PairRDD<Integer, Row> = rdd .mapPartitionsToPair( fragment1Processor) sparkContext .parallelize(ordersSplits) PairRDD<Integer, Row> = rdd .mapPartitionsToPair( fragment2Processor) pairRdd.partitionBy() pairRdd.partitionBy() lineitemRdd.zipPartitions(ordersRdd, fragment0Processor)
  • 17. Execution Fragment 2 FILTER [o.orderstatus = 'O'] TABLE SCAN [orders] Fragment 0 JOIN [on orderkey] Leaf Fragment Intermediate Fragment Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits) Iterator<Tuple2<Integer, PrestoSparkRow>> process( List<Iterator<Tuple2<Integer, PrestoSparkRow>>> inputs)
  • 18. Columnar Format to Row Format Conversion STAGE 1 INPUT OUTPUTPROJECT FILTERPAGE PAGE PAGEROW ROW STAGE 2 INPUT OUTPUT GROUP BY FILTERPAGE PAGE PAGEROW ROW COL 1 VAL 1 COL 1 VAL 2 COL 1 VAL 3 COL 1 VAL 4 COL 1 VAL 5 COL 2 VAL 1 COL 2 VAL 2 COL 2 VAL 3 COL 2 VAL 4 COL 2 VAL 5 COL 3 VAL 1 COL 3 VAL 2 COL 3 VAL 3 COL 3 VAL 4 COL 3 VAL 5 [COL 1 VAL 1], [COL 2 VAL 1], [COL 3 VAL 1] SHUFFLE
  • 19. Broadcast Join Distributed Plan TABLE SCAN [lineitem] JOIN [on orderkey] TABLE SCAN [orders] FILTER [o.orderstatus = 'O'] Logical Plan Fragment 1 Fragment 0 BROADCAST FILTER [o.orderstatus = 'O'] TABLE SCAN [orders] TABLE SCAN [lineitem] JOIN [on orderkey]
  • 20. Job 1 Job 0 Translating to RDD Fragment 1 Fragment 0 BROADCAST FILTER [o.orderstatus = 'O'] TABLE SCAN [orders] TABLE SCAN [lineitem] JOIN [on orderkey] sparkContext .parallelize(lineitemSplits) RDD<Row> = rdd .mapPartitions(fragment0Processor) sparkContext .parallelize(ordersSplits) RDD<Row> = rdd .mapPartitions(fragment1Processor) sc.broadcast(ordersRdd.collect())
  • 22. Execution Broadcast Fragment Join Fragment Fragment 1 FILTER [o.orderstatus = 'O'] TABLE SCAN [orders] Iterator<Tuple2<Integer, PrestoSparkRow>> process(List<Split> splits) Fragment 0 TABLE SCAN [lineitem] JOIN [on orderkey] Iterator<Tuple2<Integer, PrestoSparkRow>> process( List<Split> splits, List<Broadcast<List<PrestoSparkRow>>> broadcasts)
  • 23. Threading Model Presto Task Spark Task INPUT OUTPUTPROJECT UNNEST FILTER INPUT OUTPUTLOCAL SHUFFLE PROJECT UNNEST FILTERPROJECT PROJECT UNNEST FILTERPROJECT PROJECT UNNEST FILTERPROJECT PROJECT UNNEST FILTERPROJECT PROJECT
  • 24. Classloader Isolation Spark Classloader (presto-spark-launcher.jar) Presto Classloader (presto-spark-package.tar.gz) int main() { ... sparkContext.addFile( “presto-spark-package.tar.gz” ) ... ... IPrestoSparkService service = createService( “presto-spark-package.tar.gz” ) ... } Hive Plugin Classloader Pinot Plugin Classloader MySQL Plugin Classloader IPrestoSparkService { getQueryExecutionFactory(); getTaskExecutorFactory(); }
  • 25. Current Status ▪ Under Active Development on GitHub: #13856 ▪ Most query shapes supported ▪ Working on supporting remaining query shapes (some flavors of UNION ALL) ▪ Preparing the feature to become GA ▪ Initial Scalability Tests ▪ Scale to 10,000 Mappers / Reducers ▪ Supports Queries Require 50TB+ Distributed Memory in Presto ▪ Up to 3x Wall Time Reduction for Presto Large Batch Queries (6h in Presto vs 2h in Presto on Spark)
  • 26. Q&A