Optimizing S3 Write-heavy Spark workloads
Apache Spark meetup, Qubole office, Bangalore
3rd March 2018
bharatb@qubole.com
Senior Engineering Manager, Spark team, Qubole
Context
● Cloud, object storage, ephemeral clusters, large writes
● df.write().save()
● df.write().saveAsTable()
● spark.sql("INSERT INTO ...")
● spark.sql("INSERT OVERWRITE ...")
● spark.sql("ALTER TABLE ... RECOVER PARTITIONS")
● Changes in spark/hive itself rather than in user programs
Agenda
● Problems with S3 writes
● Spark writes
● Faster hive writes, iteration 1
● Faster hive writes, iteration 2
● Fault tolerant DFOC
● Faster recover partitions
Part 1: Problems with S3 writes
Problems when writing to S3: EC
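- Eventual consistency (EC) problems
- HEAD (404) -> PUT -> GET: the GET may still return 404
- PUT -> PUT -> GET: the GET may return the older object
- PUT -> DELETE -> LIST-PARENT: the listing may still show the deleted object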
Problems when writing to S3: Rename
Operation: rename s3://bucket/x to s3://bucket/y
- Step 1: copy x to y
- Step 2: delete x
- Copy is slow and depends on file size
- Two calls needed
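A minimal sketch of why a rename is expensive on an object store. It uses a hypothetical in-memory `store` dict in place of S3, and the `rename` helper is illustrative only, not an S3 or Hadoop API:

```python
def rename(store, src, dst):
    """Emulate a rename on an object store: a copy followed by a delete."""
    calls = []
    store[dst] = bytes(store[src])  # copy: cost grows with the object size
    calls.append("COPY")
    del store[src]                  # second call: delete the source key
    calls.append("DELETE")
    return calls


store = {"s3://bucket/x": b"payload"}
calls = rename(store, "s3://bucket/x", "s3://bucket/y")
```

If anything fails between the two calls, both keys exist, which is one reason rename-based commit protocols are fragile on S3.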
Problems when writing to S3: Failures
- Transient failures of S3 rest calls
- Throttling
Part 2: Spark writes
Two kinds of tables
Write path by table kind:
- Hive table: distributed write to hive staging dir, then Hive.loadTable / Hive.loadPartition called to move data to warehouse
- Datasource table: distributed write to final dest
Part 3: Faster hive table writes, iteration 1
Problem: loadPartition
Write path by table kind:
- Hive table: distributed write to hive staging dir, then Hive.loadTable / Hive.loadPartition called to move data to warehouse
- Datasource table: distributed write to final dest
Problem: loadPartition is slow
- Hive.replaceFiles / Hive.copyFiles primitive is used to move
data from hive staging dir to warehouse dir
- Rename done in the hive operations is slow and serialized
- No retries to account for transient failures
Problem: loadPartition has EC issues
- EC issues during the copy/move
- Some files written to the hive staging directory may not appear in
the listing done on the driver during Hive.replaceFiles
- Some deleted files may still appear in the listing (especially in the
FileOutputCommitter (FOC) v1 case)
Solution
- Parallelize Hive.copyFiles/Hive.replaceFiles
- Changes in hive codebase
- Algorithm:
- listFiles(src); renameFiles(src, dest)
- loop-until-no-change
- listFiles(src)
- renameFiles(src, dest)
Solution: Robustness
- Listing related
- diff(oldListing, newListing)
- if new files appear, rename them in this iteration
- if existing files disappear, don't try to rename them
- Rename related
- if a rename fails, retry it in the next iteration
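The list-and-rename loop with its robustness rules can be sketched as follows. This is a minimal illustration, not the actual Hive patch: `move_until_stable`, `list_files`, `rename_file` and the in-memory `src`/`dest` sets are hypothetical stand-ins for the staging directory, the warehouse directory and the S3 calls.

```python
def move_until_stable(list_files, rename_file, max_rounds=10):
    """Repeat list + rename until a round finds nothing left to move."""
    moved = set()
    for _ in range(max_rounds):
        # Only names in the fresh listing are considered, so files that
        # disappeared since the previous round are never touched.
        pending = set(list_files()) - moved
        if not pending:
            break                      # no change: we are done
        for name in sorted(pending):
            try:
                rename_file(name)
                moved.add(name)
            except OSError:
                continue               # failed rename: retried next round
    return moved


# Toy staging dir: "c" becomes visible only after the first listing,
# and the first rename of "b" fails with a transient error.
src, dest = {"a", "b"}, set()
late_files = ["c"]
failed_once = set()


def list_files():
    snapshot = set(src)
    if late_files:
        src.add(late_files.pop())
    return snapshot


def rename_file(name):
    if name == "b" and name not in failed_once:
        failed_once.add(name)
        raise OSError("transient failure")
    src.discard(name)
    dest.add(name)


moved = move_until_stable(list_files, rename_file)
```

In this toy run the second round picks up both the late file and the failed rename, and the third round sees an empty listing and stops.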
Solution: Performance
- Rename in parallel in a threadpool of 128 threads
- For INSERT INTO, find the N to use for file_copy_N, for all files
in the dest dir, in one shot
- Rename the biggest files first so that they don’t become
the long pole
- Rename the most recently modified files last (FIFO on time) so that
they get time to vanish
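A minimal sketch of the ordering and the thread pool. The slides do not say how the two ordering rules combine, so the `recent_cutoff` split is an assumption, and `rename_order` and `rename_all` are hypothetical names; only the pool size of 128 comes from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def rename_order(files, recent_cutoff):
    """files: list of (name, size_bytes, mtime) tuples.

    Files older than recent_cutoff go first, biggest first; files modified
    at or after recent_cutoff go last, oldest of them first.
    """
    settled = [f for f in files if f[2] < recent_cutoff]
    recent = [f for f in files if f[2] >= recent_cutoff]
    settled.sort(key=lambda f: -f[1])   # biggest first
    recent.sort(key=lambda f: f[2])     # FIFO on modification time
    return [f[0] for f in settled + recent]


def rename_all(names, rename_file, threads=128):
    """Run rename_file over names in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(rename_file, names))


files = [("small", 10, 100), ("big", 1000, 100),
         ("new", 500, 900), ("newer", 5000, 950)]
order = rename_order(files, recent_cutoff=800)
results = rename_all(order, lambda name: name.upper(), threads=4)
```

Submitting the largest files first means the longest copies start earliest, so they overlap with the many small ones instead of running alone at the end.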
Solution: Performance numbers
- INSERT OVERWRITE TABLE user PARTITION(date="2011")
SELECT userId, firstName, email FROM people
- For example, 100GB data spread over 10000 files
- Before optimization: 110 mins
- After optimization: 12 mins (not sensitive to file count)
Part 4: Faster hive table writes, iteration 2
Problem
- Can we write directly to the hive warehouse folder?
- Avoid the hive staging dir?
Solution
Write path by table kind:
- Hive table: distributed write to hive staging dir, then Hive.loadTable / Hive.loadPartition called to move data to warehouse
- Datasource table: distributed write to final dest
Solution: Algorithm
- InsertIntoHiveTable.run()
- if (useDirectWrites)
- InsertIntoHadoopFsRelationCommand
- FileFormatWriter.write(fileFormat)
- else // existing code
- FileFormatWriter.write(HiveFileFormat)
- Hive.loadTable / Hive.loadPartition
Solution: Write directly to the warehouse
- Use spark’s default write flow for hive tables as well
- Avoid using staging_dir
- Uses whichever OutputCommitter is active
- Changes in spark code base
- Cases: INSERT INTO/OVERWRITE + Static/dynamic partitions
- Except INSERT OVERWRITE involving dynamic partitions
- Con: Affects warehouse directory immediately on job start
Solution: Write directly to the warehouse
- Very good performance gains
- Hive.loadTable / Hive.loadPartition not needed
- Error recovery needs to be done carefully
- On failure, delete all files s3://bucket/path/*/*/*<jobId>*
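The cleanup step can be sketched as a filter over the keys under the table path. This assumes, as the pattern above suggests, that every file written by a job carries the job id in its file name; `files_to_delete` and the `job42` id format are hypothetical.

```python
def files_to_delete(keys, job_id):
    """Select keys whose file name (last path component) contains job_id."""
    return [k for k in keys if job_id in k.rsplit("/", 1)[-1]]


keys = [
    "s3://bucket/path/a=1/b=2/part-00000-job42.parquet",
    "s3://bucket/path/a=1/b=2/part-00000-job41.parquet",  # older job: keep
    "s3://bucket/path/a=1/b=3/part-00001-job42.parquet",
]
doomed = files_to_delete(keys, "job42")
```

Matching on the job id is what keeps the cleanup from touching files that earlier jobs wrote into the same partitions.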
Solution: Performance
- Data: 142 GB (Records - 149994000, Partitions - 9000)
- Each partition had one file
- Direct writes disabled: 7 hr, 30 min
- Direct writes enabled: 24.5 mins
- Spark distributed write was fast in both cases. In the first case
an extra move was needed.
Part 5: Fault tolerant DFOC
DirectFileOutputCommitter (DFOC)
- Directly write to output location
- Pros: No EC, high performance
- Cons: Speculation and task retries will fail
- Cons: Output is visible before the job finishes
Problem
- If you use DFOC, any task failure will cause job failure
- Empty S3 file is created even on task failure
- Retry will always fail with FileAlreadyExistsException
- 7/08/16 00:33:55 task-result-getter-1 WARN TaskSetManager: Lost task 0.1 in stage 42.0 (TID 5782, 10.23.7.190, executor 10):
org.apache.hadoop.fs.FileAlreadyExistsException:
s3n://bucket/path/2017/08/15/23/part-00000-017681ee-5206-4163-b4a9-a29cf8a67ab4.json.gz already exists
Solution: Overwrite if file already exists
- fs.create(path, false) -> fs.create(path, true)
- Spark changes - different across versions
- Hive changes - orc
- Parquet changes
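The effect of flipping the overwrite flag can be shown with a local-file analogy. This is not the Hadoop API: Python's exclusive-create mode `"x"` stands in for `fs.create(path, false)` and mode `"w"` for `fs.create(path, true)`, and `write_output` is a hypothetical helper.

```python
import os
import tempfile


def write_output(path, data, overwrite):
    """Create the output file; fail if it exists unless overwrite is set."""
    mode = "w" if overwrite else "x"
    with open(path, mode) as f:
        f.write(data)


path = os.path.join(tempfile.mkdtemp(), "part-00000")
open(path, "w").close()  # empty file left behind by a failed task attempt

try:
    write_output(path, "retry", overwrite=False)
    retry_without_overwrite = "ok"
except FileExistsError:
    retry_without_overwrite = "failed"

write_output(path, "retry", overwrite=True)
with open(path) as f:
    content = f.read()
```

With overwrite enabled the retried attempt replaces the leftover empty file instead of failing on it.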
Part 6: Faster recover partitions
Problem:
- alter table recover partitions is slow
- Algorithm
- Generate list of all partitions and their statistics
- Add partitions to metastore
- Example: two partition keys with 100 values each (10k partitions in
total) take close to (10+20) mins to recover (spark 2.1.0): about 10 mins
to gather partitions and stats, 20 mins to add them to the metastore
Solution
- Use faster variant of S3 listing, prefix based
- 10 mins for gathering partitions and stats reduced to 10 secs
- Total time is now (10 secs + 20 mins), roughly a 33% improvement
- Spark-only changes
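One way to read "prefix based" listing is that a single flat listing of every key under the table prefix replaces a directory-by-directory walk, and the partitions are derived from the key names. The sketch below works on a plain list of keys rather than calling S3; `partitions_from_keys` and the sample keys are hypothetical. It also checks the quoted 33% figure.

```python
def partitions_from_keys(keys, table_prefix):
    """Derive partition specs from flat object keys under table_prefix."""
    parts = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        dirs = key[len(table_prefix):].split("/")[:-1]  # drop the file name
        if dirs and all("=" in d for d in dirs):
            parts.add(tuple(tuple(d.split("=", 1)) for d in dirs))
    return sorted(parts)


keys = [
    "warehouse/t/a=1/b=x/part-0",
    "warehouse/t/a=1/b=x/part-1",
    "warehouse/t/a=2/b=y/part-0",
    "warehouse/t/_SUCCESS",          # not inside a partition: ignored
]
partitions = partitions_from_keys(keys, "warehouse/t/")

# Timing from the slides: (10 + 20) mins before, 10 secs + 20 mins after.
before_s = (10 + 20) * 60
after_s = 10 + 20 * 60
improvement_pct = round(100 * (before_s - after_s) / before_s)
```

The remaining 20 minutes are spent adding partitions to the metastore, which this change does not touch.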
Thank you - Q&A
- rohitk@qubole.com
- prakharj@qubole.com
- bharatb@qubole.com

Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
 
Cyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & PricingCyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & Pricing
 
Semantic Web and organizational data .pptx
Semantic Web and organizational data .pptxSemantic Web and organizational data .pptx
Semantic Web and organizational data .pptx
 
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
 
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
 
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
 
History and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big DataHistory and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big Data
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 
Practical Research for grade 12 students
Practical Research for grade 12 studentsPractical Research for grade 12 students
Practical Research for grade 12 students
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
 
Self-healing Security Systems - CloudIOTEnterpriseSystems-Group5.pptx
Self-healing Security Systems - CloudIOTEnterpriseSystems-Group5.pptxSelf-healing Security Systems - CloudIOTEnterpriseSystems-Group5.pptx
Self-healing Security Systems - CloudIOTEnterpriseSystems-Group5.pptx
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
 

Optimizing S3 Write-heavy Spark workloads

  • 1. Optimizing S3 Write-heavy Spark workloads Apache Spark meetup, Qubole office, Bangalore 3rd March 2018 bharatb@qubole.com Senior Engineering Manager, Spark team, Qubole
  • 2. Context ● Cloud, object storage, ephemeral clusters, large writes ● df.write().save() ● df.write().saveAsTable() ● spark.sql("INSERT INTO ….") ● spark.sql("INSERT OVERWRITE …") ● spark.sql("ALTER TABLE RECOVER PARTITIONS") ● Changes in spark/hive itself rather than in user programs
  • 3. Agenda ● Problems with S3 writes ● Spark writes ● Faster hive writes, iteration 1 ● Faster hive writes, iteration 2 ● Fault tolerant DFOC ● Faster recover partitions
  • 4. Part 1: Problems with S3 writes
  • 5. Problems when writing to S3: EC - Eventual consistency problems - HEAD (404) -> PUT -> GET - PUT -> PUT -> GET - PUT -> DELETE -> LIST-PARENT
  • 6. Problems when writing to S3: Rename Operation: Rename s3://bucket/x to s3://bucket/y Copy x to y Delete x - Copy is slow and depends on file size - Two calls needed
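The rename on slide 6 can be pictured with a toy in-memory model. This is only a sketch: `ToyObjectStore` and its methods are hypothetical names invented for illustration, not an S3 client API. It shows that a rename costs two calls and that the copy cost grows with the object size.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not a real S3 API)."""

    def __init__(self):
        self.objects = {}      # key -> bytes
        self.calls = []        # log of (operation, key) pairs
        self.bytes_copied = 0  # total bytes moved by copy calls

    def put(self, key, data):
        self.objects[key] = data
        self.calls.append(("PUT", key))

    def copy(self, src, dest):
        data = self.objects[src]
        self.objects[dest] = data
        self.bytes_copied += len(data)  # cost is proportional to object size
        self.calls.append(("COPY", dest))

    def delete(self, key):
        del self.objects[key]
        self.calls.append(("DELETE", key))

    def rename(self, src, dest):
        # There is no native rename: it is a copy followed by a delete.
        self.copy(src, dest)
        self.delete(src)


store = ToyObjectStore()
store.put("bucket/x", b"a" * 1024)
store.rename("bucket/x", "bucket/y")
```

After the rename, `store.calls` holds one COPY and one DELETE, and `store.bytes_copied` equals the size of the object.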
  • 7. Problems when writing to S3: Failures - Transient failures of S3 rest calls - Throttling
  • 8. Part 2: Spark writes
  • 9. Two kinds of tables - Hive table: distributed write to hive staging dir, then Hive.loadTable / Hive.loadPartition called to move data to warehouse - Datasource table: distributed write to final dest
  • 10. Part 3: Faster hive table writes, iteration 1
  • 11. Problem: loadPartition - Hive table: distributed write to hive staging dir, then Hive.loadTable / Hive.loadPartition called to move data to warehouse - Datasource table: distributed write to final dest
  • 12. Problem: loadPartition is slow - Hive.replaceFiles / Hive.copyFiles primitive is used to move data from hive staging dir to warehouse dir - Rename done in the hive operations is slow and serialized - No retries to account for transient failures
  • 13. Problem: loadPartition has EC issues - EC issues during the copy/move - A few files written to the hive staging directory may not appear in the listing done on the driver during Hive.replaceFiles - A few deleted files may still appear in the listing (especially in the FileOutputCommitter (FOC) v1 case)
  • 14. Solution - Parallelize Hive.copyFiles/Hive.replaceFiles - Changes in hive codebase - Algorithm: - listFiles(src); renameFiles(src, dest) - loop-until-no-change - listFiles(src) - renameFiles(src, dest)
  • 15. Solution: Robustness - Listing related - diff(oldListing, newListing) - if new files appear, rename them in this iteration - if existing files disappear, don't try to rename them - Rename related - if a rename fails, retry it in the next iteration
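The list-and-rename loop from slides 14 and 15 can be sketched as follows. This is a minimal illustration on a local filesystem, not the actual change made to the Hive codebase: `move_until_stable` and `max_rounds` are hypothetical names, and "no change" is interpreted here as "a full round moved nothing".

```python
import os
import tempfile


def move_until_stable(src, dest, max_rounds=10):
    """List src and rename its files into dest, repeating until a round moves nothing."""
    moved = []
    for _ in range(max_rounds):
        progressed = False
        for name in sorted(os.listdir(src)):
            try:
                os.rename(os.path.join(src, name), os.path.join(dest, name))
            except FileNotFoundError:
                # The file was listed but has disappeared: do not try to rename it.
                continue
            except OSError:
                # The rename failed: leave the file for the next iteration.
                continue
            moved.append(name)
            progressed = True
        if not progressed:
            break  # nothing changed in this round, so the loop has converged
    return moved


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dest:
    for i in range(3):
        with open(os.path.join(src, f"part-{i}"), "w") as f:
            f.write("data")
    moved = move_until_stable(src, dest)
    remaining = os.listdir(src)
    landed = sorted(os.listdir(dest))
```

Files that show up in a later listing are picked up by the next round, which is how late-appearing entries under eventual consistency would be handled in this sketch.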
  • 16. Solution: Performance - Rename in parallel in a threadpool of 128 threads - For INSERT INTO, find the N to use for file_copy_N, for all files in the dest dir, in one shot - Rename the biggest files first so that they don't become the long pole - Rename the recently modified files last (FIFO on time) so that they get time to vanish
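One way to combine the two ordering rules with the thread pool is sketched below. The slides do not say how the rules interact, so the split is an assumption: files modified within `recent_window_s` seconds go last (oldest first), and everything else goes biggest first. The names `rename_order`, `rename_all`, and `recent_window_s` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def rename_order(files, now, recent_window_s=60):
    """files: list of (name, size_bytes, mtime_s). Returns names in submission order."""
    settled = [f for f in files if now - f[2] > recent_window_s]
    recent = [f for f in files if now - f[2] <= recent_window_s]
    settled.sort(key=lambda f: f[1], reverse=True)  # biggest files first
    recent.sort(key=lambda f: f[2])                 # FIFO on modification time
    return [f[0] for f in settled + recent]


def rename_all(names, rename_fn, workers=128):
    """Run rename_fn over names in a thread pool, queuing tasks in the given order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(rename_fn, name) for name in names]
        return [future.result() for future in futures]


files = [("a", 10, 100.0), ("b", 500, 100.0), ("c", 50, 995.0), ("d", 5, 990.0)]
order = rename_order(files, now=1000.0)
# Stand-in for the real rename call: just report the name that was handled.
results = rename_all(order, lambda name: name.upper())
```

Submission order only controls when each task is queued; with 128 workers, completion order is not guaranteed.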
  • 17. Solution: Performance numbers - INSERT OVERWRITE TABLE user PARTITION(date="2011") SELECT userId, firstName, email FROM people - For example, 100GB data spread over 10000 files - Before optimization: 110 mins - After optimization: 12 mins (not sensitive to file count)
  • 18. Part 4: Faster hive table writes, iteration 2
  • 19. Problem - Can we write directly to the hive warehouse folder? - Avoid hive staging-dir ?
  • 20. Solution - Hive table: distributed write to hive staging dir, then Hive.loadTable / Hive.loadPartition called to move data to warehouse - Datasource table: distributed write to final dest
  • 21. Solution: Algorithm - InsertIntoHiveTable.run() - if (useDirectWrites) - InsertIntoHadoopFsRelationCommand - FileFormatWriter.write(fileFormat) - else // existing code - FileFormatWriter.write(HiveFileFormat) - Hive.loadTable / Hive.loadPartition
  • 22. Solution: Write directly to the warehouse - Use spark's default write flow for hive tables also - Avoid using staging_dir - Uses whichever OutputCommitter is active - Changes in spark code base - Cases: INSERT INTO/OVERWRITE + Static/dynamic partitions - Except INSERT OVERWRITE involving dynamic partitions - Con: Affects warehouse directory immediately on job start
  • 23. Solution: Write directly to the warehouse - Very good performance gains - Hive.loadTable / Hive.loadPartition not needed - Error recovery needs to be done carefully - On failure, delete all files s3://bucket/path/*/*/*<jobId>*
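The cleanup rule on slide 23 can be sketched as a glob over the output tree. This is an illustration on a local directory standing in for `s3://bucket/path`, assuming two partition levels as in the pattern on the slide; `cleanup_failed_job` and the file names are hypothetical.

```python
import glob
import os
import tempfile


def cleanup_failed_job(root, job_id):
    """Delete every file two partition levels below root whose name contains job_id."""
    pattern = os.path.join(glob.escape(root), "*", "*", f"*{glob.escape(job_id)}*")
    removed = []
    for path in glob.glob(pattern):
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)


with tempfile.TemporaryDirectory() as root:
    partition = os.path.join(root, "date=2011", "hour=01")
    os.makedirs(partition)
    for name in ("part-0-jobA.gz", "part-1-jobA.gz", "part-0-jobB.gz"):
        open(os.path.join(partition, name), "w").close()
    removed = cleanup_failed_job(root, "jobA")
    survivors = os.listdir(partition)
```

Only files tagged with the failed job's id are removed, so output from other jobs in the same partitions is left alone.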
  • 24. Solution: Performance - Data: 142 GB (Records - 149994000, Partitions - 9000) - Each partition had one file - Direct writes disabled: 7 hr, 30 min - Direct writes enabled: 24.5 mins - Spark distributed write was fast in both cases. In the first case an extra move was needed.
  • 25. Part 5: Fault tolerant DFOC
  • 26. DirectFileOutputCommitter (DFOC) - Directly write to output location - Pros: No EC, high performance - Cons: Speculation and task retries will fail - Cons: Output is visible before the job finishes
  • 27. Problem - If you use DFOC, any task failure will cause job failure - Empty S3 file is created even on task failure - Retry will always fail with FileAlreadyExistsException - 7/08/16 00:33:55 task-result-getter-1 WARN TaskSetManager: Lost task 0.1 in stage 42.0 (TID 5782, 10.23.7.190, executor 10): org.apache.hadoop.fs.FileAlreadyExistsException: s3n://bucket/path/2017/08/15/23/part-00000-017681ee-5206-4163-b4a9-a29cf8a67ab4.json.gz already exists
  • 28. Solution: Overwrite if file already exists - fs.create(path, false) -> fs.create(path, true) - Spark changes - different across versions - Hive changes - orc - Parquet changes
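The effect of flipping the overwrite flag can be shown with a local-file analogue. This is a sketch only: the `create` helper is hypothetical and uses Python's exclusive-create mode to mimic `fs.create(path, false)`, not the Hadoop FileSystem API itself.

```python
import os
import tempfile


def create(path, overwrite):
    """Open path for writing; without overwrite, fail if the file already exists."""
    return open(path, "w" if overwrite else "x")


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "part-00000.json")

    # First task attempt: the file is created, then the task fails, leaving it empty.
    create(path, overwrite=False).close()

    # Retry without overwrite: fails because the empty file is already there.
    try:
        create(path, overwrite=False).close()
        retry_without_overwrite = "ok"
    except FileExistsError:
        retry_without_overwrite = "FileExistsError"

    # Retry with overwrite: replaces the leftover file and succeeds.
    with create(path, overwrite=True) as f:
        f.write("{}\n")
    with open(path) as f:
        final_content = f.read()
```

With overwrite enabled, a retried task replaces whatever the failed attempt left behind instead of aborting the job.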
  • 29. Part 6: Faster recover partitions
  • 30. Problem: - alter table recover partitions is slow - Algorithm - Generate list of all partitions and their statistics - Add partitions to metastore - Example: Two partition keys, 100 values each, 10k partitions in total - takes close to (10+20) mins to recover partitions (spark 2.1.0)
  • 31. Solution - Use faster variant of S3 listing, prefix based - 10 mins for gathering partitions and stats reduced to 10 secs - Now total time is (10 secs + 20 mins), 33% improvement - Spark-only changes
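The idea behind prefix-based listing is that one flat listing of all keys under the table prefix is enough to derive the partitions, with no directory-by-directory walk. The slides do not describe the implementation, so the following is only a sketch: `partitions_from_keys`, the key layout, and the column names are assumptions, and the key list stands in for the result of a single prefix listing.

```python
def partitions_from_keys(keys, table_prefix, partition_cols):
    """Derive partition specs from a flat list of object keys under table_prefix."""
    partitions = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        dirs = key[len(table_prefix):].split("/")[:-1]  # drop the file name
        if len(dirs) != len(partition_cols):
            continue  # not a data file at the expected partition depth
        spec = []
        for col, segment in zip(partition_cols, dirs):
            name, sep, value = segment.partition("=")
            if sep != "=" or name != col:
                break  # not a col=value directory for this table
            spec.append((col, value))
        else:
            partitions.add(tuple(spec))
    return sorted(partitions)


keys = [
    "warehouse/t/country=in/year=2017/part-0",
    "warehouse/t/country=in/year=2017/part-1",
    "warehouse/t/country=us/year=2018/part-0",
    "warehouse/t/_SUCCESS",
    "warehouse/other/country=in/year=2017/part-0",
]
found = partitions_from_keys(keys, "warehouse/t/", ["country", "year"])
```

Here `found` contains two partitions, (country=in, year=2017) and (country=us, year=2018), derived without listing any intermediate directory.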
  • 32. Thank you - Q&A - rohitk@qubole.com - prakharj@qubole.com - bharatb@qubole.com