This talk breaks down merge in Delta Lake, explaining what is actually happening under the hood, and then shows how you can optimize a merge. Some code snippets and sample configs will also be shared.
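As a rough illustration of the kind of merge being discussed (not taken from the talk itself), here is a minimal PySpark sketch of a Delta Lake MERGE whose join condition includes the partition column so only the affected partitions are rewritten; the table path and column names are hypothetical.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
updates = spark.read.parquet("s3://bucket/updates/")              # hypothetical change set
target = DeltaTable.forPath(spark, "s3://bucket/events_delta")    # hypothetical Delta table partitioned by date

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id AND t.date = u.date")  # partition column in the condition narrows the rewrite
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())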
Dynamic filtering for Presto join optimisation (Ori Reshef)
Roman Zeyde explains how to optimize Presto joins in selective use cases.
Roman is a Talpiot graduate and ex-Googler, now working as Presto architect at Varada.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk dives into the technical details of Spark SQL across the entire lifecycle of a query execution. The audience will gain a deeper understanding of Spark SQL and learn how to tune its performance.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
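As a hedged sketch of a few of the knobs this abstract mentions (not from the talk itself), the following PySpark snippet writes partitioned, compressed Parquet and reads it back with filter pushdown enabled; the paths and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://bucket/raw_events/")              # hypothetical input

(df.write
    .option("compression", "snappy")                         # page/column-chunk compression
    .partitionBy("event_date")                               # partitioning scheme for coarse-grained skipping
    .parquet("s3://bucket/events_parquet/"))                 # dictionary encoding is on by default in parquet-mr

spark.conf.set("spark.sql.parquet.filterPushdown", "true")   # min/max row-group skipping (default in recent Spark)
recent = spark.read.parquet("s3://bucket/events_parquet/").where("event_date = '2020-01-01'")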
Parquet performance tuning: the missing guide (Ryan Blue)
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Topics include:
* The tools and techniques Netflix uses to analyze Parquet tables
* How to spot common problems
* Recommendations for Parquet configuration settings to get the best performance out of your processing platform
* The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
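For context (these settings are not claimed by the abstract itself), Spark's unified memory manager splits a single region between execution and storage, and that split is controlled by two standard configs; the values below are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.memory", "8g")              # hypothetical executor heap
    .config("spark.memory.fraction", "0.6")             # share of heap for execution + storage (unified region)
    .config("spark.memory.storageFraction", "0.5")      # part of that region protected from eviction for cached data
    .getOrCreate())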
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
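A minimal sketch of the behavior described above, assuming a hypothetical star schema with a fact table partitioned by date_id; the config key is the standard Spark 3.x switch for this feature and is not taken from the talk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")   # default in Spark 3.x

result = spark.sql("""
    SELECT f.amount, d.region
    FROM fact_sales f                        -- hypothetical fact table, partitioned by date_id
    JOIN dim_date d ON f.date_id = d.date_id
    WHERE d.year = 2020                      -- dimension filter prunes fact partitions at runtime
""")
result.explain()                             # the fact scan should contain a dynamic pruning subquery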
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
Parquet is a very popular column-based format. Spark can automatically filter out useless data by pushing down filters against Parquet file statistics, such as min-max statistics. In addition, Spark users can enable the Parquet vectorized reader to read Parquet files in batches. These features improve Spark performance greatly and save both CPU and IO. Parquet is the default data format of the data warehouse at Bytedance. In practice, we found that Parquet pushdown filters work poorly and too much unnecessary data is read, because the statistics have no discrimination across Parquet row groups (column data is out of order when ETL jobs write the Parquet files).
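As a hedged illustration of the underlying idea (not the Bytedance implementation), ordering data by a commonly filtered column before writing keeps the min/max range of each row group narrow, so pushdown can actually skip row groups; the paths and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.filterPushdown", "true")           # row-group min/max skipping
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")   # batch (vectorized) Parquet reads

df = spark.read.json("s3://bucket/raw_events/")      # hypothetical unsorted input
(df.repartition("event_date")
    .sortWithinPartitions("user_id")                 # hypothetical column that queries usually filter on
    .write
    .partitionBy("event_date")
    .parquet("s3://bucket/events_sorted/"))          # narrow min/max per row group lets pushdown skip data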
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save your storage cost on the cloud storage like S3 and to improve the usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There are more community works to utilize Zstandard to improve Spark. For example, Apache Avro community also supports Zstandard and SPARK-34479 aims to support Zstandard in Spark’s avro file format in Spark 3.2.0.
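As a hedged sketch of the configuration surface these use cases map to (verify support against your Spark/Parquet/ORC versions), the settings below switch shuffle, event log, and data file compression to Zstandard; they are standard Spark config keys, not taken from the talk.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.io.compression.codec", "zstd")            # shuffle/spill compression (use case 1)
    .config("spark.eventLog.compress", "true")
    .config("spark.eventLog.compression.codec", "zstd")      # event log compression (use case 2)
    .config("spark.sql.parquet.compression.codec", "zstd")   # Parquet data files (needs Parquet 1.12+)
    .config("spark.sql.orc.compression.codec", "zstd")       # ORC data files (ORC 1.6+)
    .getOrCreate())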
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has a real need to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use cases of the second generation of ‘Hudi’, a self-contained Apache Spark library to build large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.
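A minimal, hedged sketch of a Hudi upsert via the Spark datasource; the option keys are standard Hudi datasource options, while the table name, key fields, and paths are hypothetical (older Hudi versions use the format name "org.apache.hudi" instead of "hudi").

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://bucket/trip_updates/")                   # hypothetical change records

(df.write.format("hudi")
    .option("hoodie.table.name", "trips")                           # hypothetical table name
    .option("hoodie.datasource.write.recordkey.field", "uuid")      # record key used for upserts
    .option("hoodie.datasource.write.precombine.field", "ts")       # latest value wins on key collisions
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://bucket/hudi/trips/"))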
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Presto on Apache Spark: A Tale of Two Computation Engines (Databricks)
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine.
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
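As a small, hedged illustration of the levers behind those questions (not taken from the talk), the snippet below adjusts shuffle parallelism and reshapes partitions before writing; the numbers, column, and paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "400")        # parallelism after wide transformations

df = spark.read.parquet("s3://bucket/input/")                # hypothetical input
shaped = df.repartition(400, "customer_id")                  # spread work evenly across 400 tasks
(shaped.coalesce(50)                                         # fewer, larger output files without a full shuffle
    .write.parquet("s3://bucket/output/"))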
Large Scale Lakehouse Implementation Using Structured Streaming (Databricks)
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion’s technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion’s data lake executes 4000+ streaming jobs and hosts over 4000 tables in its production data lake on AWS.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
Are you using the fastest query tool for Hadoop? We provide and discuss the latest performance results of the industry-standard TPC-H benchmarks executed across an assortment of open source query tools such as Hive (using MR, TEZ, LLAP, Spark), SparkSQL, Presto, and Drill. Additionally, the performance tests utilize a variety of data sizes, popular storage formats such as ORC, Parquet and Text, and compression codecs.
Properly shaping partitions and your jobs to enable powerful optimizations, eliminate skew and maximize cluster utilization. We will explore various Spark Partition shaping methods along with several optimization strategies including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
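As one hedged example of the salting technique mentioned above (not from the talk), the sketch below spreads a skewed join key over several salt buckets; the input paths, join_key column, and bucket count are hypothetical.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
large_df = spark.read.parquet("s3://bucket/events/")     # hypothetical skewed fact data with a join_key column
small_df = spark.read.parquet("s3://bucket/lookup/")     # hypothetical small dimension with the same join_key

SALT_BUCKETS = 16                                                        # hypothetical skew factor
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
small_salted = small_df.crossJoin(salts)                                 # replicate the small side once per salt value
large_salted = large_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
joined = large_salted.join(small_salted, ["join_key", "salt"])           # one hot key now spreads over many tasks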
Managing big data stored on ADLSgen2/Databricks may be challenging. Setting up security, moving or copying the data of Hive tables or their partitions may be very slow, especially when dealing with hundreds of thousands of files.
Robust and Scalable ETL over Cloud Storage with Apache Spark (Databricks)
The majority of reported Spark deployments are now in the cloud. In such an environment, it is preferable for Spark to access data directly from services such as Amazon S3, thereby decoupling storage and compute. However, there are limitations to object stores such as S3. Chained or concurrent ETL jobs often run into issues on S3 due to inconsistent file listings and the lack of atomic rename support. Metadata performance also becomes an issue when running jobs over many thousands to millions of files.
Speaker: Eric Liang
This talk was originally presented at Spark Summit East 2017.
Understanding Spark Tuning: Strata New York (Rachel Warren)
How to design a Spark Auto Tuner.
The first section covers how to set basic Spark settings, e.g. executor memory, driver memory, dynamic allocation, shuffle settings, number of partitions, etc. The second section covers how to collect historical data about a Spark job, and the third section discusses designing an auto-tuner application which programmatically configures Spark jobs using that historical data.
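A hedged sketch of the kind of baseline configuration the first section refers to; the values are hypothetical placeholders an auto-tuner might pick, not recommendations from the talk.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
    .set("spark.executor.memory", "8g")                    # hypothetical starting point
    .set("spark.driver.memory", "4g")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.maxExecutors", "50")
    .set("spark.shuffle.service.enabled", "true")          # needed for dynamic allocation on YARN
    .set("spark.sql.shuffle.partitions", "400"))
spark = SparkSession.builder.config(conf=conf).getOrCreate()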
Apache Spark is an amazing distributed system, but part of the bargain we’ve made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Tuning Apache Spark is somewhat of a dark art, although thankfully, when it goes wrong, all we tend to lose is several hours of our day and our employer’s money.
Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using both historical and live job information, using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads. Much of the data required to effectively tune jobs is already collected inside of Spark. You just need to understand it. Holden, Rachel, and Anya outline sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They also discuss what kind of tuning can be done statically (e.g., without depending on historic information) and look at Spark’s own built-in components for auto-tuning (currently dynamically scaling cluster size) and how you can improve them.
Even if the idea of building an auto-tuner sounds as appealing as using a rusty spoon to debug the JVM on a haunted supercomputer, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
Also, to be clear, Holden, Rachel, and Anya don’t promise to stop your pager going off at 2:00am, but hopefully this helps.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of a few minutes and chaining of incremental processing in Hadoop.
In the data analytics space, no one can argue that Spark has become the preferred tool for the data scientist, the business analyst and the developer. At Intel, Spark is widely used across the organization to interact with Hive, to process streaming data, and to ingest data from diverse sources to be used in machine learning or data analytics. In this presentation, we want to share how reusable ingestion components using Spark SQL have accelerated our application development phase. We will discuss the challenges we faced at Intel when running Spark-on-YARN applications. Have you spent time wondering why your Spark SQL query was running very slowly, or pondering different methods for ingesting data faster from an RDBMS? We will review Spark-on-YARN deployment and configuration, describe the challenges posed by handling and processing large datasets, and finally share recommendations on how to tune Spark jobs to optimize performance by properly allocating resources.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ... (Chester Chen)
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern to organize big data and democratize access across the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and sizes the files of the resulting data lake using purely open-source file formats, also providing optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is Technical Lead at Uber Data Infrastructure Team
Introduction to Yaetos, an open source tool for data engineers, scientists, and analysts to easily create data pipelines in Python and SQL and put them in production in the AWS cloud. Focus on the Spark component.
Architecting a 35 PB distributed parallel file system for science (Speck&Tech)
ABSTRACT: Perlmutter is the newest supercomputer at Berkeley Lab, California, and features a whopping 35 PB all-flash Lustre file system. Let's dive into its architecture, showing some early performance figures and unique performance considerations, using low-level Lustre tests that achieve over 90% of the theoretical bandwidth of the SSDs, to showcase how Perlmutter achieves the performance of a burst buffer and the resilience of a scratch file system. Lastly, we cover some performance considerations unique to an all-flash Lustre file system, along with tips on how better I/O patterns can make the most of such powerful architectures.
BIO: Alberto Chiusole studied Data Science and Scientific Computing in Trieste when he had the opportunity to spend some months at CERN, in Geneva, benchmarking their Ceph file system against a classic Lustre file system from eXact lab, the HPC consulting company in Trieste he was working for at the time. After Trieste, he worked as a Storage and I/O Software Engineer at Berkeley Lab, California, a national scientific laboratory, where he assisted scientists with improving their I/O and data needs. He now works for Seqera Labs as an HPC DevOps Engineer, focusing on infrastructure support.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
5. Problems when writing to S3: EC
- Eventual consistency problems
- HEAD (404) -> PUT -> GET
- PUT -> PUT -> GET
- PUT -> DELETE -> LIST-PARENT
6. Problems when writing to S3: Rename
- Operation: rename s3://bucket/x to s3://bucket/y, implemented as copy x to y, then delete x
- Copy is slow and depends on file size
- Two calls needed
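To make that cost concrete, here is a minimal, hedged boto3 sketch of how a rename has to be emulated on S3 (bucket and keys are placeholders): one server-side copy whose duration grows with object size, followed by a delete, i.e. two requests per file.

import boto3

s3 = boto3.client("s3")

def s3_rename(bucket, src_key, dst_key):
    # No rename primitive on S3: copy the object, then delete the original.
    s3.copy_object(Bucket=bucket, Key=dst_key,
                   CopySource={"Bucket": bucket, "Key": src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)

s3_rename("bucket", "x", "y")   # "rename" s3://bucket/x to s3://bucket/y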
7. Problems when writing to S3: Failures
- Transient failures of S3 rest calls
- Throttling
9. Two kinds of tables
- Hive table: distributed write to the Hive staging dir, then Hive.loadTable / Hive.loadPartition is called to move the data to the warehouse
- Datasource table: distributed write directly to the final destination
12. Problem: loadPartition is slow
- Hive.replaceFiles / Hive.copyFiles primitive is used to move data from the hive staging dir to the warehouse dir
- Rename done in the hive operations is slow and serialized
- No retries to account for transient failures
13. Problem: loadPartition has EC issues
- EC issues during the copy/move
- A few files written to the hive staging directory may not appear in the listing done on the driver during Hive.replaceFiles
- A few files deleted may appear in the listing (especially in the FOC v1 case)
15. Solution: Robustness
- Listing related
- diff(oldListing, newListing)
- if new files appear, rename them in this iteration
- if existing files disappear, don't try to rename them
- Rename related
- if rename failed, try to rename them in next iteration
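A rough Python sketch of the listing-diff retry loop described on this slide; list_staging and rename_one are hypothetical callables standing in for the S3 listing and rename primitives, and this is not the actual implementation.

def robust_rename(list_staging, rename_one, max_rounds=5):
    # Each round re-lists the (eventually consistent) staging dir, renames what is
    # visible now, and leaves failures to be retried in the next round.
    done = set()
    for _ in range(max_rounds):
        visible = set(list_staging())        # may gain or lose entries between rounds
        to_move = visible - done             # newly appeared files are picked up here
        if not to_move:
            break
        for path in to_move:
            try:
                rename_one(path)
                done.add(path)
            except OSError:
                pass                         # not marked done, so it is retried next round
    # files that disappeared from the listing are simply never attempted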
16. Solution: Performance
- Rename in parallel in a threadpool of 128 threads
- For INSERT INTO, find the N to use for file_copy_N, for all files in the dest dir, in one shot
- Rename the biggest files first so that they don't become the long pole
- Rename the recently modified files last (FIFO on time) so that they get time to vanish
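A minimal, hedged sketch of the parallel, biggest-first rename from this slide; rename_one is a hypothetical callable, files is assumed to be a list of (path, size) pairs, and the real change lives inside the Spark/Hive write path rather than in user code.

from concurrent.futures import ThreadPoolExecutor

def parallel_rename(files, rename_one, num_threads=128):
    # Largest files go first so they do not become the long pole at the end.
    ordered = sorted(files, key=lambda f: f[1], reverse=True)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(rename_one, path) for path, _ in ordered]
        for fut in futures:
            fut.result()                     # surface any rename failure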
17. Solution: Performance numbers
- INSERT OVERWRITE TABLE user PARTITION(date="2011") SELECT userId, firstName, email FROM people
- For example, 100 GB of data spread over 10000 files
- Before optimization: 110 mins
- After optimization: 12 mins (not sensitive to file count)
22. Solution: Write directly to the warehouse
- Use Spark's default write flow for hive tables also
- Avoid using staging_dir
- Uses whatever OutputCommitter is active
- Changes in the Spark code base
- Cases: INSERT INTO/OVERWRITE + Static/dynamic partitions
- Except INSERT OVERWRITE involving dynamic partitions
- Con: Affects warehouse directory immediately on job start
23. Solution: Write directly to the warehouse
- Very good performance gains
- Hive.loadTable / Hive.loadPartition not needed
- Error recovery needs to be done carefully
- On failure, delete all files s3://bucket/path/*/*/*<jobId>*
24. Solution: Performance
- Data: 142 GB (Records - 149994000, Partitions - 9000)
- Each partition had one file
- Direct writes disabled: 7 hr, 30 min
- Direct writes enabled: 24.5 mins
- Spark distributed write was fast in both cases; in the first case the extra move was needed
26. DirectFileOutputCommitter (DFOC)
- Directly write to output location
- Pros: No EC, high performance
- Cons: Speculation and task retries will fail
- Cons: Output is visible before job finish
27. Problem
- If you use DFOC, any task failure will cause job failure
- Empty S3 file is created even on task failure
- Retry will always fail with FileAlreadyExistsException
- 7/08/16 00:33:55 task-result-getter-1 WARN TaskSetManager: Lost task 0.1 in stage 42.0 (TID 5782, 10.23.7.190, executor 10): org.apache.hadoop.fs.FileAlreadyExistsException: s3n://bucket/path/2017/08/15/23/part-00000-017681ee-5206-4163-b4a9-a29cf8a67ab4.json.gz already exists
28. Solution: Overwrite if file already exists
- fs.create(path, false) -> fs.create(path, true)
- Spark changes - different across versions
- Hive changes - orc
- Parquet changes
30. Problem:
- alter table recover partitions is slow
- Algorithm
- Generate list of all partitions and their statistics
- Add partitions to metastore
- Example: two partition keys, 100 values each, 10k partitions in total - takes close to (10+20) mins to recover partitions (Spark 2.1.0)
31. Solution
- Use faster variant of S3 listing, prefix based
- 10 mins for gathering partitions and stats reduced to 10 secs
- Now total time is (10 secs + 20 mins), 33% improvement
- Spark-only changes
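As a hedged illustration of a prefix-based listing (not the actual implementation), the boto3 sketch below walks every key under a hypothetical table prefix in one paginated listing and derives the partition directories from the key paths, instead of issuing a separate listing per partition.

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

partitions = set()
for page in paginator.paginate(Bucket="bucket", Prefix="warehouse/user/"):   # bucket/prefix are hypothetical
    for obj in page.get("Contents", []):
        # key looks like warehouse/user/date=2011/country=US/part-00000
        parts = obj["Key"].split("/")[2:-1]      # keep only the partition directories
        if parts:
            partitions.add(tuple(parts))
print(len(partitions), "partitions discovered")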