How we optimize Spark SQL jobs with parallel and asynchronous I/O
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, ByteDance
Although NVMe has become more and more popular in recent years, a large number of HDDs are still widely used in super-large-scale big data clusters. On an EB-level data platform, I/O cost (including decompression and decoding) contributes a large proportion of Spark jobs' total cost. In other words, I/O is worth optimizing.

At ByteDance, we have made a series of I/O optimizations to improve performance, covering parallel read and asynchronous shuffle. First, we implemented file-level parallel read to improve performance when there are many small files. Second, we designed row-group-level parallel read to accelerate queries in the big-file scenario. Third, we implemented asynchronous spill to improve job performance. Besides, we designed the Parquet column family, which splits a table into a few column families, with different column families stored in different Parquet files. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%.

In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.

  1. How we optimize Spark SQL jobs with parallel and asynchronous I/O. Guo, Jun (jason.guo.vip@gmail.com), Lead of Data Engine Team, ByteDance
  2. Who we are ▪ Data Engine team at ByteDance ▪ We build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine
  3. What we do ▪ Manage Spark SQL / Presto / Hive workloads ▪ Offer an Open API and a self-serve platform ▪ Optimize the Spark SQL / Presto / Hive engines ▪ Design the data architecture for most business lines at ByteDance
  4. Agenda • Spark SQL at ByteDance • Why does I/O matter for Spark SQL • How we boost Spark SQL jobs by parallel and asynchronous I/O • Prospects
  5. Spark SQL at ByteDance
  6. Spark SQL at ByteDance. 2016: small-scale experiments; 2017: ad-hoc workload; 2018: a few ETL pipelines in production; 2019: full-production deployment; 2020: main engine in the DW area; 2021: fully replace Hive for ETL
  7. Why does I/O matter for Spark SQL
  8. Why does I/O matter for Spark SQL. I/O performance has been improved: ▪ NVMe SSDs outperform HDDs by two orders of magnitude ▪ More and more new hardware, such as AEP, has appeared in recent years ▪ Many papers show that ‘I/O is faster than CPU’. I/O is still the bottleneck for big data processing: ▪ TCO is one of the most important factors for storing huge volumes of data ▪ Most servers have many HDDs, especially in Hadoop clusters ▪ I/O cost contributes more than 30% of the total latency of Spark ETL jobs
  9. How we boost Spark SQL jobs by parallel and asynchronous I/O
  10. Parquet -- Columnar Storage Format
  11. Parallel I/O • Spark SQL splits a large Parquet file into a group of splits, each of which contains one or a few row groups • Each task reads its row groups sequentially
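The deck does not show the split planning itself; the following is a minimal sketch of the idea, assuming a simplified RowGroupMeta/ParquetSplit model (not Spark's or parquet-mr's real classes): the row groups of one large file are packed into splits of roughly a target size, and each split is then handed to one task.

    // Assumed sketch: pack row groups of one large Parquet file into splits.
    object RowGroupSplitSketch {
      case class RowGroupMeta(startingPos: Long, totalBytes: Long, rowCount: Long)
      case class ParquetSplit(path: String, rowGroups: Seq[RowGroupMeta])

      def planSplits(path: String,
                     rowGroups: Seq[RowGroupMeta],
                     targetSplitBytes: Long = 128L * 1024 * 1024): Seq[ParquetSplit] = {
        val splits  = scala.collection.mutable.ArrayBuffer.empty[ParquetSplit]
        var current = Vector.empty[RowGroupMeta]
        var bytes   = 0L
        for (rg <- rowGroups) {
          // Close the current split once adding this row group would exceed the target size.
          if (current.nonEmpty && bytes + rg.totalBytes > targetSplitBytes) {
            splits += ParquetSplit(path, current)
            current = Vector.empty
            bytes = 0L
          }
          current :+= rg
          bytes += rg.totalBytes
        }
        if (current.nonEmpty) splits += ParquetSplit(path, current)
        splits.toSeq
      }
    }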
  12. Parallel I/O • Spark SQL can combine a group of small Parquet files into a single split • Each task reads the files in its split sequentially
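Likewise as a rough sketch only (the combining logic is not shown in the deck), small files could be packed into combined splits up to a size limit; FileMeta, CombinedSplit and maxSplitBytes are assumed names.

    // Assumed sketch: largest-first first-fit packing of small files into splits.
    object SmallFileCombineSketch {
      case class FileMeta(path: String, length: Long)
      case class CombinedSplit(files: Seq[FileMeta])

      def combineSmallFiles(files: Seq[FileMeta], maxSplitBytes: Long): Seq[CombinedSplit] = {
        val bins = scala.collection.mutable.ArrayBuffer.empty[scala.collection.mutable.ArrayBuffer[FileMeta]]
        for (f <- files.sortBy(-_.length)) {
          // Put the file into the first split that still has room, else open a new one.
          bins.find(b => b.map(_.length).sum + f.length <= maxSplitBytes) match {
            case Some(b) => b += f
            case None    => bins += scala.collection.mutable.ArrayBuffer(f)
          }
        }
        bins.map(b => CombinedSplit(b.toSeq)).toSeq
      }
    }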
  13. Parallel I/O. I/O and computation in a single thread: ▪ I/O and computation are handled sequentially by the same thread ▪ Tuples in a single task are computed sequentially ▪ I/O for different files or row groups is handled sequentially. I/O and computation in separate threads: ▪ Introduce a buffer to separate I/O from computation ▪ I/O and computation are handled in separate threads ▪ I/O for different files or row groups can be done in parallel
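A minimal producer/consumer sketch of the buffer idea above, assuming a generic Batch type and per-file (or per-row-group) reader closures rather than the real reader: I/O threads prefetch, decompress and decode batches in parallel, while the task thread only computes.

    // Assumed sketch of the buffer between I/O and computation.
    import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
    import java.util.concurrent.atomic.AtomicInteger

    object ParallelReadSketch {
      type Batch = Array[AnyRef] // stand-in for a columnar batch

      def readSplit(readers: Seq[() => Iterator[Batch]], // one reader per file or row group
                    compute: Batch => Unit,
                    ioThreads: Int = 4,
                    bufferSize: Int = 16): Unit = {
        if (readers.isEmpty) return
        val buffer    = new ArrayBlockingQueue[Option[Batch]](bufferSize)
        val pool      = Executors.newFixedThreadPool(ioThreads)
        val remaining = new AtomicInteger(readers.size)

        // Producers: read, decompress and decode in parallel, pushing batches into the buffer.
        readers.foreach { open =>
          pool.submit(new Runnable {
            def run(): Unit = {
              open().foreach(b => buffer.put(Some(b)))
              if (remaining.decrementAndGet() == 0) buffer.put(None) // end-of-split marker
            }
          })
        }

        // Consumer: the task thread only computes; it waits only when the buffer is empty.
        Iterator.continually(buffer.take()).takeWhile(_.isDefined).foreach(b => compute(b.get))
        pool.shutdown()
        pool.awaitTermination(1, TimeUnit.MINUTES)
      }
    }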
  14. Parallel I/O: file-level parallel I/O and row-group-level parallel I/O
  15. Parallel I/O • Column level parallel I/O o Split a logical Parquet file into a group of column families, each of which is a physical Parquet file o Each column family contains a few columns o Spark SQL reads different column families in parallel
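The column-family reader itself is internal to this Spark fork. Purely as a high-level illustration, if each family were materialized as its own Parquet files sharing a row id, the logical table could be reassembled as below; the paths and the row_id column are assumptions, and the deck's implementation stitches families together inside the reader rather than with a join.

    // High-level illustration only; paths and the row_id column are assumptions.
    import org.apache.spark.sql.SparkSession

    object ColumnFamilySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("column-family-sketch").getOrCreate()

        // Each column family is a separate physical Parquet file set that holds a
        // subset of the table's columns plus a shared row id.
        val hotColumns  = spark.read.parquet("/warehouse/t/cf_hot")   // frequently read columns
        val coldColumns = spark.read.parquet("/warehouse/t/cf_cold")  // wide, rarely read columns

        // Reassemble the logical table; the two scans run as independent, parallel tasks.
        val logicalTable = hotColumns.join(coldColumns, "row_id")
        logicalTable.show()

        spark.stop()
      }
    }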
  16. Asynchronous I/O
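The body of this slide is not captured in the transcript. As an assumed illustration of the asynchronous spill mentioned in the abstract, a task can hand a full in-memory buffer to a background writer and keep computing into a fresh buffer instead of blocking on disk.

    // Assumed illustration only: asynchronous spill via a background writer thread.
    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    object AsyncSpillSketch {
      private val spillPool = Executors.newSingleThreadExecutor()
      private implicit val spillEc: ExecutionContext = ExecutionContext.fromExecutorService(spillPool)

      // Placeholder for serializing, compressing and writing one spill file.
      def writeSpillFile(records: Seq[String]): Unit = {
        val file = java.nio.file.Files.createTempFile("spill-", ".bin")
        java.nio.file.Files.write(file, records.mkString("\n").getBytes("UTF-8"))
      }

      def main(args: Array[String]): Unit = {
        var buffer        = Vector.empty[String]
        var pendingSpills = List.empty[Future[Unit]]

        (1 to 100000).foreach { i =>
          buffer :+= s"record-$i"
          if (buffer.size >= 10000) {                      // buffer is full: spill in the background
            val toSpill = buffer
            pendingSpills ::= Future(writeSpillFile(toSpill))
            buffer = Vector.empty                          // computation continues immediately
          }
        }
        if (buffer.nonEmpty) pendingSpills ::= Future(writeSpillFile(buffer))

        pendingSpills.foreach(Await.result(_, Duration.Inf)) // wait for spills before finishing the task
        spillPool.shutdown()
      }
    }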
  17. The future work
  18. The future work • I/O o Adaptive column family o Smart cache • Computation o Vectorized computation o Native engine
  19. Feedback. Your feedback is important to us. Don’t forget to rate and review the sessions.
