Degrading Performance? You Might be Suffering From the Small Files Syndrome


Whether your data pipelines handle real-time event-driven streams, near-real-time streams, or batch processing jobs, working with a massive amount of data made up of small files, particularly Parquet, will degrade your system's performance.

A small file is one that is significantly smaller than the storage block size. Yes, even object stores such as Amazon S3 and Azure Blob Storage have a minimum block size. Storing objects that are much smaller than that block size wastes space on disk, because the storage layer is optimized for fast reads and writes at the block granularity.

To understand why this happens, you first need to understand how cloud storage works with the Apache Spark engine. In this session, you will learn about Parquet, the storage API calls, how they work together, why small files are a problem, and how you can leverage Delta Lake for a simpler, cleaner solution.
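
To make the claim concrete, here is a minimal PySpark sketch (the bucket paths and the toy dataset are hypothetical illustrations, not taken from the talk) showing that each write task emits one Parquet file, so an over-parallelized write scatters the same data across thousands of tiny files while a coalesced write keeps it in a few block-sized ones:

    # Minimal sketch of the small-files problem: one output file per write partition.
    # The s3:// paths are placeholders; any filesystem Spark can write to behaves the same way.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("small-files-demo").getOrCreate()

    df = spark.range(0, 1_000_000)  # toy data standing in for ingested events

    # Over-parallelized write: 2,000 partitions produce 2,000 tiny Parquet files.
    df.repartition(2000).write.mode("overwrite").parquet("s3://my-bucket/events/small_files/")

    # Coalescing first stores the same rows in a handful of larger files,
    # which later means far fewer read RPCs against the object store.
    df.coalesce(4).write.mode("overwrite").parquet("s3://my-bucket/events/compacted/")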


  1. Photo by Priscilla Du Preez on Unsplash
  2. Animation by Mike Mk and lottiefiles: https://lottiefiles.com/user/775169
  3. Failed Tasks in Spark UI - Executors @adipolak
  4. Client-Request-ID=------ Retry policy did not allow for a retry: , HTTP status code=Unknown, Exception=HTTPSConnectionPool(host='-----.net', port=443): Max retries exceeded with url: /xxxxxxx?restype=container&comp=list (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at xxxxxxxx>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',)). HTTPSConnectionPool(host='your_account.blob.core.windows.net', port=443): Read timed out. (read timeout=[your timeout]) Exceptions in Apache Spark Executors logs @adipolak
  5. On-prem Public Cloud
  6. Degrading Performance? You Might be Suffering From the Small Files Syndrome - Adi Polak, Microsoft @adipolak
  7. About Me: M.Sc & B.Sc - BGU University; ML Researcher @ DT & BGU Cyber Security lab; Sr. Big Data Engineer @ Akamai; Sr. Software Developer & Cloud Advocate @ Microsoft @adipolak https://www.linkedin.com/in/adi-polak-68548365/
  8. Agenda § The Problem § Why it Happens § Detect and Mitigate § Delta Lake vs Parquet Demo @adipolak
  9. Why it Happens? @adipolak
  10. Query Life Cycle Abstraction - storage, storage, storage @adipolak
  11. Query Life Cycle Abstraction - storage @adipolak
  12. How Do Read and Write Work? @adipolak
  13. File size matters! 1 million files of 60 bytes each ≈ 0.06 GB, and reading them == 1M RPCs; 1 file of 0.06 GB ≈ 60 MB, and reading it == 1 RPC @adipolak
  14. Detect and Mitigate? @adipolak
  15. Where can it happen? • Event streams - events from IoT devices, servers, or applications are translated into KB-scale JSON files during ingestion • Over-parallelized Apache Spark jobs • Over-partitioned Hive tables @adipolak
  16. What to check? • Data skew - Hive partition file sizes • Spark job writers in the Spark History Server UI • Ingestion file size (see the file-size listing sketch after this transcript) @adipolak
  17. Mitigate • Use a file hierarchy - source/api_type/yyyy/mm/dd/hh/mm • Design partitions with usage in mind • Repartition vs. coalesce • Databricks Auto Optimize - SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true) • Delta Lake OPTIMIZE performance - compaction (bin-packing), ZORDER BY, delta.targetFileSize, delta.tuneFileSizesForRewrites (see the code sketch after this transcript) @adipolak
  18. Demo – optimizing read queries @adipolak
  19. @adipolak
  20. Summary § The Problem § Why it Happens § Detect and Mitigate § Delta Lake vs Parquet Demo @adipolak
  21. “Intellectual growth should commence at birth and cease only at death.” ― Albert Einstein @adipolak
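
As a companion to slide 16 ("What to check?"), here is a rough sketch for checking ingestion file sizes under a table location. It goes through the Hadoop FileSystem API that Spark ships with (via the py4j gateway); the path is a hypothetical placeholder, and for partitioned tables you would repeat the listing per partition directory:

    # Rough sketch: list Parquet file sizes under one directory to spot the small-files pattern.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    path = "s3://my-bucket/events/small_files/"  # placeholder table/partition location
    jvm_path = sc._jvm.org.apache.hadoop.fs.Path(path)
    fs = jvm_path.getFileSystem(sc._jsc.hadoopConfiguration())

    # Non-recursive listing of the data files in this directory.
    sizes = [f.getLen() for f in fs.listStatus(jvm_path)
             if f.getPath().getName().endswith(".parquet")]
    avg_kib = sum(sizes) / max(len(sizes), 1) / 1024
    print(f"{len(sizes)} Parquet files, average size {avg_kib:.1f} KiB")

If the average size is in the KB range rather than tens or hundreds of MB, you are looking at the syndrome described above.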
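
And for slide 17 ("Mitigate"), a hedged sketch of the Delta Lake knobs named on the slide, assuming a Databricks or Delta Lake environment where OPTIMIZE and ZORDER BY are available; the table name events and the clustering column event_type are placeholders, not from the deck:

    # Sketch of the compaction options from the "Mitigate" slide (table/column names are placeholders).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Auto Optimize: compact small files as they are written (the slide's TBLPROPERTIES).
    spark.sql("""
        ALTER TABLE events SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact'   = 'true'
        )
    """)

    # Optional: steer the compacted file size (value in bytes here, roughly 128 MB).
    spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.targetFileSize' = '134217728')")

    # Manual compaction (bin-packing) plus clustering for faster selective reads.
    spark.sql("OPTIMIZE events ZORDER BY (event_type)")

For plain Parquet directories without Delta Lake, the fallback is the repartition/coalesce rewrite shown in the first sketch.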
