
Apache Spark's Built-in File Sources in Depth

In the Spark 3.0 release, all the built-in file source connectors (including Parquet, ORC, JSON, Avro, CSV, and Text) are re-implemented using the new Data Source API V2. We will give a technical overview of how Spark reads and writes these file formats based on the user-specified data layouts. The talk will also explain the differences between Hive SerDe and native connectors, and share experience on how to tune the connectors and choose the best data layouts for the best performance.

  1. 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Gengliang Wang, Databricks Apache Spark’s Built-in File Sources in Depth #UnifiedDataAnalytics #SparkAISummit
  3. 3. About me • Gengliang Wang (GitHub: gengliangwang) • Software Engineer at Databricks • Apache Spark committer 3 #UnifiedDataAnalytics #SparkAISummit
  4. 4. Agenda • File formats • Data layout • File reader internals • File writer internals 4
  5. 5. 5
  6. 6. Built-in File formats • Column-oriented – Parquet – ORC • Row-oriented – Avro – JSON – CSV – Text – Binary 6
  7. 7. Column vs. Row 7 Example table “music” (columns: artist, genre, year, album): Elvis Presley / Country / 1965 / Love Me Tender; The Beatles / Rock / 1966 / Revolver; Nirvana / Rock / 1991 / Nevermind
  8. 8. Column vs. Row Column Elvis Presley, The Beatles, Nirvana Country, Rock, Rock 1965, 1966, 1991 Love Me Tender, Revolver, Nevermind Row Elvis Presley, Country, 1965, Love Me Tender The Beatles, Rock, 1966, Revolver Nirvana, Rock, 1991, Nevermind 8 artist genre year album
  9. 9. Column-oriented formats Column Elvis Presley, The Beatles, Nirvana Country, Rock, Rock 1965, 1966, 1991 Love Me Tender, Revolver, Nevermind 9 Artist Genre Year Album SELECT album FROM music WHERE artist = 'The Beatles';
  10. 10. Row-oriented formats 10 artist genre year album SELECT * FROM music; INSERT INTO music ... Row Elvis Presley, Country, 1965, Love Me Tender The Beatles, Rock, 1966, Revolver Nirvana, Rock, 1991, Nevermind
  11. 11. Column vs. Row Column • Read-heavy workloads • Queries that require only a subset of the table columns Row • Write-heavy workloads – event-level data • Queries that require most or all of the table columns 11
  12. 12. Built-in File formats • Column-oriented – Parquet – ORC • Row-oriented – Avro – JSON – CSV – Text – Binary 12
  13. 13. Parquet 13 File Structure • A list of row groups • Footer – Metadata of each row group • Min/max of each column – Schema
  14. 14. Parquet 14 Row group • A list of column chunks – One column chunk per column Column chunk • A list of data pages • An optional dictionary page
  15. 15. Parquet 15 Predicates pushed down • a = 1 • a < 1 • a > 1 • … Row group skipping • Footer stats (min/max) • Dictionary pages of column chunks
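A minimal sketch of what the pushdown looks like from the API side (the path and column are illustrative): with spark.sql.parquet.filterPushdown enabled, which is the default, simple comparison predicates are handed down to the Parquet reader, which can then skip row groups using the footer min/max statistics and dictionary pages.

    // Illustrative path and column; filter pushdown is on by default.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    val albums = spark.read.parquet("/data/music")
      .where("year > 1970")   // simple comparison, eligible for pushdown
      .select("album")
    albums.explain()          // the FileScan node should list PushedFilters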
  16. 16. ORC 16 File Structure • A list of stripes • Footer – column-level aggregates count, min, max, and sum – The number of rows per stripe
  17. 17. ORC 17 Stripe Structure • Indexes – Start positions of each column – Min/max of each column • Data Supports Stripe skipping
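The ORC reader has an analogous switch. A hedged sketch, assuming the native ORC implementation and an illustrative path:

    // With the native reader and filter pushdown enabled, stripes whose
    // indexes show no matching rows can be skipped.
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    val recent = spark.read.orc("/data/music_orc").where("year > 1990")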
  18. 18. Built-in File formats • Column-oriented – Parquet – ORC • Row-oriented – Avro – JSON – CSV – Text – Binary 18
  19. 19. Avro • Compact • Fast • Robust support for schema evolution – Adding fields – Removing fields – Renaming fields (with aliases) – Changing default values – Changing field docs 19
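A hedged sketch of schema evolution on read (the reader schema and path are illustrative), assuming the built-in Avro data source is available: the avroSchema option supplies a reader schema, and Avro's schema-resolution rules (aliases, default values) reconcile it with the schema the files were written with.

    // Illustrative reader schema: "style" is a renamed "genre" (via an alias)
    // and falls back to a default value when missing in older files.
    val readerSchema = """{
      "type": "record", "name": "music", "fields": [
        {"name": "artist", "type": "string"},
        {"name": "style",  "type": "string", "aliases": ["genre"], "default": "unknown"}
      ]}"""
    val df = spark.read.format("avro")
      .option("avroSchema", readerSchema)
      .load("/data/music_avro")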
  20. 20. Semi-structured text formats: JSON & CSV • Excellent write path performance but slow on the read path – Good candidates for collecting raw data (e.g., logs) • Subject to inconsistent and/or malformed records • Schema inference provided by Spark – Sampling-based – Handy for exploratory scenarios but can be inaccurate 20
  21. 21. JSON • JSON object: map or struct? – Spark schema inference always treats JSON objects as structs – Watch out for arbitrary number of keys (may OOM executors) – Specify an accurate schema if you decide to stick with maps 21
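A sketch of the advice above (file path and field names are illustrative): declaring the JSON object as a MapType keeps the schema down to a single column no matter how many distinct keys appear, whereas inference would create one struct field per key.

    import org.apache.spark.sql.types._

    // "tags" holds arbitrary key/value pairs; model it as a map, not a struct.
    val schema = StructType(Seq(
      StructField("id",   LongType),
      StructField("tags", MapType(StringType, StringType))))

    val events = spark.read.schema(schema).json("/data/events.json")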
  22. 22. Avro vs. JSON Avro • Compact and fast • Accurate schema inference • High data quality and robust schema evolution support; a good choice for organization-scale development – Event data from Web/iOS/Android clients JSON • Repeats every field name with every single record • Schema inference can be inaccurate • Lightweight, easy development and debugging 22
  23. 23. CSV • Often used for handling legacy data providers & consumers – Lacks a standard file specification • Separators, escaping, quoting, etc. – Lacks support for nested data types 23
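Because there is no single CSV specification, the parsing options usually have to be spelled out to match whatever the producer emits. A sketch with illustrative option values:

    val legacy = spark.read
      .option("header", "true")      // first line contains column names
      .option("sep", ";")            // non-default separator
      .option("quote", "\"")
      .option("escape", "\\")
      .option("mode", "PERMISSIVE")  // keep malformed records instead of failing
      .csv("/data/legacy.csv")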
  24. 24. Raw text files Arbitrary line-based text files • Splitting files into lines using spark.read.text() – Keep your lines a reasonable size • Limited schema with only one field – value (StringType) 24
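A one-line sketch (the path is illustrative): every line becomes one row in the single value column.

    val logs = spark.read.text("/data/raw_logs")   // schema: value (string)
    logs.filter(logs("value").contains("ERROR")).show(5, truncate = false)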
  25. 25. Binary • New file data source in Spark 3.0 • Reads binary files and converts each file into a single record that contains the raw content and metadata of the file 25
  26. 26. Binary Schema • path (StringType) • modificationTime (TimestampType) • length (LongType) • content (BinaryType) 26
  27. 27. Binary To read all JPG files recursively from the input directory and ignore partition discovery: 27 spark.read.format("binaryFile") .option("pathGlobFilter", "*.jpg") .option("recursiveFileLookup", "true") .load("/path/to/dir")
  28. 28. Agenda • File formats • Data layout • File reader internals • File writer internals 28
  29. 29. Data layout • Partitioning • Bucketing 29
  30. 30. Partitioning • Coarse-grained data skipping • Available for both persisted tables and raw directories • Automatically discovers Hive style partitioned directories 30
  31. 31. Partitioning Filter predicates • Use simple filter predicates containing partition columns to leverage partition pruning • Avoids unnecessary partition discovery (esp. valuable for S3) 31 SELECT * FROM music WHERE year = 2019 AND genre = 'folk'
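The same idea through the DataFrame API, as a sketch against a hypothetical table partitioned by year and genre: keeping the predicates as plain comparisons on the partition columns lets Spark prune partitions before any data files are listed or read.

    val folk2019 = spark.table("music")
      .where("year = 2019 AND genre = 'folk'")
    folk2019.explain()   // the FileScan node should show the applied PartitionFilters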
  32. 32. Partitioning Check our blog post https://tinyurl.com/y2eow2rx for more details 32
  33. 33. Partitioning SQL CREATE TABLE ratings USING PARQUET PARTITIONED BY (year, genre) AS SELECT artist, rating, year, genre FROM music 33 DataFrame API spark .table("music") .select('artist, 'rating, 'year, 'genre) .write .format("parquet") .partitionBy("year", "genre") .saveAsTable("ratings")
  34. 34. Partitioning: new feature in Spark 3.0 34
  35. 35. Partitioning: new feature in Spark 3.0 35 More details in the talk Dynamic Partition Pruning in Apache Spark Bogdan Ghit & Juliusz Sompolski
  36. 36. Partitioning Avoid excessive partitions • Stresses the metastore when getting partition columns and values • Stresses the file system when listing files under partition directories – Esp. slow on S3 • Suggestions – Avoid using too many partition columns – Avoid using partition columns with too many distinct values • Try hashing the values • E.g., partition by the first letter of the first name rather than the full first name – Try Delta Lake 36
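A sketch of the hashing/derivation suggestion (table and column names are hypothetical): partition by a derived low-cardinality column instead of the raw high-cardinality one.

    import org.apache.spark.sql.functions._

    spark.table("users")
      .withColumn("name_initial", upper(substring(col("first_name"), 1, 1)))
      .write
      .format("parquet")
      .partitionBy("name_initial")   // a few dozen partitions instead of one per name
      .saveAsTable("users_by_initial")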
  37. 37. Partitioning Delta Lake’s transaction logs are analyzed in parallel by Spark • No need to stress the Hive Metastore for partition values or the file system for listing files • Able to handle billions of partitions and files 37
  38. 38. Data layout • Partitioning • Bucketing 38
  39. 39. Bucketing 39 Sort merge join without bucketing: File Scan → Exchange → Sort on both sides. The Exchange (shuffle) is slow.
  40. 40. Bucketing 40 Bucketing pre-shuffles and pre-sorts the data at write time, doing the work of the Exchange and Sort steps once.
  41. 41. Bucketing 41 With both inputs bucketed, the plan becomes File Scan → Sort merge join. Use bucketing on tables which are frequently JOINed with the same keys.
  42. 42. Bucketing SQL CREATE TABLE ratings USING PARQUET PARTITIONED BY (year, genre) CLUSTERED BY (rating) SORTED BY (rating) INTO 5 BUCKETS AS SELECT artist, rating, year, genre FROM music 42 DataFrame API spark .table("music") .select('artist, 'rating, 'year, 'genre) .write .format("parquet") .partitionBy("year", "genre") .bucketBy(5, "rating") .sortBy("rating") .saveAsTable("ratings")
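A sketch of the payoff (table names are hypothetical): if both sides of a join were written with the same bucketBy specification on the join key, the sort-merge join can run without shuffling either side, which explain() makes visible.

    // Both tables assumed saved with .bucketBy(5, "rating").sortBy("rating")
    val joined = spark.table("ratings_2018")
      .join(spark.table("ratings_2019"), "rating")
    joined.explain()   // no Exchange should appear below the SortMergeJoin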
  43. 43. Bucketing In combo with columnar formats: works best when the searched columns are sorted • Bucketing – Per-bucket sorting • Columnar formats – Efficient data skipping based on min/max statistics 43
  44. 44. Agenda • File formats • Data layout • File reader internals • File writer internals 44
  45. 45. Query example 45 spark.read.parquet("/data/events") .where("year = 2019") .where("city = 'Amsterdam'") .select("timestamp")
  46. 46. Goal • Understand data and skip unneeded data • Split file into partitions for parallel read 46
  47. 47. Data layout 47 The events directory contains partition directories year=2016 … year=2019, each holding parquet files; a parquet file is made of row groups 0 … N (columns: city, timestamp, OS, browser, other columns) plus a Footer
  48. 48. Partition pruning 48 spark.read.parquet("/data/events").where("year = 2019").where("city = 'Amsterdam'").select("timestamp").collect() reads only the year=2019 partition directory
  49. 49. Row group skipping 49 Within the matching files, footer statistics (min/max) let the reader skip row groups that cannot contain city = 'Amsterdam'
  50. 50. Column pruning 50 Only the city and timestamp column chunks are read; all other columns are skipped
  51. 51. Goal • Understand data and skip unneeded data • Split file into partitions for parallel read 51
  52. 52. Partitions of the same size 52 Files on HDFS (File 0, File 1, File 2) are split into Spark partitions of roughly equal size (Partition 0, Partition 1, Partition 2)
  53. 53. Driver: plan input partitions 53 Spark Driver: 1. Split files into partitions (Partition 0, Partition 1, Partition 2)
  54. 54. Driver: plan input partitions 54 Spark Driver: 1. Split files into partitions 2. Launch read tasks on Executor 0, Executor 1, Executor 2
  55. 55. Executor: read distributedly 55 1. Split files into partitions 2. Launch read tasks 3. Executors perform the actual reading
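How large each planned file partition gets is configurable. A sketch using the two main knobs (the path is illustrative):

    // maxPartitionBytes caps how much data one input partition may hold (default 128 MB);
    // openCostInBytes models the cost of opening a file when packing small files together.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  // 128 MB
    spark.conf.set("spark.sql.files.openCostInBytes", "4194304")      // 4 MB
    val events = spark.read.parquet("/data/events")
    println(events.rdd.getNumPartitions)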
  56. 56. Agenda • File formats • Data layout • File reader internals • File writer internals 56
  57. 57. Query example 57 val trainingData = spark.read.parquet("/data/events") .where("year = 2019") .where("city = 'Amsterdam'") .select("timestamp") trainingData.write.parquet("/data/results")
  58. 58. Goal • Parallel • Transactional 58
  59. 59. 59 Spark Driver: 1. Write tasks launched on Executor 0, Executor 1, Executor 2
  60. 60. 60 1. Write task 2. Write to files: each task writes to a different temporary path (Executor 0 → part-00000, Executor 1 → part-00001, Executor 2 → part-00002)
  61. 61. Everything should be temporary 61 Output is first written under results/_temporary
  62. 62. Files should be isolated between jobs 62 Each job gets its own directory: results/_temporary/<job id>/
  63. 63. Task output is also temporary results/_temporary/<job id>/_temporary
  64. 64. Files should be isolated between tasks 64 Each task attempt writes its parquet files under results/_temporary/<job id>/_temporary/<task attempt id>/
  65. 65. 65 1. Write task 2. Write to files: each task writes to a different temporary path (part-00000, part-00001, part-00002) 3. Commit task
  66. 66. File layout 66 results/_temporary/<job id>/ contains _temporary/<task attempt id>/ (parquet files, in progress) and <task id>/ (parquet files, committed)
  67. 67. 67 1. Write task 2. Write to files: each task writes to a different temporary path. If a task aborts: 3. Abort task
  68. 68. File layout 68 On task abort, delete the task output path results/_temporary/<job id>/_temporary/<task attempt id>/
  69. 69. 69 1. Write task 2. Write to files. If a task aborts: 3. Abort task 4. Relaunch task
  70. 70. 70 1. Write task 2. Write to files 3. Abort task 4. Relaunch task 5. Commit task. Distributed and transactional write
  71. 71. 71 1. Write task 2. Write to files 3. Abort task 4. Relaunch task 5. Commit task 6. Commit job. Distributed and transactional write
  72. 72. File layout 72 After the job commits, results/ contains the final parquet files
  73. 73. Almost transactional 73 Spark stages output files to a temporary location. Commit? Move them to the final locations. Abort? Delete the staged files. The window of failure is small. See more on concurrent reads and writes in cloud storage: Transactional writes to cloud storage – Eric Liang; Diving Into Delta Lake: Unpacking The Transaction Log – Databricks blog
  74. 74. Recap File formats • Column-oriented – Parquet – ORC • Row-oriented – Avro – JSON – CSV – Text – Binary 74 Data layout • Partitioning • Bucketing File reader • Understand data and skip unneeded data • Split files into partitions for parallel read File writer internals • Parallel • Transactional
  75. 75. Thank you! Q & A
  76. 76. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
