
Producing Spark on YARN for ETL

  1. Productionizing Spark on YARN for ETL (Ashwin Shankar, Nezih Yigitbasi)
  2. Scale
  3. Netflix Key Business Metrics • 81+ million members • Global • 1000+ devices supported • 125 million hours / day
  4. Netflix Key Platform Metrics • 40 PB data warehouse • Read 3 PB • Write 300 TB • 700B events
  5. Outline ● Big Data Platform Architecture ● Technical Challenges ● ETL
  6. Big Data Platform Architecture
  7. Data Pipeline [Diagram: event data flows from cloud apps through Kafka and Ursula to S3 in ~1 min; dimension data flows from Cassandra SSTables through Aegisthus to S3 daily]
  8. [Diagram of the platform layers: Storage (S3, Parquet), Compute/Execution, Services (Metadata, Transport, Quality, Visualization), and Tools (Big Data API, Big Data Portal, Notebooks, Pig, Workflow Vis, Job/Cluster Vis)]
  9. Spark on YARN at Netflix • 3000 EC2 nodes across two clusters (d2.4xlarge: 16 vcores, 120 GB each) • Multiple Spark versions • Spark shares the same infrastructure with MapReduce jobs [Diagram: Spark (S) and MapReduce (M) containers mixed on each node]
  10. Multi-version Support • Application: $ spark-shell --ver 1.6 … • Configuration: s3://…/spark-1.6.tar.gz, s3://…/spark-2.0.tar.gz, s3://…/spark-2.0-unstable.tar.gz, s3://…/1.6/spark-defaults.conf …, s3://…/prod/yarn-site.xml, s3://…/prod/core-site.xml
  11. Technical Challenges
  12. [Diagram: YARN Resource Manager, Node Manager, Spark AM, RDD]
  13. Custom Coalescer Support [SPARK-14042] • coalesce() can only merge down to a given number of partitions; how do you merge by size instead? • CombineFileInputFormat with Hive • Support custom partition coalescing strategies (see the sketch below)
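For illustration, a minimal sketch of a size-based coalescer against the PartitionCoalescer developer API that SPARK-14042 introduced (Spark 2.x). The sizeOf function is an assumption supplied by the caller, since a generic Partition does not expose its size; for file-based RDDs it can be derived from the underlying input split.

```scala
// Minimal sketch, assuming Spark 2.x's PartitionCoalescer developer API
// (added by SPARK-14042). `sizeOf` is a caller-supplied assumption.
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.Partition
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

class SizeBasedCoalescer(sizeOf: Partition => Long, maxBytesPerGroup: Long)
    extends PartitionCoalescer with Serializable {

  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    val groups = ArrayBuffer[PartitionGroup]()
    var current = new PartitionGroup()
    var currentBytes = 0L
    for (p <- parent.partitions) {
      // Close out the current group once adding p would exceed the size budget.
      if (current.partitions.nonEmpty && currentBytes + sizeOf(p) > maxBytesPerGroup) {
        groups += current
        current = new PartitionGroup()
        currentBytes = 0L
      }
      current.partitions += p
      currentBytes += sizeOf(p)
    }
    if (current.partitions.nonEmpty) groups += current
    groups.toArray
  }
}

// Usage sketch: merge partitions into ~128 MB groups without a shuffle.
// rdd.coalesce(rdd.getNumPartitions, shuffle = false,
//   Some(new SizeBasedCoalescer(mySizeFn, 128L * 1024 * 1024)))
```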
  14. UnionRDD Parallel Listing [SPARK-9926] • Parent RDD partitions are listed sequentially • Slow for tables with lots of partitions • Fix: parallelize the listing of parent RDD partitions (see the sketch below)
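The essence of the fix is to compute each parent's partition list concurrently rather than in a sequential loop. A rough sketch of the idea (not the actual Spark patch), using Scala parallel collections:

```scala
// Rough sketch of the idea behind SPARK-9926, not the actual patch:
// list every parent RDD's partitions concurrently.
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

def listPartitionsInParallel(parents: Seq[RDD[_]]): Seq[Array[Partition]] = {
  // .par fans the work out over a fork-join pool instead of walking
  // the parents one at a time; .seq converts back to a plain Seq.
  parents.par.map(_.partitions).seq
}
```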
  15. [Diagram: YARN Resource Manager, Node Manager, Spark AM, RDD, S3 Filesystem]
  16. Optimize S3 Listing Performance [HADOOP-12810] • Removes an unnecessary getFileStatus() call • SPARK-9926 and HADOOP-12810 together yield faster job startup • ~20x speedup in input split calculation
  17. Hadoop Output Committer • Each task writes output to a temp directory • On commit, the first successful task attempt's temp directory is renamed to the final destination • Problems with S3: a rename is really a copy + delete, and S3 is eventually consistent
  18. S3 Output Committer • Each task writes output to local disk • On commit, the first successful task attempt's output is copied to S3 • Advantages: avoids the redundant S3 copy and avoids eventual consistency (see the sketch below)
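A minimal sketch of the commit idea described on these two slides; this is not Netflix's actual committer, and the class name and paths are illustrative. Tasks write to local disk, and commit uploads the result to S3 via Hadoop's FileSystem API:

```scala
// Sketch of a direct-to-S3 commit using the Hadoop FileSystem API.
// Illustrative only; not Netflix's actual committer implementation.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class LocalThenS3Committer(localDir: String, finalDest: String, conf: Configuration) {

  // Each task writes its output under a task-local directory on disk.
  def taskOutputPath(taskId: String): Path = new Path(s"$localDir/$taskId")

  // On commit of the first successful attempt, upload the local output to S3.
  // A plain PUT avoids the copy + delete that an S3 "rename" implies, and
  // the committer never lists S3 to find the output, so eventually
  // consistent listings are not a problem.
  def commitTask(taskId: String): Unit = {
    val dest = new Path(finalDest)
    val fs = dest.getFileSystem(conf)
    fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */,
      taskOutputPath(taskId), dest)
  }
}
```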
  19. [Diagram: YARN Resource Manager, Node Manager, Spark AM, RDD, S3 Filesystem, Dynamic Allocation]
  20. Poor Broadcast Read Performance [SPARK-13328] • Affects broadcast joins/variables • With dynamic allocation, broadcast replicas can be removed as executors are released: 16/02/13 01:02:27 WARN BlockManager: Failed to fetch remote block broadcast_18_piece0 (failed attempt 70) … 16/02/13 01:02:27 INFO TorrentBroadcast: Reading broadcast variable 18 took 1051049 ms • Fix: refresh replica locations from the driver after multiple fetch failures
  21. Incorrect Locality Optimization [SPARK-13779] • Spark cancels & resends pending container requests if the locality preference is no longer needed, or if no locality preference was set • There is no locality information with S3 • Fix: do not cancel requests that have no locality preference
  22. [Diagram: YARN Resource Manager, Node Manager, Spark AM, RDD, S3 Filesystem, Dynamic Allocation, Parquet R/W]
  23. Parquet Dictionary Filtering [PARQUET-384*] [Diagram: columns A, B, C, D with values a1…aN through d1…dN and the per-column dictionaries used for filtering; from "Analytic Data Storage in Hadoop", Ryan Blue]
  24. Parquet Dictionary Filtering [PARQUET-384*] [Chart: avg. completion time in minutes, dictionary filtering disabled vs. enabled with 64 MB and 1 GB splits; ~8x and ~18x speedups when enabled]
  25. How to Enable Dictionary Filtering? • spark.sql.hive.convertMetastoreParquet = true (enable the native Parquet read path) • parquet.filter.statistics.enabled = true (enable stats filtering) • parquet.filter.dictionary.enabled = true (enable dictionary filtering) • spark.sql.parquet.filterPushdown = true (enable the Parquet filter pushdown optimization) • spark.sql.parquet.mergeSchema = false (disable schema merging) • spark.sql.hive.convertMetastoreParquet.mergeSchema = false (whether to use the Hive SerDe instead of the built-in Parquet support)
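One way to apply these settings is at session construction; a sketch using the Spark 2.x SparkSession API (the parquet.* keys are reader options and live in the Hadoop configuration rather than the SQL config):

```scala
// Sketch: applying the slide's settings when building a Spark 2.x session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dictionary-filtering-demo") // illustrative app name
  .config("spark.sql.hive.convertMetastoreParquet", "true")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
  .getOrCreate()

// The parquet.* reader options are passed via the Hadoop configuration.
spark.sparkContext.hadoopConfiguration.setBoolean("parquet.filter.statistics.enabled", true)
spark.sparkContext.hadoopConfiguration.setBoolean("parquet.filter.dictionary.enabled", true)
```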
  26. Efficient Dynamic Partition Inserts [SPARK-15420*] • Parquet buffers row group data for each open file during writes, so writing many partitions at once is memory-hungry • Spark already sorts before writes, but with some limitations • Detect whether the data is already sorted • Expose the ability to repartition data before write (see the sketch below)
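A sketch of the manual workaround in user code, using the standard DataFrame API (the table name, column name, and destination path are made up): repartitioning by the partition column before the write means each task holds one open Parquet writer at a time instead of one per partition.

```scala
// Sketch: repartition by the partition column before a partitioned write.
// Table/column names ("events", "dateint") and the path are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
val events = spark.table("events")

events
  .repartition(events.col("dateint"))       // group rows by target partition
  .write
  .partitionBy("dateint")
  .parquet("s3://bucket/warehouse/events")  // illustrative destination
```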
  27. [Diagram: YARN Resource Manager, Node Manager, Spark AM, RDD, S3 Filesystem, Dynamic Allocation, Parquet R/W, Spark History Server]
  28. Spark History Server – Where is My Job?
  29. Spark History Server – Where is My Job? • A single large application can prevent new applications from showing up • It is not uncommon to see event logs of several GBs • SPARK-13988 makes event log processing multi-threaded • GC tuning helps further: move from CMS to G1 GC and allocate more space to the young generation
  30. Extract Transform Load
  31. Pig vs. Spark: Job #1 [DAG: six load + filter inputs feeding a series of joins and foreachs, a group, and a final foreach + filter + store]
  32. Pig vs. Spark (Scala) vs. PySpark [Chart: avg. completion time in seconds for Job #1; ~2.4x speedup for Spark and ~2x for PySpark relative to Pig]
  33. Pig vs. Spark: Job #2 [DAG: three load + filter inputs feeding a cogroup, joins, several foreachs, and a final order by + store]
  34. Pig vs. Spark (Scala) vs. PySpark [Chart: avg. completion time in seconds for Job #2; ~3.2x speedup for Spark and ~1.6x for PySpark relative to Pig]
  35. Production Workflow [Diagram: Prototype → Build → Deploy → Run, with artifacts in S3]
  36. Production Spark Application #1: Yogen • A rapid innovation platform for targeting algorithms • 5 hours (vs. tens of hours) to compute similarity for all Netflix profiles over a 30-day window of new-arrival titles • 10 minutes to score 4M profiles over a 14-day window of new-arrival titles
  37. Production Spark Application #2: ARO • Personalized ordering of rows of titles • Enriches page/row/title features with play history • 14 stages, tens of thousands of tasks, several TBs of data
  38. 38. What’s Next? • Improved Parquet support • Better visibility • Explore new use cases
  39. Questions?
