Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Spark Summit 2016 talk by Ashwin Shankar (Netflix) and Nezih Yigitbasi (Netflix)

Published in: Data & Analytics
  1. Ashwin Shankar, Nezih Yigitbasi: Productionizing Spark on YARN for ETL
  2. Scale
  3. Netflix Key Business Metrics: 81+ million members globally; 1000+ devices supported; 125 million hours / day
  4. Netflix Key Platform Metrics: 40 PB data warehouse; 3 PB read; 300 TB written; 700B events
  5. Outline ● Big Data Platform Architecture ● Technical Challenges ● ETL
  6. Big Data Platform Architecture
  7. Data Pipeline: event data flows from cloud apps through Kafka and Ursula into S3 (~1 min); dimension data flows from Cassandra SSTables through Aegisthus into S3 (daily)
  8. Platform stack: Tools (Big Data Portal, Big Data API); Services (Transport, Visualization, Quality, Pig Workflow Vis, Job/Cluster Vis, Interface, Execution, Metadata, Notebooks); Compute; Storage (S3, Parquet)
  9. Spark on YARN at Netflix: • 3,000 EC2 nodes on two clusters (d2.4xlarge: 16 vcores, 120 GB) • Multiple Spark versions • Spark shares the same infrastructure with MapReduce jobs
  10. Technical Challenges
  11. [Architecture diagram: YARN Resource Manager and Node Manager, Spark AM, RDD]
  12. Custom Coalescer Support [SPARK-14042] • coalesce() can only “merge” using the given number of partitions – how to merge by size? • CombineFileInputFormat with Hive • Support custom partition coalescing strategies
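A merge-by-size strategy of the kind the pluggable coalescer enables can be sketched without Spark. This is an illustrative stand-in, not the SPARK-14042 API; `coalesce_by_size` and `target_bytes` are names invented for the example.

```python
# Hypothetical sketch of size-based partition coalescing: greedily pack
# adjacent parent partitions into groups whose combined size stays at or
# under a target, so small files merge while a huge partition stands alone.

def coalesce_by_size(partition_sizes, target_bytes):
    """Return groups of parent-partition indexes, each group's total size
    kept at or under target_bytes where possible."""
    groups, current, current_bytes = [], [], 0
    for idx, size in enumerate(partition_sizes):
        if current and current_bytes + size > target_bytes:
            groups.append(current)          # close the full group
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# Four 64MB partitions merge pairwise at a 128MB target; a 500MB one stands alone.
print(coalesce_by_size([64, 64, 64, 64, 500], target_bytes=128))
# → [[0, 1], [2, 3], [4]]
```

A real coalescer would also weigh locality, which is why the interface is pluggable rather than hard-coding one policy.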
  13. UnionRDD Parallel Listing [SPARK-9926] • Parent RDD partitions are listed sequentially • Slow for tables with lots of partitions • Parallelize listing of parent RDD partitions
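The shape of that fix, listing each parent concurrently instead of one at a time, can be sketched in plain Python. `list_partitions` is a stand-in for whatever slow per-parent metadata call (e.g. an S3 LIST) dominates startup; none of these names are Spark API.

```python
# Illustrative sketch of parallel parent listing: each parent RDD's
# partitions are listed on a separate thread, and results are flattened
# in parent order, mirroring what a UnionRDD needs.

from concurrent.futures import ThreadPoolExecutor

def list_partitions(parent_id):
    # Stand-in for a slow, I/O-bound per-parent metadata call.
    return [f"parent{parent_id}/part-{i}" for i in range(3)]

def list_union_partitions(parent_ids, max_threads=8):
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() preserves input order, so the union's partition order
        # is the same as a sequential listing would produce.
        per_parent = pool.map(list_partitions, parent_ids)
        return [p for parts in per_parent for p in parts]

print(len(list_union_partitions(range(100))))  # → 300
```

Because the work is I/O-bound, threads give near-linear speedup even under the GIL, which is the same reason the JVM-side fix pays off for S3-backed tables.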
  14. [Architecture diagram: YARN Resource Manager and Node Manager, Spark AM, RDD, S3 Filesystem]
  15. Optimize S3 Listing Performance [HADOOP-12810] • Unnecessary getFileStatus() call • SPARK-9926 and HADOOP-12810 yield faster startup • ~20x speedup in input split calculation
  16. Output Committers. Hadoop Output Committer: • Write to a temp directory and rename to destination on success • S3 rename => copy + delete • S3 is eventually consistent. S3 Output Committer: • Write to local disk and upload to S3 on success • avoid redundant S3 copy • avoid eventual consistency
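The "write locally, publish on success" idea can be sketched with local files standing in for S3. `upload()` below is a stand-in for a single S3 PUT; the point is that commit is one upload, with no rename (copy + delete) and no window where readers see a half-written temp directory.

```python
# Minimal sketch of an S3-style output committer: a task writes its output
# to per-task local scratch space, and only on task commit uploads the file
# to the final destination in one step.

import os, shutil, tempfile

def upload(local_path, dest_path):
    # Stand-in for an S3 PUT; locally this is just a copy.
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    shutil.copyfile(local_path, dest_path)

def run_task(dest_dir, name, rows):
    scratch = tempfile.mkdtemp()                 # per-task local scratch
    local = os.path.join(scratch, name)
    with open(local, "w") as f:                  # attempt output stays local
        f.writelines(row + "\n" for row in rows)
    upload(local, os.path.join(dest_dir, name))  # commit = one upload
    shutil.rmtree(scratch)                       # abort would skip upload

out = tempfile.mkdtemp()
run_task(out, "part-00000", ["a", "b"])
print(sorted(os.listdir(out)))  # → ['part-00000']
```

A failed attempt simply never uploads, so speculative or retried tasks cannot leave partial output at the destination.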
  17. [Architecture diagram: YARN Resource Manager and Node Manager, Spark AM, RDD, S3 Filesystem, Dynamic Allocation]
  18. Poor Broadcast Read Performance [SPARK-13328] • Broadcast joins/variables • Replicas can be removed with dynamic allocation ... 16/02/13 01:02:27 WARN BlockManager: Failed to fetch remote block broadcast_18_piece0 (failed attempt 70) ... 16/02/13 01:02:27 INFO TorrentBroadcast: Reading broadcast variable 18 took 1051049 ms • Refresh replica locations from the driver on multiple failures
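The gist of the fix, refreshing the replica list instead of retrying stale locations forever, can be sketched outside Spark. `fetch()` and `refresh_locations()` are hypothetical stand-ins for the BlockManager/driver calls, not Spark internals.

```python
# Sketch of the SPARK-13328 idea: after several failed attempts to fetch a
# broadcast block, ask the driver for the current replica locations (the old
# ones may belong to executors removed by dynamic allocation) rather than
# retrying the stale list. A real implementation would also cap total attempts.

REFRESH_AFTER = 5  # failures tolerated before refreshing from the driver

def fetch_block(locations, fetch, refresh_locations):
    failures = 0
    while True:
        for loc in locations:
            try:
                return fetch(loc)
            except IOError:
                failures += 1
                if failures >= REFRESH_AFTER:
                    locations = refresh_locations()  # current list from driver
                    failures = 0

# Demo: two dead replicas; the refreshed list points at a live executor.
def fetch(loc):
    if loc == "live-executor":
        return "broadcast_18_piece0"
    raise IOError(loc)

print(fetch_block(["dead-1", "dead-2"], fetch, lambda: ["live-executor"]))
```

Without the refresh, the loop would cycle through the dead replicas indefinitely, which is exactly the 70-attempt, 1051-second stall shown in the log excerpt above.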
  19. Incorrect Locality Optimization [SPARK-13779] • Cancel & resend pending container requests • if the locality preference is no longer needed • if no locality preference is set • No locality information with S3 • Do not cancel requests without locality preference
  20. [Architecture diagram: YARN Resource Manager and Node Manager, Spark AM, RDD, S3 Filesystem, Dynamic Allocation, Parquet R/W]
  21. Parquet Dictionary Filtering [PARQUET-384*] [Diagram: a table with columns A, B, C, D (rows a1..aN, b1..bN, c1..cN, d1..dN) alongside its per-column dictionary; from “Analytic Data Storage in Hadoop”, Ryan Blue]
  22. Parquet Dictionary Filtering [PARQUET-384*] [Chart: Avg. Completion Time [m] for DF disabled vs. DF enabled at 64MB and 1G splits, for two workloads; ~8x and ~18x speedups]
  23. How to Enable Dictionary Filtering? Property = value (description): spark.sql.hive.convertMetastoreParquet = true (enable native Parquet read path); parquet.filter.statistics.enabled = true (enable stats filtering); parquet.filter.dictionary.enabled = true (enable dictionary filtering); spark.sql.parquet.filterPushdown = true (enable Parquet filter pushdown optimization); spark.sql.parquet.mergeSchema = false (disable schema merging); spark.sql.hive.convertMetastoreParquet.mergeSchema = false (use Hive SerDe instead of built-in Parquet support)
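One way to apply these settings is at submit time; a sketch, with `my_etl_job.py` a placeholder job name. Note that depending on your Spark version, the `parquet.*` Hadoop properties may need to be set with a `spark.hadoop.` prefix or in `core-site.xml` rather than as bare `--conf` keys.

```shell
# Dictionary-filtering settings from the table above, passed at submit time.
spark-submit \
  --conf spark.sql.hive.convertMetastoreParquet=true \
  --conf parquet.filter.statistics.enabled=true \
  --conf parquet.filter.dictionary.enabled=true \
  --conf spark.sql.parquet.filterPushdown=true \
  --conf spark.sql.parquet.mergeSchema=false \
  --conf spark.sql.hive.convertMetastoreParquet.mergeSchema=false \
  my_etl_job.py
```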
  24. Efficient Dynamic Partition Inserts [SPARK-15420*] • Parquet buffers row group data for each file during writes • Spark already sorts before writes, but has some limitations • Detect if the data is already sorted • Expose the ability to repartition data before write
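Why sorting matters here can be shown with a small plain-Python model (no Spark): count how many partition writers, each buffering a Parquet row group, must be open at once for a given row order. `open_writer_peak` is an illustrative name, not Spark API.

```python
# Model of writer memory pressure during a dynamic partition insert: a
# writer for a partition can close only once no further rows for that
# partition will arrive, so interleaved keys keep many buffers live while
# sorted keys need just one at a time.

def open_writer_peak(partition_keys):
    """Peak number of simultaneously open partition writers for this order."""
    last_index = {k: i for i, k in enumerate(partition_keys)}
    open_now, peak = set(), 0
    for i, k in enumerate(partition_keys):
        open_now.add(k)
        peak = max(peak, len(open_now))
        if last_index[k] == i:      # last row for this partition: close writer
            open_now.discard(k)
    return peak

interleaved = ["a", "b", "c", "a", "b", "c"]
print(open_writer_peak(interleaved))          # → 3 row-group buffers at once
print(open_writer_peak(sorted(interleaved)))  # → 1 buffer at a time
```

With thousands of output partitions the unsorted case means thousands of concurrent row-group buffers, which is the memory problem sorting (or repartitioning) before the write avoids.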
  25. [Architecture diagram: YARN Resource Manager and Node Manager, Spark AM, RDD, S3 Filesystem, Dynamic Allocation, Parquet R/W, Spark History Server]
  26. Spark History Server – Where is My Job?
  27. Spark History Server – Where is My Job? • A large application can prevent new applications from showing up • not uncommon to see event logs of GBs • SPARK-13988 makes the processing multi-threaded • GC tuning helps further • move from CMS to G1GC • allocate more space to young generation
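A history-server GC setup along those lines might look like the following; the heap and young-generation sizes are purely illustrative and should be tuned per deployment (`SPARK_HISTORY_OPTS` is the standard env var read by the history server's launch scripts).

```shell
# Move the history server from CMS to G1GC and give the young generation
# more room (G1NewSizePercent is an experimental JVM flag and requires
# UnlockExperimentalVMOptions on Java 8). Sizes here are examples only.
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS \
  -XX:+UseG1GC \
  -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=40 \
  -Xms8g -Xmx8g"
```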
  28. Extract Transform Load
  29. Pig vs. Spark [DAG: six load + filter branches feeding a tree of joins, then join/group/foreach stages, ending in foreach + filter + store]
  30. Pig vs. Spark (Scala) vs. PySpark [Chart: Avg. Completion Time [s] for Pig, Spark, and PySpark; ~2.4x and ~2x speedups over Pig]
  31. Production Workflow [Diagram: Prototype → Build → Deploy → Run, with S3]
  32. Production Spark Application #1: Yogen • A rapid innovation platform for targeting algorithms • 5 hours (vs. 10s of hours) to compute similarity for all Netflix profiles for a 30-day window of new arrival titles • 10 minutes to score 4M profiles for a 14-day window of new arrival titles
  33. Production Spark Application #2: ARO • Personalized ordering of rows of titles • Enrich page/row/title features with play history • 14 stages, ~10Ks of tasks, several TBs
  34. What’s Next? • Improved Parquet support • Better visibility • Explore new use cases
  35. Questions?
