Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Daniel Dai, Zirui Li
Pinterest Inc
Pinterest is moving all batch processing to Apache Spark, including a large number of legacy ETL workflows written in Cascading/Scalding. In this talk, we share the challenges and solutions from this migration: the motivation for migrating, how to bridge the semantic gap between the engines, the difficulty of dealing with the Thrift objects widely used at Pinterest, how we improved Spark accumulators, how to tune Spark performance after migration using our innovative Spark profiler, and the performance improvements and cost savings we achieved.

1. Migrating ETL Workflow to Spark at Scale in Pinterest
   Daniel Dai, Zirui Li
   Pinterest Inc
2. About Us
   • Daniel Dai
     • Tech Lead at Pinterest
     • PMC member for Apache Hive and Pig
   • Zirui Li
     • Software Engineer on the Pinterest Spark Platform Team
     • Focuses on building Pinterest's in-house Spark platform and functionality
3. Agenda
   ▪ Spark @ Pinterest
   ▪ Cascading/Scalding to Spark Conversion
   ▪ Technical Challenges
   ▪ Migration Process
   ▪ Result and Future Plan
4. Agenda
   • Spark @ Pinterest
   • Cascading/Scalding to Spark Conversion
   • Technical Challenges
   • Migration Process
   • Result and Future Plan
5. We Are on Cloud
   • We use AWS
   • However, we build our own clusters
     • Avoid vendor lock-in
     • Timely support by our own team
   • We store everything on S3
     • Costs less than HDFS
     • HDFS is for temporary storage
   [Diagram: EC2 nodes running HDFS and YARN, with S3 as the shared storage layer]
6. Spark Clusters
   • We have a couple of Spark clusters
     • From several hundred nodes to 1000+ nodes
     • Spark-only and mixed-use clusters
     • Cross-cluster routing
   • R5D instance type for the Spark-only cluster
     • Faster local disk
     • High memory-to-CPU ratio
7. Spark Versions and Use Cases
   • We are running Spark 2.4
     • With quite a few internal fixes
     • Will migrate to 3.1 this year
   • Use cases
     • Production use cases: SparkSQL, PySpark, native Spark via Airflow
     • Ad hoc use cases: SparkSQL via Querybook, PySpark via Jupyter
8. Migration Plan
   • 40% of workloads are already on Spark
     • The number was 12% one year ago
   • Migration in progress
     • Hive to SparkSQL
     • Cascading/Scalding to Spark
     • Hadoop Streaming to Spark pipe
   [Diagram: migration progress across Hive, Cascading/Scalding, and Hadoop Streaming ("Where are we?")]
9. Migration Plan
   • Half of the workloads are still on Cascading/Scalding
     • ETL use cases
   • Spark future
     • Query engine: Presto/SparkSQL
     • ETL: native Spark
     • Machine learning: PySpark
10. Agenda
    • Spark in Pinterest
    • Cascading/Scalding to Spark Conversion
    • Technical Challenges
    • Migration Process
    • Result and Future Plan
11. Cascading
    • Simple DAG
      • Only 6 different pipes
    • Most logic in UDFs
      • Each: UDF in map
      • Every: UDF in reduce
    • Java API
    [Diagram: Pattern 1: Source -> Each -> GroupBy -> Every -> Sink; Pattern 2: two Source -> Each branches -> CoGroup -> Every -> Sink]
12. Scalding
    • Rich set of operators on top of Cascading
    • Operators are very similar to Spark RDD
    • Scala API
13. Migration Path
    • SparkSQL
      + SQL is easy to migrate to any engine
      − UDF interface is private
      Recommended if there are not many UDFs
    • PySpark
      + Rich Python libraries available to use
      − Suboptimal performance, especially for Python UDFs
      Recommended for machine learning only
    • Native Spark
      + Most structured path to enjoy the rich Spark syntax
      + Works for almost all Cascading/Scalding applications
      Default and recommended for general cases
14. Spark API
    • Spark Dataframe/Dataset
      + Newer and recommended API
      − Most inputs are Thrift sequence files, and encoding/decoding Thrift objects to/from a dataframe is slow
      Recommended only for non-Thrift sequence file inputs
    • RDD
      + More flexible in handling Thrift object serialization/deserialization
      + Semantically close to Scalding
      − Older API, less performant than Dataframe
      Default choice for the conversion
15. Approach
    • Rewrite the application manually
    • Reuse most of the Cascading/Scalding library code
      • However, avoid Cascading-specific structures
    • Automatic tool to help with result validation and performance tuning
16. Translate Cascading
    • DAG is usually simple
    • Most Cascading pipes have a one-to-one mapping to a Spark transformation
    • Complexity is in the UDFs

    Cascading Pipe | Spark RDD Operator                              | Note
    Each           | map-side UDF                                    |
    Every          | reduce-side UDF                                 |
    Merge          | union                                           |
    CoGroup        | join/leftOuterJoin/rightOuterJoin/fullOuterJoin |
    GroupBy        | groupBy/groupByKey                              | secondary sort might be needed
    HashJoin       | broadcast join                                  | no native support in RDD; simulate via a broadcast variable

    Example from the slide (HashJoin via a broadcast variable):

        // val processedInput: RDD[(String, Token)]
        // val tokenFreq: RDD[(String, Double)]
        val tokenFreqVar = spark.sparkContext.broadcast(tokenFreq.collectAsMap())
        val joined = processedInput.map { t =>
          (t._1, (t._2, tokenFreqVar.value.get(t._1)))
        }
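To make the pipe-to-operator mapping above concrete, here is a minimal, self-contained sketch of the Pattern 1 DAG (Source -> Each -> GroupBy -> Every -> Sink) expressed with RDD operators. The input path, record format, and aggregation are illustrative assumptions, not Pinterest code.

```scala
import org.apache.spark.sql.SparkSession

object CascadingPatternSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cascading-pattern-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Source
    val lines = sc.textFile("s3://bucket/input")
    // Each: map-side UDF, here parsing "user<TAB>score" lines
    val parsed = lines.map { line =>
      val parts = line.split("\t")
      (parts(0), parts(1).toDouble)
    }
    // GroupBy + Every: reduce-side UDF, here summing scores per user
    val summed = parsed.groupByKey().mapValues(_.sum)
    // Sink
    summed.saveAsTextFile("s3://bucket/output")
    spark.stop()
  }
}
```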
17. UDF Translation: Cascading UDF vs Spark
    • Semantic difference
      • Cascading: a UDF does both filtering and transformation; Java
      • Spark: map + filter; Scala
    • Multi-threading
      • Cascading: single-thread model
      • Spark: multi-thread model; worst case, set executor-cores=1
    • UDF initialization and cleanup
      • Cascading: class with initialization and cleanup hooks
      • Spark: no init/cleanup hook; use mapPartitions to simulate

        .mapPartitions { iter =>
          // init block: expensive initialization
          val out = scala.collection.mutable.ArrayBuffer[OutputRecord]()
          while (iter.hasNext) {
            out += process(iter.next())   // per-record UDF logic
          }
          // cleanup block
          out.iterator
        }
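For illustration, a self-contained version of that mapPartitions pattern, as a minimal sketch only: the Resource class stands in for whatever a Cascading UDF would open in its prepare hook and close in cleanup, and none of these names come from Pinterest code.

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

object MapPartitionsUdfSketch {
  // Stand-in for an expensive per-task resource (e.g. a dictionary or a client connection).
  class Resource {
    def lookup(s: String): String = s.toUpperCase
    def close(): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
    val result = spark.sparkContext.parallelize(Seq("a", "b", "c")).mapPartitions { iter =>
      val resource = new Resource()            // init block, once per partition
      val out = ArrayBuffer[String]()
      while (iter.hasNext) {
        out += resource.lookup(iter.next())    // per-record UDF logic
      }
      resource.close()                         // cleanup block
      out.iterator
    }
    result.collect().foreach(println)
    spark.stop()
  }
}
```

Buffering the whole partition keeps the cleanup call simple; for very large partitions, a lazy iterator with cleanup deferred until the last element is consumed would be preferable.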
18. Translate Scalding
    • Most operators have a one-to-one mapping to an RDD operator
    • UDFs can be used in Spark without change

    Scalding Operator | Spark RDD Operator | Note
    map               | map                |
    flatMap           | flatMap            |
    filter            | filter             |
    filterNot         | filter             | Spark does not have filterNot; use filter with a negated condition
    groupBy           | groupBy            |
    group             | groupByKey         |
    groupAll          | groupBy(t => 1)    |
    ...
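Two of the non-obvious rows above, sketched with toy data. This assumes an existing SparkContext named sc and is illustrative only, not Pinterest code.

```scala
val nums = sc.parallelize(1 to 10)

// Scalding filterNot(p)  ->  Spark filter with the negated predicate
val notEven = nums.filter(n => !(n % 2 == 0))

// Scalding groupAll  ->  group everything under a single constant key
val all = nums.groupBy(_ => 1)        // RDD[(Int, Iterable[Int])] with one group

println(notEven.collect().toList)     // List(1, 3, 5, 7, 9)
println(all.collect().toList)
```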
19. Agenda
    • Spark in Pinterest
    • Cascading/Scalding to Spark Conversion
    • Technical Challenges
    • Migration Process
    • Result and Future Plan
20. Secondary Sort
    • Use "repartitionAndSortWithinPartitions" in Spark
    • There's a gap in semantics: use GroupSortedIterator to fill the gap

        output = new GroupBy(output, new Fields("user_id"), new Fields("sec_key"));

    Input as (group key, sort key), value:
        (2, 2), "apple"
        (1, 3), "facebook"
        (1, 1), "pinterest"
        (1, 2), "twitter"
        (3, 2), "google"

    Cascading output (one iterator per group key):
        iterator for key 1: (1, 1), "pinterest"  (1, 2), "twitter"  (1, 3), "facebook"
        iterator for key 2: (2, 2), "apple"
        iterator for key 3: (3, 2), "google"

    Spark output (one sorted stream within the partition):
        (1, 1), "pinterest"
        (1, 2), "twitter"
        (1, 3), "facebook"
        (2, 2), "apple"
        (3, 2), "google"
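For reference, a minimal sketch of the Spark side of this, matching the (user_id, sec_key) example above: partition by user_id only, sort within each partition by the full key, then walk the sorted stream and cut it into per-user groups (which is what a GroupSortedIterator-style wrapper would do). GroupSortedIterator itself is Pinterest-internal; everything here is illustrative.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

object SecondarySortSketch {
  // Route records by user_id only, so all sec_keys of a user land in the same partition.
  class UserIdPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = {
      val (userId, _) = key.asInstanceOf[(Int, Int)]
      math.abs(userId.hashCode) % numPartitions
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("secondary-sort-sketch").getOrCreate()
    val input = spark.sparkContext.parallelize(Seq(
      ((2, 2), "apple"), ((1, 3), "facebook"), ((1, 1), "pinterest"),
      ((1, 2), "twitter"), ((3, 2), "google")))

    // Sorted by (user_id, sec_key) within each partition; the tuple Ordering is implicit.
    val sorted = input.repartitionAndSortWithinPartitions(new UserIdPartitioner(2))
    sorted.collect().foreach(println)
    spark.stop()
  }
}
```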
21. Accumulators
    • Spark accumulators are not accurate
      • Stage retry
      • The same code runs multiple times in different stages
    • Solution
      • Deduplicate with stage + partition
      • persist

        val sc = new SparkContext(conf)
        val inputRecords = sc.longAccumulator("Input")
        val a = sc.textFile("studenttab10k")
        val b = a.map(line => line.split("\t"))
        val c = b.map { t =>
          inputRecords.add(1L)
          (t(0), t(1).toInt, t(2).toDouble)
        }
        val sumScore = c.map(t => t._3).sum()
        // c.persist()
        c.map { t =>
          (t._1, t._3 / sumScore)
        }.saveAsTextFile("output")

    In this example, c is computed once for sum() and again for saveAsTextFile, so inputRecords counts every record twice; persisting c (the commented-out line) avoids the recomputation.
22. Accumulators, Continued
    • Retrieve the accumulator value of the earliest stage
    • Exception: the user intentionally uses the same accumulator in different stages

    NUM_OUTPUT_TOKENS
        Stage 14: 168006868318
        Stage 21: 336013736636

        val sc = new SparkContext(conf)
        val inputRecords = sc.longAccumulator("Input")
        val input1 = sc.textFile("input1")
        val input1_processed = input1.map { line =>
          inputRecords.add(1L)
          val t = line.split("\t")
          (t(0), (t(1).toInt, t(2).toDouble))
        }
        val input2 = sc.textFile("input2")
        val input2_processed = input2.map { line =>
          inputRecords.add(1L)
          val t = line.split("\t")
          (t(0), (t(1).toInt, t(2).toDouble))
        }
        input1_processed.join(input2_processed)
          .saveAsTextFile("output")
23. Accumulator Tab in Spark UI
    • SPARK-35197
24. Profiling
    • Visualize flame graphs using Nebula
    • Real-time
    • Ability to segment into stage/task
    • Focus on only the useful threads
25. OutputCommitter
    • Issues with OutputCommitter
      • Slow metadata operations
      • 503 errors
    • Netflix s3committer
      • Wrapper for Spark RDD
      • s3committer only supports the old API
26. Agenda
    • Spark @ Pinterest
    • Cascading/Scalding to Spark Conversion
    • Technical Challenges
    • Migration Process
    • Result and Future Plan
27. Automatic Migration Service (AMS)
    • A tool to automate the majority of the migration process
28. Data Validation
    • Row counts and checksum comparison
      • Create a table around the output
      • SparkSQL UDF: CountAndChecksumUdaf
    • Limitations
      − Doesn't work for double/float
      − Doesn't work for arrays if the order is different
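A rough sketch of the idea, for illustration only: compute a row count and an order-insensitive checksum over each job's output and compare them. CountAndChecksumUdaf is Pinterest-internal, so hash() plus sum() stands in for it here; the paths, input format, and helper name are assumptions, and the float/array caveats above still apply.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._

object ValidationSketch {
  // Row count plus an order-insensitive checksum: hash each row, then sum the hashes.
  def countAndChecksum(df: DataFrame): Row =
    df.agg(
      count(lit(1)).as("row_count"),
      sum(abs(hash(df.columns.map(col): _*)).cast("long")).as("checksum")
    ).first()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("validation-sketch").getOrCreate()
    val legacy   = countAndChecksum(spark.read.parquet("s3://bucket/cascading_output"))
    val migrated = countAndChecksum(spark.read.parquet("s3://bucket/spark_output"))
    require(legacy == migrated, s"validation failed: $legacy vs $migrated")
    spark.stop()
  }
}
```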
29. Sources of Uncertainty
    • Input depends on the current timestamp
    • There's a random number generator in the code
    • Rounding differences that lead to different outcomes of filter conditions
    • Unstable top result if there's a tie
30. Performance Tuning
    • Collect runtime memory/vcore usage
    • Tuning passes if the criteria are met:
      ▪ Runtime reduced
      ▪ Vcore*sec reduced by 20%+
      ▪ Memory increase less than 100%
    • Retry with tuned memory/vcore if necessary
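The pass criteria encoded as a tiny check, for illustration; the field names are hypothetical, not from AMS.

```scala
// Metrics for one run of a job; names are illustrative.
case class RunMetrics(runtimeSec: Double, vcoreSec: Double, memoryGbSec: Double)

// The criteria from the slide: faster, at least 20% cheaper in vcore*sec,
// and less than doubling memory usage.
def tuningPassed(legacy: RunMetrics, spark: RunMetrics): Boolean =
  spark.runtimeSec < legacy.runtimeSec &&
  spark.vcoreSec <= legacy.vcoreSec * 0.8 &&
  spark.memoryGbSec < legacy.memoryGbSec * 2.0
```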
31. Balancing Performance
    • Trade-offs
      • More executors: better performance, but costs more
      • More cores per executor: saves memory, but costs more on CPU
    • Using dynamic allocation usually saves cost
      • Skew won't cost more with dynamic allocation
    • Control parallelism
      • spark.default.parallelism for RDD
      • spark.sql.shuffle.partitions for dataframe/dataset/SparkSQL
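For reference, a sketch of where these knobs live when building a session; the specific values are illustrative only and would be tuned per job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("migrated-etl-job")
  .config("spark.dynamicAllocation.enabled", "true")    // usually saves cost; skew costs no extra
  .config("spark.shuffle.service.enabled", "true")      // needed for dynamic allocation on YARN
  .config("spark.executor.cores", "4")                  // more cores per executor: less memory, more CPU pressure
  .config("spark.executor.memory", "8g")
  .config("spark.default.parallelism", "2000")          // parallelism for RDD jobs
  .config("spark.sql.shuffle.partitions", "2000")       // parallelism for dataframe/dataset/SparkSQL
  .getOrCreate()
```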
32. Automatic Migration & Failure Handling
    • Automatic migration
      ▪ Automatically pick Spark over Cascading/Scalding at runtime if the conditions are met:
        ▪ Data validation passes
        ▪ Performance optimization passes
    • Failure handling
      ▪ Automatically handle failures with handlers where applicable
        ▪ Configuration incorrectness
        ▪ OutOfMemory
        ▪ ...
      ▪ Manual troubleshooting is needed for other uncaught failures
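Purely illustrative sketch of the handler idea: map known failure categories to an automatic remedy and fall back to manual triage otherwise. None of these names come from AMS.

```scala
sealed trait Remedy
case object FixConfiguration extends Remedy
case object BumpExecutorMemory extends Remedy
case object NeedsManualTriage extends Remedy

// Route a failure to a remedy; the exception-to-category mapping is an assumption.
def handler(failure: Throwable): Remedy = failure match {
  case _: IllegalArgumentException => FixConfiguration      // configuration incorrectness
  case _: OutOfMemoryError         => BumpExecutorMemory    // OOM: retry with more memory
  case _                           => NeedsManualTriage     // uncaught failures
}
```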
33. Agenda
    • Spark @ Pinterest
    • Cascading/Scalding to Spark Conversion
    • Technical Challenges
    • Migration Process
    • Result and Future Plan
34. Result
    • 40% performance improvement
    • 47% cost saving on CPU
    • Use 33% more memory
35. Future Plan
    • Manual conversion for applications that are still evolving
    • Spark backend for legacy applications
36. Feedback
    Your feedback is important to us.
    Don't forget to rate and review the sessions.
