Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil

In this talk, we will discuss several advantages of the Spark RDD API for developing custom applications compared to pure SQL-like interfaces such as Hive. In particular, we will describe how to control data distribution, avoid data skew, and implement application-specific optimizations in order to build performant and reliable data pipelines. To illustrate these ideas, we will share our experience of redesigning a large-scale, complex (100+ stage) language model training pipeline for Spark that was originally built in Hive. The final Spark-based pipeline is modular, readable, and more maintainable than the previous set of HQL queries. In addition to these qualitative improvements, we also observed a significant reduction in both resource usage and data landing time. Finally, we will describe Spark optimizations that we implemented for this workload and that can be applied to batch workloads in general.

1. Custom applications with Spark’s RDD. Tejas Patil, Facebook
2. Agenda: Use case • Real-world applications • Previous solution • Spark version • Data skew • Performance evaluation
3. N-gram language model training
4. Example: “Can you please come here ?” (diagram marking the 5-gram, its history, and the word being predicted)
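
To make the 5-gram setup concrete, here is a minimal plain-Scala sketch (not from the talk) of deriving (history, predicted word) pairs from a tokenized sentence; the four preceding words form the history, and together with the predicted word they make up a 5-gram:

    // Every 5-token window is a 5-gram: the first four tokens are the history,
    // the last token is the word being predicted.
    val tokens = "Can you please come here ?".split(" ").toSeq

    val fiveGrams: Seq[(Seq[String], String)] =
      tokens.sliding(5).map(window => (window.init, window.last)).toSeq

    fiveGrams.foreach { case (history, predicted) =>
      println(s"history = [${history.mkString(" ")}]  predicted = $predicted")
    }
    // history = [Can you please come]  predicted = here
    // history = [you please come here]  predicted = ?
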
5. Real-world applications
6. Auto-subtitling for Page videos
7. Detecting low-quality places: non-public places (e.g., “My home”, “Home sweet home”); non-real places (e.g., “Apt #00, Fake lane, Foo City, CA”; “Mordor, Westeros !!”); content not suitable for viewing (anything containing nudity, intense sexuality, profanity, or disturbing content)
8. Previous solution
9. [Pipeline diagram] Sub-model training jobs 1 to n (Hive queries) produce intermediate sub-models LM 1 .. LM n (Hive tables), which an interpolation algorithm combines into the final language model.
10. [Pipeline diagram, sub-model 1 expanded] The Hive query behind a sub-model training job:

    INSERT OVERWRITE TABLE sub_model_1
    SELECT ....
    FROM (
      REDUCE m.ngram, m.group_key, m.count
      USING "./train_model --config=myconfig.json ...."
      AS `ngram`, `count`, ...
      FROM (
        SELECT ... FROM data_source
        WHERE ...
        DISTRIBUTE BY group_key
      )
    )
    GROUP BY `ngram`
11. Lessons learned: SQL is not a good choice for building such applications (duplication, poor readability, brittle with no testing); alternatives considered: map-reduce, query templating; high latency while training with large data
12. Spark solution
13. Spark solution: same high-level architecture; Hive tables as final inputs and outputs; same binaries as used in Hive TRANSFORM; RDDs, not Datasets; the `pipe()` operator (sketched below); modular, readable, maintainable
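
A minimal sketch of the `pipe()` pattern this slide describes, assuming a training binary that reads records on stdin and writes results on stdout (the binary name and config flag are taken from the earlier Hive query; the paths, partition count, and group-key extraction are illustrative assumptions):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("submodel-training"))

    // Hypothetical: assume the group key is the first tab-separated column of each record.
    def groupKey(line: String): String = line.split("\t")(0)

    val records: RDD[(String, String)] =
      sc.textFile("hdfs:///path/to/data_source").map(line => (groupKey(line), line))

    // partitionBy plays the role of DISTRIBUTE BY group_key in the Hive query;
    // pipe() then streams each partition through the same external training binary
    // that Hive TRANSFORM/REDUCE invoked.
    val trained: RDD[String] = records
      .partitionBy(new HashPartitioner(512))
      .map { case (_, line) => line }
      .pipe(Seq("./train_model", "--config=myconfig.json"))

    trained.saveAsTextFile("hdfs:///path/to/sub_model_1")

Because each sub-model becomes an ordinary RDD transformation, the many stages of the pipeline can be composed, tested, and reused as regular Scala code instead of templated HQL.
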
14. Configuration: a PipelineConfiguration captures where the input data is, where to store the final output, Spark-specific configs (“spark.dynamicAllocation.maxExecutors”, “spark.executor.memory”, “spark.memory.storageFraction”, ...), and a list of ComponentConfiguration entries
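
The slide only names the configuration pieces, so the following is a hypothetical sketch of what they might look like as Scala case classes; the field names and the contents of ComponentConfiguration are assumptions for illustration, not the talk's actual code:

    // Hypothetical shape of the configuration described on the slide.
    case class ComponentConfiguration(
      order: Int,                  // n-gram order of this sub-model (assumed field)
      trainerArgs: Seq[String]     // arguments for the external training binary (assumed field)
    )

    case class PipelineConfiguration(
      inputPath: String,                       // where the input data is
      outputPath: String,                      // where to store the final output
      sparkConfs: Map[String, String],         // Spark-specific configs
      largeShardThreshold: Long,               // used by the progressive-sharding loop later
      components: Seq[ComponentConfiguration]  // one entry per sub-model training job
    )

    // Example instantiation (all values are illustrative).
    val config = PipelineConfiguration(
      inputPath  = "hdfs:///path/to/training_data",
      outputPath = "hdfs:///path/to/language_model",
      sparkConfs = Map(
        "spark.dynamicAllocation.maxExecutors" -> "800",
        "spark.executor.memory"                -> "8g",
        "spark.memory.storageFraction"         -> "0.2"
      ),
      largeShardThreshold = 100000000L,
      components = Seq(ComponentConfiguration(order = 5, trainerArgs = Seq("--config=myconfig.json")))
    )
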
15. Scalability challenges: executors lost because they are unable to heartbeat; shuffle service OOM; frequent executor GC; executor OOM; Spark’s 2 GB limit on block size; exceptions while reading the output stream of the pipe process
16. Scalability challenges (same list as the previous slide)
17. Data skew
18. [Pipeline diagram repeated] Sub-model training jobs 1 to n (Hive queries) produce intermediate sub-models LM 1 .. LM n (Hive tables), combined by the interpolation algorithm into the language model.
19. [Pipeline diagram] The same pipeline, with each sub-model training job broken into its stages: n-gram extraction and counting, normalizing n-gram counts, and estimation and pruning.
20. [Example] A training dataset (“How are you”, “How are they”, “Its raining”, “How are we going”, “When are we going”, “You are awesome”, “They are working”, ...) and its n-gram word counts (<How are we going> : 1, <How are you> : 1, <How are they> : 1, <How are> : 4, <You are> : 1, <Its raining> : 1, <are> : 6, <you> : 1, <How> : 4, ...).
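
A minimal sketch of the n-gram extraction and counting step, assuming the training data is an RDD of whitespace-tokenized sentences and that counts are needed for every n-gram order up to 5 (both assumptions are for illustration):

    import org.apache.spark.rdd.RDD

    // Count every n-gram of order 1..maxOrder across the corpus.
    def countNgrams(sentences: RDD[String], maxOrder: Int = 5): RDD[(String, Long)] =
      sentences
        .flatMap { sentence =>
          val tokens = sentence.split("\\s+")
          for {
            n      <- 1 to math.min(maxOrder, tokens.length)
            window <- tokens.sliding(n)
          } yield (window.mkString(" "), 1L)
        }
        .reduceByKey(_ + _)

For the sentences above this yields counts of the same shape as the slide, e.g. ("How are", ...), ("are", ...), ("How are we going", 1).
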
21. [Example] The word counts (<How are we going> : 1, <are we going> : 2, <we going> : 2, <going> : 1, <When are we going> : 1, <Its raining> : 1, <You are awesome> : 1, ...) are partitioned based on their 2-word suffix.
22. [Example] After partitioning, n-grams sharing the same 2-word suffix (<How are we going>, <are we going>, <we going>, <When are we going>) land in one shard, while n-grams with other suffixes (<Its raining>, <You are awesome>, ...) land in other shards.
23. The 0’th shard holds the frequency of every word: <are> : 6, <How> : 4, <you> : 1, <doing> : 1, <going> : 1, <awesome> : 1, <working> : 1, ...
24. Distribution of shards (1-word sharding): n-grams with the same 2-word suffix fall in the same shard (shards 1 to n-1); the 0-shard, which has the frequency of every word, is shipped to all the nodes.
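
One way to realize “the 0-shard is shipped to all the nodes” is a Spark broadcast variable; this is a sketch under that assumption rather than the talk’s actual mechanism, and it assumes an `ngramCounts: RDD[(String, Long)]` such as the output of the counting sketch above:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    // Collect the single-word frequencies (the 0-shard) on the driver and broadcast them,
    // so every executor can look them up locally while processing its own shards.
    def broadcastZeroShard(sc: SparkContext,
                           ngramCounts: RDD[(String, Long)]): Broadcast[Map[String, Long]] = {
      val unigrams = ngramCounts
        .filter { case (ngram, _) => !ngram.contains(" ") }  // keep single words only
        .collect()
        .toMap
      sc.broadcast(unigrams)
    }
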
25. Distribution of shards (1-word sharding): some of shards 1 to (n-1) become skewed due to data from frequent phrases, e.g. “how to ..”, “do you ..”.
26. Distribution of shards (2-word sharding): shards 1 to (n-1), with a 0-shard that now has single-word frequencies and 2-word frequencies as well.
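
The `PartitionerForNgram(numShards, iterationId)` that appears in the driver loop later in the deck is not shown on the slides, so here is a hypothetical suffix-based partitioner in the spirit of this sharding scheme: n-grams no longer than the suffix length go to the 0-shard, and all other n-grams are hashed on their last `suffixLength` words so that n-grams sharing that suffix land in the same shard:

    import org.apache.spark.Partitioner

    // Hypothetical sketch; not the talk's actual PartitionerForNgram. Requires numShards > 1.
    class SuffixPartitioner(numShards: Int, suffixLength: Int) extends Partitioner {
      override def numPartitions: Int = numShards

      override def getPartition(key: Any): Int = {
        val words = key.toString.split(" ")
        if (words.length <= suffixLength) {
          0  // low-order n-grams go to the 0-shard, which is shipped to every node
        } else {
          // Hash the last `suffixLength` words: n-grams with the same suffix share a shard.
          val suffix = words.takeRight(suffixLength).mkString(" ")
          1 + math.abs(suffix.hashCode % (numShards - 1))
        }
      }
    }

    // Example: re-shard the counts with a 2-word suffix.
    //   counts.partitionBy(new SuffixPartitioner(numShards = 1024, suffixLength = 2))
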
27. Solution: progressive sharding
28. First iteration: ignore the skewed shards
29. Finding the skewed (large) shards:

    def findLargeShardIds(sc: SparkContext, threshold: Long, …..): Set[Int] = {
      // Each line of the shard-counts file is "<shardIndex>\t<count>".
      val shardSizesRDD = sc.textFile(shardCountsFile)
        .map { line =>
          val Array(indexStr, countStr) = line.split('\t')
          (indexStr.toInt, countStr.toLong)
        }
      // A shard is "large" (skewed) if its size exceeds the configured threshold.
      val largeShardIds = shardSizesRDD
        .filter { case (index, count) => count > threshold }
        .map(_._1)
        .collect()
        .toSet
      largeShardIds
    }
30. First iteration: process all the non-skewed shards (see the sketch below)
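
A sketch of how “process all the non-skewed shards” might be expressed, assuming the shard ids returned by `findLargeShardIds` on the previous slide; the helper name and signature are hypothetical:

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Split the counts into the part trained in this iteration and the leftover
    // that will be re-sharded with a longer suffix in the next iteration.
    def splitBySkew(counts: RDD[(String, Long)],
                    partitioner: Partitioner,
                    largeShardIds: Set[Int]): (RDD[(String, Long)], RDD[(String, Long)]) = {
      val isSkewed = (ngram: String) => largeShardIds.contains(partitioner.getPartition(ngram))
      val processNow = counts.filter { case (ngram, _) => !isSkewed(ngram) }
      val leftover   = counts.filter { case (ngram, _) => isSkewed(ngram) }
      (processNow, leftover)
    }
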
31. Second iteration: the effective 0-shard is now small; re-shard the leftover data using a 2-word history
32. Second iteration: discard the bigger shards
33. Second iteration: process all the non-skewed shards
34. Continue with further iterations ....
35. The driver loop for progressive sharding:

    var iterationId = 0
    do {
      val currentCounts: RDD[(String, Long)] = allCounts(iterationId - 1)
      val partitioner = new PartitionerForNgram(numShards, iterationId)
      val shardCountsFile = s"${shard_sizes}_$iterationId"

      // Compute the size of every shard under this iteration's sharding scheme.
      currentCounts
        .map(ngram => (partitioner.getPartition(ngram._1), 1L))
        .reduceByKey(_ + _)
        .saveAsTextFile(shardCountsFile)

      // Shards above the threshold are skipped now and re-sharded in the next iteration.
      largeShardIds = findLargeShardIds(sc, config.largeShardThreshold, shardCountsFile)

      trainer.trainedModel(currentCounts, component, largeShardIds)
        .saveAsObjectFile(s"${component.order}_$iterationId")

      iterationId += 1
    } while (largeShardIds.nonEmpty)
36. Performance evaluation [bar charts comparing Hive vs. Spark: reserved CPU time (days) and latency (hours)]
37. Performance evaluation [same charts, annotated]: Spark is 15x more efficient in reserved CPU time and 2.6x faster in latency
38. Upstream contributions to pipe(): [SPARK-13793] PipedRDD doesn't propagate exceptions while reading parent RDD • [SPARK-15826] PipedRDD to allow configurable char encoding • [SPARK-14542] PipeRDD should allow configurable buffer size for the stdin writer • [SPARK-14110] PipedRDD to print the command ran on non zero exit
39. Questions?
