
Spark Internals - Hadoop Source Code Reading #16 in Japan



  1. Spark Internals
  2. Spark Code Base Size
     - spark/core/src/main/scala
       - 2012 (version 0.6.x): 20,000 lines of code
       - 2014 (branch-1.0): 50,000 lines of code
     - Other components:
       - Spark Streaming
       - Bagel (graph processing library)
       - MLlib (machine learning library)
       - Container support: Mesos, YARN, Docker, etc.
       - Spark SQL (Shark: Hive on Spark)
  3. Spark Core Developers
  4. IntelliJ Tips
     - Install the Scala plugin
     - Useful commands for code reading:
       - Go to Definition (Ctrl + Click)
       - Show Usages
       - Navigate Class/Symbol/File
       - Bookmark, Show Bookmarks
       - Show Type Info (Ctrl + Q)
       - Find Action (Ctrl + Shift + A)
     - Use your favorite key bindings
  5. Scala Console (REPL)
     - $ brew install scala
  6. Scala Basics
     - object: a singleton; its methods act like static methods
     - Package-private scope: private[spark] is visible only from the spark package
     - Pattern matching
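The three features above can be sketched in a few lines of plain Scala; the names (`Registry`, `Demo`) are illustrative, not Spark's:

```scala
package spark {
  // `object` defines a singleton whose methods act like static methods.
  private[spark] object Registry {          // package-private: visible only inside `spark`
    def describe(level: Any): String = level match {  // pattern matching
      case "MEMORY"         => "in-memory"
      case ("DISK", n: Int) => s"on disk x$n"         // matches a tuple, binds n
      case _                => "unknown"
    }
  }

  // Public entry point, since Registry itself is invisible outside the package.
  object Demo {
    def run(): Seq[String] =
      Seq(Registry.describe("MEMORY"),
          Registry.describe(("DISK", 2)),
          Registry.describe(3.14))
  }
}
```

Referencing `spark.Registry` from outside the `spark` package would be a compile error, which is exactly how Spark hides internals from user code.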
  7. Scala: Case Classes
     - Immutable and serializable
     - Can be used with pattern matching
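A minimal sketch of those two properties (the names are illustrative, loosely inspired by Spark's StorageLevel, not its real API):

```scala
// Case classes are immutable, serializable by default, and pattern-matchable.
sealed trait Level extends Serializable
case object MemoryOnly extends Level
case class DiskOnly(replication: Int) extends Level   // immutable field

object CaseClassDemo {
  def describe(level: Level): String = level match {
    case MemoryOnly  => "memory"
    case DiskOnly(n) => s"disk, $n replica(s)"        // field bound by the pattern
  }

  // `copy` returns a new instance instead of mutating the old one.
  def bumpReplication(d: DiskOnly): DiskOnly =
    d.copy(replication = d.replication + 1)
}
```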
  8. Scala Cookbook
     - http://xerial.org/scala-cookbook
  9. Components
     [Architecture diagram: your program (sc = new SparkContext; f = sc.textFile("…"); f.filter(…).count(); …) runs in the Spark client (app master), which holds the RDD graph, scheduler, block tracker, and shuffle tracker; Spark workers run task threads with a block manager over HDFS, HBase, …; a cluster manager coordinates them.]
     https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
  10. Scheduling Process
     - RDD Objects: build the operator DAG (e.g. rdd1.join(rdd2).groupBy(…).filter(…))
     - DAGScheduler: splits the graph into stages of tasks; submits each stage as ready; agnostic to operators!
     - TaskScheduler: launches the tasks of a TaskSet via the cluster manager; retries failed or straggling tasks; doesn't know about stages
     - Worker: executes tasks in threads; the block manager stores and serves blocks
     https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
  11. RDD
     - Reference: M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012, April 2012.
     - SparkContext: contains the SparkConf and scheduler; entry point for running jobs (runJob)
     - Dependency: input RDDs
  12. RDD.map Operation
     - map: RDD[T] -> RDD[U]
     - MappedRDD: for each element in a partition, apply the function f
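The shape of MappedRDD can be sketched with a toy RDD over local collections (these are not Spark's real classes, just the same idea): the mapped RDD keeps a reference to its parent plus f, and applies f element-by-element only when a partition's iterator is consumed.

```scala
// A toy, driver-local sketch of RDD.map's laziness.
trait ToyRDD[T] {
  def numPartitions: Int
  def compute(split: Int): Iterator[T]          // one partition's data

  def map[U](f: T => U): ToyRDD[U] = {
    val parent = this
    new ToyRDD[U] {                             // the "MappedRDD"
      def numPartitions: Int = parent.numPartitions
      def compute(split: Int): Iterator[U] =
        parent.compute(split).map(f)            // apply f per element, lazily
    }
  }

  def collect(): Seq[T] = (0 until numPartitions).flatMap(compute)
}

object ToyRDD {
  // Split a local Seq into contiguous partitions.
  def fromSeq[T](data: Seq[T], slices: Int): ToyRDD[T] = new ToyRDD[T] {
    private val groups = data.grouped(math.max(1, data.size / slices)).toVector
    def numPartitions: Int = groups.size
    def compute(split: Int): Iterator[T] = groups(split).iterator
  }
}
```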
  13. RDD Iterator
     - First, check the local cache
     - If not found, compute the RDD
     - StorageLevel
       - Off-heap: distributed memory store
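The check-cache-then-compute flow can be sketched as follows (a toy stand-in for the CacheManager/BlockManager pair, not Spark's real classes):

```scala
import scala.collection.mutable

// A toy sketch of RDD.iterator: consult the local cache first;
// on a miss, compute the partition once and store the result.
class ToyCacheManager {
  private val cache = mutable.Map.empty[(Int, Int), Seq[Any]] // (rddId, split) -> block
  var computes = 0                                            // counts cache misses

  def getOrCompute[T](rddId: Int, split: Int)(compute: => Iterator[T]): Iterator[T] =
    cache.get((rddId, split)) match {
      case Some(block) =>
        block.iterator.asInstanceOf[Iterator[T]]              // cache hit
      case None =>
        computes += 1
        val block = compute.toList                            // materialize once
        cache((rddId, split)) = block
        block.iterator
    }
}
```

The second request for the same (rddId, split) never re-runs the computation, which is the point of caching a computed partition.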
  14. Task
     - DAGScheduler organizes stages
     - Each stage has several tasks
     - Each task has preferred locations (host names)
  15. Task Locality
     - Preferred location to run a task: process, node, or rack
  16. Delay Scheduling
     - Reference: M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys 2010, April 2010.
     - Try to run tasks in the following order:
       - Local
       - Rack local
       - At any node
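The idea from the paper can be sketched in a few lines: stay at the most local level until a task has waited past a threshold, then progressively relax. The 3-second wait here is an illustrative default, not Spark's exact configuration, and the object names are mine:

```scala
// A toy sketch of delay scheduling's locality relaxation.
object DelayScheduling {
  sealed trait Locality
  case object Local extends Locality
  case object RackLocal extends Locality
  case object AnyHost extends Locality

  // Which locality levels may a task waiting `waitedMs` milliseconds use?
  def allowedLevels(waitedMs: Long, localityWaitMs: Long = 3000): Seq[Locality] =
    if (waitedMs < localityWaitMs) Seq(Local)                      // insist on local
    else if (waitedMs < 2 * localityWaitMs) Seq(Local, RackLocal)  // relax to rack
    else Seq(Local, RackLocal, AnyHost)                            // run anywhere
}
```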
  17. Serializing Tasks
     - TaskDescription
     - ResultTask: RDD, function (func), stage ID, output ID
  18. TaskScheduler: submitTasks
     - Serializes the task request
     - Then sends task requests to the ExecutorBackend
     - The ExecutorBackend handles task requests (Akka actor)
  19. ClosureSerializer
     - Clean
     - A function in Scala is a closure
     - Closure: free variables + function body (a class)

       class A$apply$1 extends Function1[T, U] {
         val $outer: A$outer
         def apply(input: T): U = …
       }

       class A$outer {
         val N = 100
         val M = (large object)
       }

     - Fill M with null, then serialize the closure
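Why nulling unused outer fields matters can be demonstrated without Spark at all (a toy demo; `Outer`, `dirty`, and `clean` are my names): a closure defined inside a class that touches any member captures the whole enclosing instance via `$outer`, dragging along large fields it never uses. Copying just the needed value into a local first keeps the serialized closure small, which is the effect Spark's closure cleaner achieves automatically.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class Outer extends Serializable {
  val n = 100
  val m: Array[Byte] = new Array[Byte](1 << 20)  // 1 MB, never used by the closures

  // References member `n`, so the anonymous class captures $outer (and thus m).
  def dirty: (Int => Int) with Serializable =
    new (Int => Int) with Serializable { def apply(x: Int): Int = x + n }

  // Copy the needed value to a local: only localN is captured, no $outer.
  def clean: (Int => Int) with Serializable = {
    val localN = n
    new (Int => Int) with Serializable { def apply(x: Int): Int = x + localN }
  }
}

object ClosureDemo {
  def serializedSize(o: AnyRef): Int = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(o); oos.close()
    bos.size
  }
}
```

Serializing `dirty` ships well over a megabyte; `clean` is a few hundred bytes.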
  20. Traversing Byte Codes
     - A closure is a class in Scala
     - Traverse accesses to outer variables
     - Uses the ASM4 library
  21. JVM Bytecode Instructions
  22. Cache/Block Manager
     - CacheManager: stores computed RDDs in the BlockManager
     - BlockManager
       - Write-once storage
       - Manages block data according to StorageLevel
       - Serializes/deserializes block data
       - Compression (faster decompression)
  23. Storing Block Data
     - IteratorValues: raw objects
     - ArrayBufferValues: Array[Byte]
     - ByteBufferValues: ByteBuffer
  24. ConnectionManager
     - Asynchronous data I/O server
     - Uses its own protocol
     - Sends and receives block data (BufferMessage)
  25. RDD.compute
     - Local collection
  26. SparkContext - runJob
     - RDD -> DAGScheduler
  27. SparkConf
     - Key-value configuration
     - Master address, jar file address, environment variables, JAVA_OPTS, etc.
  28. SparkEnv
     - Holds the Spark components
  29. SparkContext.makeRDD
     - Converts a local Seq[T] into an RDD[T]
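The driver-side slicing that makeRDD performs can be sketched without Spark (an illustrative stand-in for the slicing done by ParallelCollectionRDD, not its real code): cut the local Seq[T] into numSlices contiguous, evenly sized ranges, one per partition.

```scala
// A toy sketch of slicing a local collection into partitions.
object SliceDemo {
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices > 0, "need at least one slice")
    (0 until numSlices).map { i =>
      val start = (i * seq.length) / numSlices       // contiguous, near-even ranges
      val end   = ((i + 1) * seq.length) / numSlices
      seq.slice(start, end)
    }
  }
}
```

Each slice becomes one partition, so concatenating the slices in order reproduces the original sequence.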
  30. HadoopRDD
     - Reads HDFS data as (key, value) records
  31. Mesos Scheduler - Fine-Grained
     - Mesos offers slave resources
     - The scheduler determines resource usage
     - Task lists are stored in the TaskScheduler
     - Launches a JVM for each task
  32. Mesos Fine-Grained Executor
  33. Mesos Fine-Grained Executor
     - spark-executor: shell script for launching the JVM
  34. Coarse-Grained Mesos Scheduler
     - Launches a Spark executor on each Mesos slave
     - Runs CoarseGrainedExecutorBackend
  35. Coarse-Grained ExecutorBackend
     - Akka actor
     - Registers itself with the master
     - Initializes the executor after the response
  36. Cleanup RDDs
     - ReferenceQueue: notified when weakly referenced objects are garbage collected
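The mechanism is plain java.lang.ref machinery, so it can be demonstrated directly (a toy demo of the ReferenceQueue idea; Spark applies the same pattern to notice when RDDs become unreachable):

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

object RefQueueDemo {
  // Returns true once the weakly referenced payload has been collected
  // and its WeakReference delivered to the queue.
  def demo(): Boolean = {
    val queue = new ReferenceQueue[AnyRef]()
    var payload: AnyRef = new Array[Byte](1 << 20)
    val ref = new WeakReference[AnyRef](payload, queue)  // register with the queue
    payload = null                                       // drop the only strong reference

    var enqueued: AnyRef = null
    var attempts = 0
    while (enqueued == null && attempts < 50) {          // GC timing is JVM-dependent
      System.gc()
      enqueued = queue.remove(100)                       // wait up to 100 ms
      attempts += 1
    }
    enqueued eq ref
  }
}
```

The queue delivers the *reference* object, not the collected payload; the cleaner then uses the identity of that reference to look up which bookkeeping (cached blocks, shuffle files, …) to release.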
  37. Copyright ©2014 Treasure Data. All Rights Reserved. WE ARE HIRING!
