Spark Internals - Hadoop Source Code Reading #16 in Japan

  • Pattern matching
  • NOT a modified version of Hadoop
  • Spark Internals - Hadoop Source Code Reading #16 in Japan

    1. Spark Internals
    2. Spark Code Base Size
        • spark/core/src/main/scala
        • 2012 (version 0.6.x): 20,000 lines of code
        • 2014 (branch-1.0): 50,000 lines of code
        • Other components: Spark Streaming, Bagel (graph processing library), MLlib (machine learning library), container support (Mesos, YARN, Docker, etc.), Spark SQL (Shark: Hive on Spark)
    3. Spark Core Developers
    4. IntelliJ Tips
        • Install the Scala plugin
        • Useful commands for code reading: Go to definition (Ctrl + Click), Show Usage, Navigate Class/Symbol/File, Bookmark and Show Bookmarks, Ctrl + Q (show type info), Find Action (Ctrl + Shift + A)
        • Use your favorite key bindings
    5. Scala Console (REPL)
        • $ brew install scala
    6. Scala Basics
        • object: singleton, static methods
        • Package-private scope: private[spark] is visible only from the spark package
        • Pattern matching
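The three features on this slide can be tried in a few lines. This is a minimal sketch with hypothetical names (`Utils`, `Describe`, `internalName`); an enclosing object called `spark` stands in for the real package so the snippet also works in the REPL, where package declarations are not available.

```scala
// Hypothetical names; the object `spark` stands in for the spark package.
object spark {
  // `object` defines a singleton; its methods play the role of Java static methods.
  object Utils {
    // Qualified private: callable anywhere inside `spark`, hidden outside it.
    private[spark] def internalName(id: Int): String = s"rdd_$id"
  }

  object Describe {
    // Allowed, because Describe is also enclosed by `spark`.
    def blockName(id: Int): String = Utils.internalName(id)

    // Pattern matching on a literal value, on types, and a catch-all case.
    def apply(x: Any): String = x match {
      case 0         => "zero"
      case n: Int    => s"int $n"
      case s: String => s"string of length ${s.length}"
      case _         => "other"
    }
  }
}
```

Calling `spark.Utils.internalName(1)` from outside the `spark` object is rejected by the compiler, which is the behavior `private[spark]` gives Spark's internal classes.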
    7. Scala: Case Classes
        • Case classes are immutable and serializable
        • Can be used with pattern matching
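As a small illustration of these points, the sketch below defines two message types and matches on them. The names (`Message`, `LaunchTask`, `KillTask`) are made up for this example and are not Spark classes.

```scala
// Hypothetical message types, for illustration only.
sealed trait Message
case class LaunchTask(taskId: Long, host: String) extends Message
case class KillTask(taskId: Long) extends Message

object Messages {
  // Case classes can be taken apart directly in a pattern match.
  def describe(m: Message): String = m match {
    case LaunchTask(id, host) => s"launch $id on $host"
    case KillTask(id)         => s"kill $id"
  }

  // Fields are vals, so "changing" one means building a modified copy.
  def moveTo(t: LaunchTask, newHost: String): LaunchTask = t.copy(host = newHost)

  // Every case class instance implements java.io.Serializable.
  def isSerializable(m: Message): Boolean = m.isInstanceOf[java.io.Serializable]
}
```

Because the trait is sealed, the compiler can warn when a `match` forgets one of the cases, which helps when reading message-handling code.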
    8. Scala Cookbook
        • http://xerial.org/scala-cookbook
    9. Components (diagram from https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals)
        • Your program: sc = new SparkContext; f = sc.textFile(“…”); f.filter(…).count(); ...
        • Spark client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker
        • Cluster manager
        • Spark worker: Task threads, Block manager
        • HDFS, HBase, …
    10. Scheduling Process (diagram from https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals)
        • RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(…).filter(…)
        • DAGScheduler: split graph into stages of tasks; submit each stage as ready; agnostic to operators!
        • TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks; doesn’t know about stages
        • Cluster manager
        • Worker: threads execute tasks; block manager stores and serves blocks
        • Arrow labels: DAG, TaskSet, Task, stage failed
    11. RDD
        • Reference: M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012.
        • SparkContext: contains SparkConf and the Scheduler; entry point for running jobs (runJob)
        • Dependency: input RDDs
    12. RDD.map operation
        • Map: RDD[T] -> RDD[U]
        • MappedRDD: for each element in a partition, apply function f
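The idea behind MappedRDD can be shown without Spark at all. The sketch below is a toy, single-process stand-in; `MiniRDD`, `MiniSeqRDD`, and `MiniMappedRDD` are invented names, and real RDDs also carry a SparkContext, dependencies, and partition objects that are left out here.

```scala
// Toy stand-in for an RDD: a number of partitions plus a way to compute each one.
abstract class MiniRDD[T] {
  def numPartitions: Int
  def compute(partition: Int): Iterator[T]

  // map builds a new RDD that wraps this one; nothing is computed yet.
  def map[U](f: T => U): MiniRDD[U] = new MiniMappedRDD(this, f)

  // Evaluate every partition in order and gather the results.
  def collect(): List[T] =
    (0 until numPartitions).flatMap(p => compute(p)).toList
}

// Source RDD backed by in-memory slices, one slice per partition.
class MiniSeqRDD[T](slices: Vector[Seq[T]]) extends MiniRDD[T] {
  def numPartitions: Int = slices.length
  def compute(partition: Int): Iterator[T] = slices(partition).iterator
}

// For each element of the parent's partition, apply f lazily.
class MiniMappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Iterator[U] = parent.compute(partition).map(f)
}
```

Since `compute` returns an iterator over the parent's iterator, chained maps are pipelined element by element within a partition instead of materializing an intermediate collection.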
    13. RDD Iterator
        • First, check the local cache
        • If not found, compute the RDD
        • StorageLevel
        • Off-heap: distributed memory store
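The lookup order on this slide (cache first, compute on a miss) might be sketched as follows. `ToyCache` and its `useCache` flag are assumptions made for the example; in Spark the decision depends on the RDD's StorageLevel and the values live in the BlockManager rather than in a plain map.

```scala
import scala.collection.mutable

// Toy partition cache keyed by (rddId, partition); not Spark's CacheManager.
class ToyCache {
  private val blocks = mutable.Map.empty[(Int, Int), Vector[Any]]

  // Counts how often the compute function actually ran.
  var computeCount = 0

  def getOrCompute[T](rddId: Int, partition: Int, useCache: Boolean)(
      compute: => Iterator[T]): Iterator[T] = {
    val key = (rddId, partition)
    if (!useCache) {
      // No storage level requested: always recompute.
      computeCount += 1
      compute
    } else blocks.get(key) match {
      case Some(values) =>
        // Cache hit: serve the stored values.
        values.iterator.asInstanceOf[Iterator[T]]
      case None =>
        // Cache miss: compute once, store the values, then serve them.
        computeCount += 1
        val values = compute.toVector
        blocks.update(key, values)
        values.iterator
    }
  }
}
```

The by-name parameter `compute` is only evaluated on a miss, so a second request for the same partition never touches the parent RDD.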
    14. Task
        • DAGScheduler organizes stages
        • Each stage has several tasks
        • Each task has preferred locations (host names)
    15. Task Locality
        • Preferred location to run a task
        • Process, Node, Rack
    16. Delay Scheduling
        • Reference: M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, EuroSys 2010, April 2010.
        • Try to run tasks in the following order: local, rack local, at any node
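A rough sketch of the ordering above: when a host offers a free slot, pick the pending task with the best locality, but only accept a worse locality level once the task set has waited long enough. All names, the single `waitPerLevelMs` threshold, and the `rackOf` lookup are assumptions for illustration; this is not Spark's TaskSetManager.

```scala
object DelaySchedulingSketch {
  // Locality levels, from most preferred (rank 0) to least preferred.
  sealed abstract class Locality(val rank: Int)
  case object NodeLocal extends Locality(0)
  case object RackLocal extends Locality(1)
  case object AnyNode extends Locality(2)

  case class PendingTask(id: Int, preferredHosts: Set[String])

  // How local would this task be if it ran on the offered host?
  def locality(task: PendingTask, host: String, rackOf: String => String): Locality =
    if (task.preferredHosts.contains(host)) NodeLocal
    else if (task.preferredHosts.exists(h => rackOf(h) == rackOf(host))) RackLocal
    else AnyNode

  // The longer the wait, the further the acceptable level relaxes.
  def allowedLevel(waitedMs: Long, waitPerLevelMs: Long): Locality =
    if (waitedMs < waitPerLevelMs) NodeLocal
    else if (waitedMs < 2 * waitPerLevelMs) RackLocal
    else AnyNode

  // Best task for this offer, or None if it is better to keep waiting.
  def pick(tasks: Seq[PendingTask], host: String, waitedMs: Long,
           waitPerLevelMs: Long, rackOf: String => String): Option[PendingTask] = {
    val maxRank = allowedLevel(waitedMs, waitPerLevelMs).rank
    val candidates = tasks
      .map(t => (locality(t, host, rackOf).rank, t))
      .filter(_._1 <= maxRank)
    if (candidates.isEmpty) None else Some(candidates.minBy(_._1)._2)
  }
}
```

Returning `None` is the "delay": the offer is declined in the hope that a slot on a preferred host frees up before the threshold passes.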
    17. Serializing Tasks
        • TaskDescription
        • ResultTask
        • RDD
        • Function
        • Stage ID, outputID
        • func
    18. TaskScheduler: submitTasks
        • Serialize the task request
        • Then, send task requests to ExecutorBackend
        • ExecutorBackend handles task requests (Akka Actor)
    19. ClosureSerializer
        • Clean
        • Function in Scala: Closure
        • Closure: free variable + function body (class)
        • class A$apply$1 extends Function1[T, U] { val $outer: A$outer; def apply(input: T): U = … }
        • class A$outer { val N = 100; val M = (large object) }
        • Fill M with null, then serialize the closure.
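To make the cleaning step concrete, the sketch below writes the two compiler-generated classes from this slide by hand and "cleans" the closure by giving it a copy of the outer object with the unused field nulled. The names `Outer`, `AddN`, and `ToyClosureCleaner` are hypothetical, and the fact that only `n` is used is hard-coded here, whereas the real cleaner has to discover it by inspecting bytecode.

```scala
// Hand-written stand-in for the enclosing object; m plays the "large object".
class Outer(val n: Int, val m: Array[Byte])

// Hand-written stand-in for the closure class: it holds a reference to its outer.
class AddN(val outer: Outer) extends (Int => Int) {
  // Only outer.n is accessed; outer.m would still be dragged along when serializing.
  def apply(x: Int): Int = x + outer.n
}

object ToyClosureCleaner {
  // Build a closure over a slimmed-down copy of the outer object:
  // keep the used field n, replace the unused field m with null.
  // The caller's original Outer instance is left untouched.
  def clean(f: AddN): AddN = new AddN(new Outer(f.outer.n, null))
}
```

The cleaned closure computes the same results, but anything that serializes it no longer has to ship the large array.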
    20. Traversing Byte Codes
        • A closure is a class in Scala
        • Traverse outer variable accesses
        • Using the ASM4 library
    21. JVM Bytecode Instructions
    22. Cache/Block Manager
        • CacheManager: stores computed RDDs to BlockManager
        • BlockManager: write-once storage
        • Manages block data according to StorageLevel
        • Serializes/deserializes block data
        • Compression
        • Faster decompression
    23. Storing Block Data
        • IteratorValues: raw objects
        • ArrayBufferValues: Array[Byte]
        • ByteBufferValues: ByteBuffer
    24. ConnectionManager
        • Asynchronous data I/O server
        • Uses its own protocol
        • Sends and receives block data (BufferMessage)
    25. RDD.compute
        • Local Collection
    26. SparkContext - RunJob
        • RDD -> DAG Scheduler
    27. SparkConf
        • Key-value configuration
        • Master address, jar file address, environment variables, JAVA_OPTS, etc.
    28. SparkEnv
        • Holds Spark components
    29. SparkContext.makeRDD
        • Converts a local Seq[T] into RDD[T]
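Turning a local sequence into an RDD comes down to cutting it into one slice per partition. The sketch below shows one way to compute even slice boundaries; `SliceSketch` is a hypothetical helper written for this note, not the code in Spark.

```scala
object SliceSketch {
  // Cut seq into numSlices contiguous pieces whose sizes differ by at most one.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    val n = seq.length
    (0 until numSlices).map { i =>
      // Integer boundaries i*n/numSlices; Long arithmetic avoids overflow.
      val start = ((i.toLong * n) / numSlices).toInt
      val end = (((i + 1).toLong * n) / numSlices).toInt
      seq.slice(start, end)
    }
  }
}
```

When there are more slices than elements, some slices are simply empty, which corresponds to empty partitions.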
    30. HadoopRDD
        • Reads HDFS data as (Key, Value) records
    31. Mesos Scheduler – Fine Grained
        • Mesos: offers slave resources
        • Scheduler: determines resource usage
        • Task lists are stored in TaskScheduler
        • Launches a JVM for each task
    32. Mesos Fine-Grained Executor
    33. Mesos Fine-Grained Executor
        • spark-executor: shell script for launching the JVM
    34. Coarse-grained Mesos Scheduler
        • Launches a Spark executor on the Mesos slave
        • Runs CoarseGrainedExecutorBackend
    35. Coarse-grained ExecutorBackend
        • Akka Actor
        • Registers itself to the master
        • Initializes the executor after the response
    36. Cleanup RDDs
        • ReferenceQueue: notified when weakly referenced objects are garbage collected
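The ReferenceQueue mechanism comes from the JDK (`java.lang.ref`) and can be sketched on its own. `CleanupRef` and `CleanupSketch` below are hypothetical names; the sketch only shows how a weak reference that remembers an id ends up on the queue, not what Spark does after that.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

// A weak reference that remembers the id of the object it tracked, so cleanup
// can still be done after the object itself has been garbage collected.
class CleanupRef(referent: AnyRef, val rddId: Int, queue: ReferenceQueue[AnyRef])
    extends WeakReference[AnyRef](referent, queue)

object CleanupSketch {
  // Wait up to timeoutMs for the next enqueued reference and return its id.
  // queue.remove returns null on timeout; a timeout of 0 would block forever.
  def awaitCleanup(queue: ReferenceQueue[AnyRef], timeoutMs: Long): Option[Int] =
    Option(queue.remove(timeoutMs)).collect { case r: CleanupRef => r.rddId }
}
```

A background thread would typically call `awaitCleanup` in a loop. The `CleanupRef` objects themselves must be kept strongly reachable (for example in a set), otherwise they can be collected before they are ever enqueued.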
    37. Copyright ©2014 Treasure Data. All Rights Reserved. We are hiring!
