Apache Spark Architecture

  1. Spark Architecture – A. Grishchenko (Pivotal Confidential – Internal Use Only)
  2. About me Enterprise Architect @ Pivotal  7 years in data processing  5 years with MPP  4 years with Hadoop  Spark contributor  http://0x0fff.com
  3. Outline  Spark Motivation  Spark Pillars  Spark Architecture  Spark Shuffle  Spark DataFrame
  4. Outline  Spark Motivation  Spark Pillars  Spark Architecture  Spark Shuffle  Spark DataFrame
  5. Spark Motivation  Difficulty of programming directly in Hadoop MapReduce
  6. Spark Motivation  Difficulty of programming directly in Hadoop MapReduce  Performance bottlenecks, or batch not fitting the use cases
  7. Spark Motivation  Difficulty of programming directly in Hadoop MapReduce  Performance bottlenecks, or batch not fitting the use cases  Better support for iterative jobs typical for machine learning
  8. Difficulty of Programming in MR Word Count implementations  Hadoop MR – 61 lines in Java  Spark – 1 line in the interactive shell: sc.textFile('...').flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).saveAsTextFile('...')
  9. Performance Bottlenecks How many times the data is put to the HDD during a single MapReduce Job?  One  Two  Three  More
  10. Performance Bottlenecks How many times is the data put to the HDD during a single MapReduce job? Answer: three or more, as the next slides explain.
  11. Performance Bottlenecks Consider Hive as main SQL tool
  12. Performance Bottlenecks Consider Hive as main SQL tool  Typical Hive query is translated to 3-5 MR jobs
  13. Performance Bottlenecks Consider Hive as main SQL tool  Typical Hive query is translated to 3-5 MR jobs  Each MR job would put data to the HDD 3+ times
  14. Performance Bottlenecks Consider Hive as main SQL tool  Typical Hive query is translated to 3-5 MR jobs  Each MR job would put data to the HDD 3+ times  Each put to the HDD is a write followed by a read
  15. Performance Bottlenecks Consider Hive as main SQL tool  Typical Hive query is translated to 3-5 MR jobs  Each MR job would put data to the HDD 3+ times  Each put to the HDD is a write followed by a read  Sums up to 18-30 scans of the data during a single Hive query (3-5 jobs × 3 puts × write+read)
  16. Performance Bottlenecks Spark offers you  Lazy Computations – Optimize the job before executing
  17. Performance Bottlenecks Spark offers you  Lazy Computations – Optimize the job before executing  In-memory data caching – Scan HDD only once, then scan your RAM
  18. Performance Bottlenecks Spark offers you  Lazy Computations – Optimize the job before executing  In-memory data caching – Scan HDD only once, then scan your RAM  Efficient pipelining – Avoids putting data on the HDD unless it is really necessary (see the caching sketch below)
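A minimal PySpark sketch of the caching point above, assuming an existing SparkContext `sc` and an illustrative HDFS path: the first action scans HDFS once and fills the cache, the second is served from RAM.

    lines = sc.textFile('hdfs:///data/events.log')     # illustrative path
    words = lines.flatMap(lambda line: line.split())   # lazy: only lineage metadata so far
    words.cache()                                       # mark for in-memory caching
    print(words.count())                                # 1st action: reads HDFS, populates the cache
    print(words.distinct().count())                     # 2nd action: reads from RAM, not from the HDD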
  19. Outline  Spark Motivation  Spark Pillars  Spark Architecture  Spark Shuffle  Spark DataFrame
  20. Spark Pillars Two main abstractions of Spark
  21. Spark Pillars Two main abstractions of Spark  RDD – Resilient Distributed Dataset
  22. Spark Pillars Two main abstractions of Spark  RDD – Resilient Distributed Dataset  DAG – Directed Acyclic Graph
  23. RDD  Simple view – RDD is a collection of data items split into partitions and stored in memory on the worker nodes of the cluster
  24. RDD  Simple view – RDD is a collection of data items split into partitions and stored in memory on the worker nodes of the cluster  Complex view – RDD is an interface for data transformation
  25. RDD  Simple view – RDD is a collection of data items split into partitions and stored in memory on the worker nodes of the cluster  Complex view – RDD is an interface for data transformation – RDD refers to the data stored either in a persistent store (HDFS, Cassandra, HBase, etc.), in cache (memory, memory+disk, disk only, etc.), or in another RDD
  26. RDD  Complex view (cont’d) – Partitions are recomputed on failure or cache eviction
  27. RDD  Complex view (cont’d) – Partitions are recomputed on failure or cache eviction – Metadata stored for interface ▪ Partitions – set of data splits associated with this RDD
  28. RDD  Complex view (cont’d) – Partitions are recomputed on failure or cache eviction – Metadata stored for interface ▪ Partitions – set of data splits associated with this RDD ▪ Dependencies – list of parent RDDs involved in computation
  29. RDD  Complex view (cont’d) – Partitions are recomputed on failure or cache eviction – Metadata stored for interface ▪ Partitions – set of data splits associated with this RDD ▪ Dependencies – list of parent RDDs involved in computation ▪ Compute – function to compute partition of the RDD given the parent partitions from the Dependencies
  30. RDD  Complex view (cont’d) – Partitions are recomputed on failure or cache eviction – Metadata stored for interface ▪ Partitions – set of data splits associated with this RDD ▪ Dependencies – list of parent RDDs involved in computation ▪ Compute – function to compute partition of the RDD given the parent partitions from the Dependencies ▪ Preferred Locations – where is the best place to put computations on this partition (data locality)
  31. RDD  Complex view (cont’d) – Partitions are recomputed on failure or cache eviction – Metadata stored for interface ▪ Partitions – set of data splits associated with this RDD ▪ Dependencies – list of parent RDDs involved in computation ▪ Compute – function to compute partition of the RDD given the parent partitions from the Dependencies ▪ Preferred Locations – where is the best place to put computations on this partition (data locality) ▪ Partitioner – how the data is split into partitions
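A small hedged sketch of the parts of this interface that are visible from PySpark (the data and partition count are made up): partitions and the partitioner can be inspected directly, while compute and preferred locations stay internal.

    pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)], 3)   # 3 partitions
    print(pairs.getNumPartitions())    # Partitions: number of splits backing this RDD
    print(pairs.partitioner)           # Partitioner: None, nothing has assigned one yet
    reduced = pairs.reduceByKey(lambda x, y: x + y)
    print(reduced.getNumPartitions())  # the shuffled child's partition count (parent's, unless configured otherwise)
    print(reduced.partitioner)         # reduceByKey installs a partitioner based on key hashing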
  32. RDD  RDD is the main and only tool for data manipulation in Spark  Two classes of operations – Transformations – Actions
  33. RDD Lazy computations model – Transformations cause only metadata changes; the actual work happens when an action is called (see the sketch below)
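A minimal sketch of that laziness, with made-up data: the two transformations only record lineage, and nothing touches the cluster until an action runs.

    nums = sc.parallelize(range(1000000))
    squares = nums.map(lambda x: x * x)            # transformation: returns a new RDD, no work done
    evens = squares.filter(lambda x: x % 2 == 0)   # transformation: still only metadata
    print(evens.count())                           # action: the whole chain executes now
    print(evens.take(5))                           # another action: recomputes the chain unless cached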
  34. DAG Directed Acyclic Graph – sequence of computations performed on data
  35. DAG Directed Acyclic Graph – sequence of computations performed on data  Node – RDD partition
  36. DAG Directed Acyclic Graph – sequence of computations performed on data  Node – RDD partition  Edge – transformation on top of data
  37. DAG Directed Acyclic Graph – sequence of computations performed on data  Node – RDD partition  Edge – transformation on top of data  Acyclic – the graph cannot return to an older partition
  38. DAG Directed Acyclic Graph – sequence of computations performed on data  Node – RDD partition  Edge – transformation on top of data  Acyclic – the graph cannot return to an older partition  Directed – a transformation moves a data partition from one state to another (from A to B)
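The DAG behind a job can be inspected from the shell with toDebugString(); a sketch using the WordCount chain (the path is illustrative), where the ShuffledRDD in the output marks the stage boundary introduced by reduceByKey.

    wc = (sc.textFile('hdfs:///data/readme.txt')        # illustrative path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
    print(wc.toDebugString())    # prints the lineage: parent RDDs, indented per dependency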
  39. DAG WordCount example
  40. DAG WordCount example (diagram): HDFS input splits become RDD partitions; sc.textFile('hdfs://…') → flatMap → map → reduceByKey → foreach, each step producing a new RDD
  41. Outline  Spark Motivation  Spark Pillars  Spark Architecture  Spark Shuffle  Spark DataFrames
  42.–45. Spark Cluster (diagram, built up over four slides): a Driver Node runs the Driver with its Spark Context; each Worker Node runs one or more Executors, and each Executor holds a Cache and runs multiple Tasks
  46. Spark Cluster  Driver – Entry point of the Spark Shell (Scala, Python, R)
  47. Spark Cluster  Driver – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext is created
  48. Spark Cluster  Driver – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext is created – Translates RDD into the execution graph
  49. Spark Cluster  Driver – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext is created – Translates RDD into the execution graph – Splits graph into stages
  50. Spark Cluster  Driver – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext is created – Translates RDD into the execution graph – Splits graph into stages – Schedules tasks and controls their execution
  51. Spark Cluster  Driver – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext is created – Translates RDD into the execution graph – Splits graph into stages – Schedules tasks and controls their execution – Stores metadata about all the RDDs and their partitions
  52. Spark Cluster  Driver – Entry point of the Spark Shell (Scala, Python, R) – The place where SparkContext is created – Translates RDD into the execution graph – Splits graph into stages – Schedules tasks and controls their execution – Stores metadata about all the RDDs and their partitions – Brings up the Spark WebUI with job information (see the setup sketch below)
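A minimal sketch of driver-side setup (the master URL and resource sizes are placeholders): this is the code that runs on the Driver Node, creates the SparkContext and thereby brings up the WebUI.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName('arch-demo')                 # name shown in the Spark WebUI
            .setMaster('spark://master:7077')        # placeholder cluster URL
            .set('spark.executor.memory', '4g')      # heap for each executor's cache and tasks
            .set('spark.executor.cores', '2'))       # concurrent tasks per executor
    sc = SparkContext(conf=conf)                     # driver: builds DAGs, splits stages, schedules tasks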
  53. Spark Cluster  Executor – Stores the data in cache in JVM heap or on HDDs
  54. Spark Cluster  Executor – Stores the data in cache in JVM heap or on HDDs – Reads data from external sources
  55. Spark Cluster  Executor – Stores the data in cache in JVM heap or on HDDs – Reads data from external sources – Writes data to external sources
  56. Spark Cluster  Executor – Stores the data in cache in JVM heap or on HDDs – Reads data from external sources – Writes data to external sources – Performs all the data processing
  57. Executor Memory (diagram)
  58. Spark Cluster – Detailed (diagram)
  59. Spark Cluster – PySpark (diagram)
  60. Application Decomposition  Application – Single instance of SparkContext that stores some data processing logic and can schedule series of jobs, sequentially or in parallel (SparkContext is thread-safe)
  61. Application Decomposition  Application – Single instance of SparkContext that stores some data processing logic and can schedule series of jobs, sequentially or in parallel (SparkContext is thread-safe)  Job – Complete set of transformations on RDD that finishes with action or data saving, triggered by the driver application
  62. Application Decomposition  Stage – Set of transformations that can be pipelined and executed by a single independent worker. Usually it is all the transformations between "read", "shuffle", "action", "save"
  63. Application Decomposition  Stage – Set of transformations that can be pipelined and executed by a single independent worker. Usually it is all the transformations between "read", "shuffle", "action", "save"  Task – Execution of a stage on a single data partition. The basic unit of scheduling (see the sketch below)
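A short sketch mapping these terms onto code (paths are placeholders): one application, two jobs, the second of which is split into two stages by the shuffle in reduceByKey, with one task per partition in each stage.

    # One application = this SparkContext and everything scheduled through it.
    lines = sc.textFile('hdfs:///data/readme.txt')           # placeholder path

    print(lines.count())                                     # Job 1: a single stage, one task per input split

    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))         # shuffle => stage boundary
    counts.saveAsTextFile('hdfs:///out/wordcount')           # Job 2: two stages, one task per partition each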
  64.–69. WordCount example (diagram, built up over six slides): HDFS input splits feed sc.textFile('hdfs://…'), followed by flatMap, map, reduceByKey and foreach; textFile, flatMap and map are pipelined into Stage 1 (Tasks 1-4, one per partition), the shuffle introduced by reduceByKey starts Stage 2 (Tasks 5-8), which is again pipelined and ends with foreach
  70. Persistence in Spark – persistence levels:
  – MEMORY_ONLY – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
  – MEMORY_AND_DISK – Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
  – MEMORY_ONLY_SER – Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
  – MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
  – DISK_ONLY – Store the RDD partitions only on disk.
  – MEMORY_ONLY_2, DISK_ONLY_2, etc. – Same as the levels above, but replicate each partition on two cluster nodes.
  71. Persistence in Spark  Spark considers memory as a cache with LRU eviction rules  If "Disk" is involved, data is evicted to disks
    from pyspark import StorageLevel
    rdd = sc.parallelize(range(1000))
    rdd.cache().count()
    rdd.unpersist()    # a storage level cannot be changed while one is still assigned
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER).count()
    rdd.unpersist()
  72. Outline  Spark Motivation  Spark Pillars  Spark Architecture  Spark Shuffle  Spark DataFrame
  73. Shuffles in Spark  Hash Shuffle – default prior to 1.2.0
  74. Shuffles in Spark  Hash Shuffle – default prior to 1.2.0  Sort Shuffle – default now
  75. Shuffles in Spark  Hash Shuffle – default prior to 1.2.0  Sort Shuffle – the default since 1.2.0  Tungsten Sort – the new optimized one (a configuration sketch follows)
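In the Spark 1.x line the implementation can be chosen per application with spark.shuffle.manager; a hedged sketch of the relevant settings (the app name is illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName('shuffle-demo')
            .set('spark.shuffle.manager', 'sort')             # 'hash', 'sort' or 'tungsten-sort'
            .set('spark.shuffle.consolidateFiles', 'true'))   # only meaningful for the hash shuffle
    sc = SparkContext(conf=conf)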
  76. Hash Shuffle
  77.–80. Hash Shuffle (diagram, built up over four slides): each executor runs spark.executor.cores / spark.task.cpus "map" tasks concurrently; every "map" task writes a separate output file per "reducer" into spark.local.dir, so an executor ends up with (number of "map" tasks it executed × number of "reducers") shuffle files
  81. Hash Shuffle with Consolidation
  82.–88. Hash Shuffle with Consolidation (diagram, built up over seven slides): the executor keeps one group of output files per concurrent "map" task slot (spark.executor.cores / spark.task.cpus of them), one file per "reducer" in spark.local.dir; later "map" tasks reuse the files of the slot they run in, so the executor produces only (concurrent task slots × number of "reducers") shuffle files
  89. Sort Shuffle
  90.–97. Sort Shuffle (diagram, built up over eight slides): cached RDD partitions live in spark.storage.[safetyFraction * memoryFraction]; each "map" task (spark.executor.cores / spark.task.cpus run concurrently) accumulates its output in an AppendOnlyMap inside spark.shuffle.[safetyFraction * memoryFraction], performing "sort & spill" to the local directory as it fills up, each spill producing an output file plus an index; "reduce" tasks then read their data by merging the spilled files with a MinHeap merge
  98. Tungsten Sort Shuffle
  99.–108. Tungsten Sort Shuffle (diagram, built up over ten slides): each "map" task keeps its shuffle data serialized in a LinkedList<MemoryBlock> inside spark.shuffle.[safetyFraction * memoryFraction], together with a long[] array of data pointers and partition IDs; "sort & spill" orders the pointer array only and writes output files to spark.local.dir containing the serialized partitions plus an index, and the per-task spill files are finally merged into a single output file
  109. Outline  Spark Motivation  Spark Pillars  Spark Architecture  Spark Shuffle  Spark DataFrame
  110. DataFrame Idea (diagram)
  111. DataFrame Implementation  Interface – DataFrame is an RDD with schema – field names, field data types and statistics – Unified transformation interface in all languages, all the transformations are passed to JVM – Can be accessed as RDD, in this case transformed to the RDD of Row objects
  112. DataFrame Implementation  Internals – Internally it is the same RDD – Data is stored in a row-columnar format, the row chunk size is set by spark.sql.inMemoryColumnarStorage.batchSize – Each column in each partition stores min-max values for partition pruning – Allows a better compression ratio than a standard RDD – Delivers faster performance for small subsets of columns (see the sketch below)
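A small sketch of the interface and internals described above, assuming a Spark 1.x SQLContext (the column names and batch size are illustrative):

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)
    sqlContext.setConf('spark.sql.inMemoryColumnarStorage.batchSize', '10000')  # rows per column chunk

    df = sqlContext.createDataFrame([Row(word='spark', cnt=3), Row(word='rdd', cnt=5)])
    df.printSchema()              # field names and data types of the schema
    df.cache()                    # cached in the row-columnar format described above
    df.select('word').show()      # scanning a small subset of columns is where it pays off
    rows = df.rdd                 # back to an RDD of Row objects
    print(rows.first())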
  113. Questions? (Pivotal Confidential – Internal Use Only)
  114. BUILT FOR THE SPEED OF BUSINESS

Editor's Notes

  1. Speaker notes – full WordCount example from the demo:
    def printfunc(x):
        print('Word "%s" occurs %d times' % (x[0], x[1]))
    infile = sc.textFile('hdfs://sparkdemo:8020/sparkdemo/textfiles/README.md', 4)
    rdd1 = infile.flatMap(lambda x: x.split())
    rdd2 = rdd1.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
    print(rdd2.toDebugString())
    rdd2.foreach(printfunc)