Spark Computing Model
wangxing
MapReduce
Spark
“Apache Spark is a fast and general-purpose cluster
computing system. It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs. It also
supports a rich set of higher-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing, and Spark Streaming.”
Unified Platform
Resilient Distributed Dataset
RDD is an immutable and partitioned collection:
● Resilient:
○ automatically recover from node failures
● Distributed:
○ partitioned across the nodes of the cluster so it can be operated on in
parallel
● Dataset:
○ RDDs can be created from a file in the Hadoop file system, or from an
existing Scala collection
Create RDD
“There are two ways to create RDDs: parallelizing an existing collection in your
driver program, or referencing a dataset in an external storage system, such as
a shared filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.”
1. var mydata = sc.parallelize(Array(1, 2, 3, 4, 5))
2. var mydata = sc.makeRDD(1 to 100, 2)
3. var mydata = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg:8020/tmp/1.txt")
Operate RDD
“Two types of operations: transformations, which create a new dataset (RDD)
from an existing one, and actions, which return a value to the driver program
after running a computation on the dataset.”
“All transformations in Spark are lazy, in that they do not compute their results
right away. Instead, they just remember the transformations applied to some
base dataset. The transformations are only computed when an action requires
a result to be returned to the driver program.”
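A minimal spark-shell sketch of this laziness (the data here is made up, not from the deck): the map below only records lineage; work happens when the action runs.

val nums = sc.parallelize(1 to 1000000, 4)
// Transformation: nothing is computed yet, only the lineage is recorded.
val squares = nums.map(x => x.toLong * x)
// Action: triggers a job that evaluates the map above and returns a value to the driver.
val total = squares.reduce(_ + _)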
Other Control
● persist / cache / unpersist:
○ stores the partitions of the RDD in memory and/or on disk
○ can be stored using a different storage level
○ allows future actions to be much faster; a key tool for iterative
algorithms
● checkpoint:
○ saves the RDD to disk and forgets the lineage of the RDD completely,
which allows long lineages to be truncated (a sketch of both follows)
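A hedged sketch of these controls in the spark-shell; the paths are placeholders, not values from the deck.

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///tmp/1.txt")            // placeholder input path
val parsed = data.map(_.split("\t"))

// persist with an explicit storage level (cache() is persist(MEMORY_ONLY)).
parsed.persist(StorageLevel.MEMORY_AND_DISK)

// checkpoint truncates the lineage; a checkpoint directory must be set first.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")         // placeholder directory
parsed.checkpoint()
parsed.count()        // the first action materializes both the cache and the checkpoint

parsed.unpersist()    // drop the cached partitions when no longer needed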
Example: WordCount
1. var lines = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/1.txt")
2. var counts = lines.flatMap(line => line.split(" "))
3.               .map(word => (word, 1))
4.               .reduceByKey((a, b) => a + b)
5. counts.collect().foreach(println)
6. counts.saveAsTextFile("hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/wordcount_out")
Example: WordCount
scala> counts.toDebugString
(2) ShuffledRDD[8] at reduceByKey at <console>:23 []
+-(2) MapPartitionsRDD[7] at map at <console>:23 []
| MapPartitionsRDD[6] at flatMap at <console>:23 []
| MapPartitionsRDD[5] at textFile at <console>:21 []
| hdfs://xxx HadoopRDD[4] at textFile at <console>:21 []
Dependency
● Narrow dependencies:
○ allow for pipelined execution on one cluster node
○ easy fault recovery
● Wide dependencies:
○ require data from all parent partitions to be available and to be
shuffled across the nodes
○ a single failed node might cause a complete re-execution
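A small sketch (not from the deck) that makes the two dependency types visible in the lineage:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

// Narrow dependency: each output partition depends on exactly one parent partition.
val doubled = pairs.mapValues(_ * 2)

// Wide dependency: reduceByKey needs records from all parent partitions, so it shuffles.
val sums = doubled.reduceByKey(_ + _)

println(sums.toDebugString)   // the shuffle boundary shows up as a ShuffledRDD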
DAG
● Directed:
○ Only in a single direction
● Acyclic:
○ No looping
Shuffle
● redistributes data among partitions
● partitions keys into buckets
● writes intermediate files to disk, which are fetched by the next stage of
tasks (sketched below)
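A hedged sketch of the "partition keys into buckets" step made explicit with a HashPartitioner; the data and bucket count are made up.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

// Shuffle: hash each key into one of 4 buckets and redistribute the records.
val bucketed = pairs.partitionBy(new HashPartitioner(4))

// Reusing the same partitioner lets reduceByKey skip a second shuffle.
val sums = bucketed.reduceByKey(new HashPartitioner(4), _ + _)
sums.collect().foreach(println)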
Stage
● A stage is a set of independent tasks of a Spark job;
● The DAG of tasks is split into stages at the boundaries where shuffles occur;
● The DAGScheduler runs the stages in topological order;
● Each stage is either a shuffle map stage, in which case its tasks'
results are input for another stage, or a result stage, in which case its
tasks directly compute the action that initiated the job.
Job Schedule
Spark Deploy
● Each application gets its own executor processes, isolating applications
from each other;
● The driver program must listen for and accept incoming connections from
its executors on the worker nodes;
● Because the driver schedules tasks on the cluster, it should be run close to
the worker nodes, preferably on the same local area network.
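A hedged sketch of the driver side of this picture; the application name, master URL, and resource settings below are placeholders, not values from the deck.

import org.apache.spark.{SparkConf, SparkContext}

// The driver builds a SparkConf and a SparkContext; the context connects to the
// cluster manager, acquires executor processes on worker nodes, ships the
// application code to them, and then schedules tasks on those executors.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master-host:7077")      // placeholder standalone master URL
  .set("spark.executor.memory", "2g")
  .set("spark.cores.max", "8")

val sc = new SparkContext(conf)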
Tips
Use groupByKey/collect carefully
Use mapPartitions if initialization is heavy
Use Kryo serialization instead of Java serialization
Repartition if a filter causes data skew (a sketch of these tips follows)
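A hedged sketch of three of these tips; the paths, date format, and partition count are assumptions.

import java.text.SimpleDateFormat
import org.apache.spark.SparkConf

// Kryo instead of Java serialization: set before the SparkContext is created.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// mapPartitions: build the (relatively expensive, non-thread-safe) formatter once
// per partition instead of once per record.
val lines = sc.textFile("hdfs:///tmp/dates.txt")      // placeholder path
val timestamps = lines.mapPartitions { iter =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  iter.map(line => fmt.parse(line).getTime)
}

// Repartition after a selective filter so the surviving records are spread evenly.
val balanced = timestamps.filter(_ > 0L).repartition(8)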
Spark Streaming
● Receives live input data streams and divides the data into batches of X
seconds;
● Treats each batch of data as RDDs and processes them using RDD
operations;
● The processed results of the RDD operations are returned in batches;
● Batch sizes as low as ½ sec, latency of about 1 sec.
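A hedged sketch of a DStream word count over 1-second batches; the socket source, host, and port are placeholders.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cut the live stream into 1-second batches; each batch is handled as an RDD.
val ssc = new StreamingContext(sc, Seconds(1))

// Placeholder source: lines of text arriving on a TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)

// The usual RDD-style operations, applied batch by batch.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()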
Spark Streaming
Window-based computations allow you to apply transformations over a sliding window
of data. Every time the window slides over a source DStream, the source RDDs that fall
within the window are combined and operated upon to produce the RDDs of the
windowed DStream.
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
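A hedged sketch reusing ssc and counts from the previous sketch: a 30-second window sliding every 10 seconds (both values are assumptions, and both must be multiples of the batch interval).

import org.apache.spark.streaming.Seconds

// The incremental form of reduceByKeyAndWindow needs checkpointing enabled.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // placeholder directory

val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // add counts entering the window
  (a: Int, b: Int) => a - b,   // subtract counts leaving the window
  Seconds(30),                 // window length
  Seconds(10))                 // sliding interval
windowedCounts.print()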
Spark SQL
● Query data via SQL on DataFrame, a distributed collection of data organized into
named columns;
● A DataFrame can be created from an existing RDD, from a Hive table, or from data
sources;
● A DataFrame can be operated on as a normal RDD and can also be registered as a
temporary table. Registering as a table allows you to run SQL queries over its data.
Example: run SQL on a text file (sketched below).
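A hedged sketch of the "run SQL on a text file" example in the Spark 1.x API; the file layout (tab-separated name and age) and paths are assumptions.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical record layout for the text file: name<TAB>age
case class Person(name: String, age: Int)

val people = sc.textFile("hdfs:///tmp/people.txt")    // placeholder path
  .map(_.split("\t"))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary table and query it with SQL.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)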
Thanks

Editor's Notes

  • #3 (MapReduce) Every round of computation has a Map phase and a Reduce phase, so an algorithm has to be expressed as a sequence of MapReduce rounds. Reduce output must be written to disk, and the intermediate shuffle involves heavy network transfer. MapReduce therefore has high latency and is poorly suited to iterative computation. Running SQL requires setting up Hive; running machine-learning algorithms requires Mahout.
  • #4 (Spark) Spark partitions data and computes on it in memory, which can be over 100x faster than MapReduce. It lets developers write programs quickly in Java, Scala, or Python. It is highly general: a single unified stack covers SQL queries, stream processing, machine learning, and graph computation.
  • #6 (RDD) RDDs can depend on one another. Automatically rebuilt on failure; persistence for reuse (RAM and/or disk).
  • #12 (Dependency) If each partition of an RDD is used by at most one partition of a child RDD, the dependency is narrow; if it is depended on by partitions of multiple child RDDs, the dependency is wide.
  • #21 (Spark Deploy) Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers, which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code to the executors. Finally, SparkContext sends tasks to the executors to run.