3. Spark
“Apache Spark is a fast and general-purpose cluster
computing system. It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs. It also
supports a rich set of higher-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing, and Spark Streaming.”
5. Resilient Distributed Dataset
RDD is an immutable and partitioned collection:
● Resilient:
○ automatically recover from node failures
● Distributed:
○ data is partitioned across the nodes of the cluster and can be operated
on in parallel
● Dataset:
○ RDDs can be created from a file in the Hadoop file system, or from an
existing Scala collection
6. Create RDD
“There are two ways to create RDDs: parallelizing an existing collection in your
driver program, or referencing a dataset in an external storage system, such as
a shared filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.”
1. val mydata = sc.parallelize(Array(1, 2, 3, 4, 5))
2. val mydata = sc.makeRDD(1 to 100, 2)
3. val mydata = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg:8020/tmp/1.txt")
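The two creation paths can be sketched as a standalone local-mode script (in spark-shell the SparkContext `sc` is already provided; the app name and master here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context for a standalone script; spark-shell provides `sc`.
val sc = new SparkContext(
  new SparkConf().setAppName("create-rdd-demo").setMaster("local[2]"))

// 1) Parallelize a collection that lives in the driver program.
val fromCollection = sc.parallelize(Array(1, 2, 3, 4, 5))

// 2) makeRDD does the same; the second argument sets the partition count.
val ranged = sc.makeRDD(1 to 100, 2)

println(fromCollection.count())    // 5 elements
println(ranged.getNumPartitions)   // 2 partitions, as requested
```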
7. Operate RDD
“Two types of operations: transformations, which create a new dataset(RDD)
from an existing one, and actions, which return a value to the driver program
after running a computation on the dataset.”
“All transformations in Spark are lazy, in that they do not compute their results
right away. Instead, they just remember the transformations applied to some
base dataset. The transformations are only computed when an action requires
a result to be returned to the driver program.”
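Laziness can be observed directly with an accumulator: the `map` closure runs zero times until an action fires. A minimal local-mode sketch (`longAccumulator` is the Spark 2.x API; older versions use `sc.accumulator(0L)`):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("lazy-demo").setMaster("local[2]"))

val calls = sc.longAccumulator("map calls")
val doubled = sc.parallelize(1 to 10).map { x => calls.add(1); x * 2 }

// Only the transformation has been recorded so far:
println(calls.value)      // 0 — the closure has not run yet

val total = doubled.sum() // action: triggers the actual computation
println(calls.value)      // 10 — the closure ran once per element
```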
8. Other Control
● persist / cache / unpersist:
○ stores the partitions of the RDD in memory and/or on disk
○ can store them using different storage levels
○ makes future actions much faster, a key tool for iterative
algorithms
● checkpoint:
○ saves the RDD to disk and forgets the lineage of the RDD
completely. This allows long lineages to be truncated.
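Both controls can be sketched in local mode (the checkpoint directory below is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("persist-demo").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path

val rdd = sc.parallelize(1 to 1000).map(_ * 2)

// Keep computed partitions around; MEMORY_ONLY is what cache() uses,
// other storage levels serialize or spill to disk.
rdd.persist(StorageLevel.MEMORY_ONLY)

rdd.checkpoint()             // marked; materialized by the first action
println(rdd.count())         // 1000 — also writes the checkpoint files
println(rdd.isCheckpointed)  // true: the lineage is now truncated

rdd.unpersist()              // release the cached partitions
```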
9. Example: WordCount
1. val lines = sc.textFile("hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/1.txt")
2. val counts = lines.flatMap(line => line.split(" "))
3. .map(word => (word, 1))
4. .reduceByKey((a, b) => a + b)
5. counts.collect().foreach(println)
6. counts.saveAsTextFile("hdfs://sa-onlinehdm1-cnc0.hlg01:8020/tmp/wordcount_out")
10. Example: WordCount
scala> counts.toDebugString
(2) ShuffledRDD[8] at reduceByKey at <console>:23 []
+-(2) MapPartitionsRDD[7] at map at <console>:23 []
| MapPartitionsRDD[6] at flatMap at <console>:23 []
| MapPartitionsRDD[5] at textFile at <console>:21 []
| hdfs://xxx HadoopRDD[4] at textFile at <console>:21 []
12. Dependency
● Narrow dependencies:
○ allow for pipelined execution on one cluster node
○ easy fault recovery
● Wide dependencies:
○ require data from all parent partitions to be available and to be
shuffled across the nodes
○ a single failed node might cause a complete re-execution
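The two dependency kinds can be seen in a lineage: `map` creates a narrow dependency, while `reduceByKey` creates a wide one, which appears as a ShuffledRDD (and a stage boundary) in `toDebugString`. A local-mode sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("dep-demo").setMaster("local[2]"))

val pairs = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))                   // narrow: each child partition
                                      // depends on one parent partition

val counts = pairs.reduceByKey(_ + _) // wide: data from all parent
                                      // partitions must be shuffled

// The shuffle shows up as a ShuffledRDD in the lineage:
println(counts.toDebugString)
```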
14. Shuffle
● redistributes data among partitions
● partitions keys into buckets
● writes intermediate files to disk, which are fetched by the next stage of tasks
15. Stage
● A stage is a set of independent tasks of a Spark job;
● The DAG of tasks is split up into stages at the boundaries where a shuffle occurs;
● The DAGScheduler runs the stages in topological order;
● Each stage can either be a shuffle map stage, in which case its tasks'
results are input for another stage, or a result stage, in which case its
tasks directly compute the action that initiated a job.
21. Spark Deploy
● Each application gets its own executor processes, isolating applications
from each other;
● The driver program must listen for and accept incoming connections from
its executors on the worker nodes;
● Because the driver schedules tasks on the cluster, it should be run close to
the worker nodes, preferably on the same local area network.
23. Spark Streaming
● Receives live input data streams and divides the data into batches of X
seconds;
● Treats each batch of data as RDDs and processes them using RDD
operations;
● The processed results of the RDD operations are returned in batches;
● Batch sizes as low as ½ sec, latency of about 1 sec.
25. Spark Streaming
Window-based computations allow you to apply transformations over a sliding window
of data. Every time the window slides over a source DStream, the source RDDs that fall
within the window are combined and operated upon to produce the RDDs of the
windowed DStream.
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
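A windowed word count over a socket stream can be sketched as follows. This is illustrative only: the host and port are placeholders, and the program runs until terminated. The batch interval is 10 seconds, with a 30-second window sliding every 10 seconds:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second batches

// Placeholder source: text lines arriving on a local socket.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

// Counts over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts = words
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()  // blocks until the streaming job is stopped
```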
26. Spark SQL
● Query data via SQL on a DataFrame, a distributed collection of data organized into
named columns;
● A DataFrame can be created from an existing RDD, from a Hive table, or from data
sources;
● A DataFrame can be operated on like a normal RDD and can also be registered as a
temporary table. Registering as a table allows you to run SQL queries over its data.
Example: run SQL on a text file.
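The RDD-to-DataFrame-to-SQL flow can be sketched with the Spark 2.x SparkSession API (in Spark 1.x the same flow goes through SQLContext and registerTempTable; the names and data here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-demo").master("local[2]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// DataFrame from an existing RDD of case-class rows:
val people = spark.sparkContext
  .parallelize(Seq(Person("alice", 30), Person("bob", 25)))
  .toDF()

// Register as a temporary view so SQL can query it:
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age >= 30")
adults.show()
```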
“Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers, which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code to the executors. Finally, SparkContext sends tasks to the executors to run.”
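The connection to a cluster manager is configured through SparkConf (or equivalently via spark-submit flags). A sketch with illustrative values — the master URL picks the cluster manager: `local[*]` for local mode, `spark://host:7077` for a standalone cluster, `yarn` for YARN:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("spark://master-host:7077")  // placeholder host
  .set("spark.executor.memory", "2g")     // per-executor resources
  .set("spark.executor.cores", "2")

println(conf.get("spark.master"))  // spark://master-host:7077
```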