Spark is an open-source cluster computing framework that uses in-memory processing to share data across jobs, enabling faster iterative queries and interactive analytics. It is built on Resilient Distributed Datasets (RDDs), which can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Introduces Apache Spark, an open-source cluster computing framework for real-time data analysis, emphasizing in-memory data sharing, scalability, and its programming model.
Details types of Spark deployments and its unified platform capabilities for big data analytics, highlighting compatibility with various storage systems.
Chronicles the history of Spark from its inception at UC Berkeley to its rise as a top-level Apache project, noting its active community contributions.
Compares Hadoop's data processing model with Spark's in-memory capabilities, emphasizing Spark's speed and efficiency over traditional disk-based processing.
Explains Spark's execution model focusing on the in-memory sharing of data, and outlines its core components and their functionalities.
Explores Spark's programming model based on RDDs (Resilient Distributed Datasets), detailing their properties, transformations, and resilience.
Describes how Spark executes jobs including DAG graph construction, scheduling, and task execution across the cluster.
Discusses data shuffling mechanisms in Spark and methods for joining RDDs, highlighting shuffle versus broadcast joins.
Analyzes Spark's fault tolerance strategies that utilize RDD lineage for data recovery without costly replication.
Details the distinctions between transformations and actions in Spark, emphasizing lazy evaluation and how RDDs are operated upon.
Describes the processes involved in creating RDDs, managing partitions, and the concepts of narrow and wide dependencies. Explains Spark's handling of shared variables, specifically broadcast variables and accumulators for efficient data processing. Discusses RDD partitions, caching strategies, and memory management to optimize performance in Spark applications.
Focuses on the ability to extend Spark's RDD API for custom operations and highlights the benefits of using RDDs in big data processing.
Provides links to resources and references for further reading on Apache Spark and its functionalities.
Concludes the presentation and invites further connection through LinkedIn.
What is Spark?
• An open-source cluster computing framework
• Leverages distributed memory
• Allows programs to load data into a cluster's memory and query it repeatedly
• Compared to Hadoop
– Scalability: can work with large data sets
– Fault tolerance: can self-recover from failures
• Functional programming model
• Supports batch & streaming analysis
What is Spark?
• Separate, fast MapReduce-like engine
– In-memory data storage for very fast iterative queries
– General execution graphs and powerful optimizations
• Compatible with Hadoop storage APIs
– Can read / write to any Hadoop-supported system, including HDFS, HBase, Sequence
Files etc
• Faster application development – typically 2-5× less code than Hadoop MapReduce
• Disk execution speed – up to 10× faster than MapReduce
• Memory execution speed – up to 100× faster than MapReduce
What is Spark?
• Apart from simple map and reduce operations, supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box
• In-memory cluster computing
• Supports any existing Hadoop input / output format
• Spark is written in Scala
• Provides concise and consistent APIs in Scala, Java and Python
• Offers interactive shell for Scala and Python
Spark Deployments – Cluster Manager Types
• Standalone (native Spark cluster)
• Hadoop YARN – the Hadoop 2 resource manager
• Apache Mesos – a generic cluster manager that can also run MapReduce
• Local – a pseudo-distributed local mode for development or testing using the local file system
– Spark runs on a single machine with one executor per CPU core
Project History
• 2009 – Project started at the UC Berkeley AMPLab
• 2010 – Open sourced under a BSD license
• 2013 – Donated to the Apache Software Foundation; license switched to Apache 2.0
• Feb 2014 – Became an Apache Top-Level Project
• Nov 2014 – The engineering team at Databricks used Spark to set a new world record in large-scale sorting
The Most Active Open Source Project in Big Data
• Project contributors in the past year: Spark 125, Hadoop MapReduce 103, Giraph 32, Storm 25, Tez 17
Hadoop Model
• Hadoop has an acyclic data flow model
– Load data -> process data -> write output -> finished
• Hadoop is slow due to replication, serialization and disk I/O
• Hadoop is at a disadvantage when pipelining multiple jobs
• Cheaper DRAM makes main memory a better option than disk for storing intermediate results
Spark Model
• MapReduce allows sharing data across jobs only through stable storage such as a distributed file system, which is slow
• Applications want to reuse intermediate results across multiple computations
– Work on same dataset to optimize parameters in machine learning algorithms
– More complex, multi-stage applications (iterative graph algorithms and machine
learning)
– More interactive ad-hoc queries
– Efficient primitives for data sharing across parallel jobs
• These challenges can be tackled by keeping intermediate results in memory
• Caching the data for multiple queries benefits interactive data analysis tools
Spark - In-Memory Data Sharing
• In-memory data sharing is 10-100× faster than sharing via network and disk
[Diagram: the input is processed once into distributed memory; iterations 1..n and subsequent queries read results 1, 2, 3 directly from that memory]
Stack
• Spark SQL
– Allows querying data via SQL as well as the Apache Hive variant of SQL (HQL), and supports many data sources, including Hive tables, Parquet and JSON
• Spark Streaming
– Component that enables processing of live data streams in an elegant, fault-tolerant, scalable and fast way
• MLlib
– Library containing common machine learning (ML) functionality, including algorithms such as classification, regression, clustering and collaborative filtering, designed to scale out across a cluster
Stack
• GraphX
– Library for manipulating graphs and performing graph-parallel computation
• Cluster Managers
– Spark is designed to efficiently scale up from one to many thousands of compute nodes. It can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, etc.
– Spark has a simple cluster manager included in Spark itself called the Standalone
Scheduler
Programming Model
• Spark programming model is based on parallelizable operators
• Parallelizable operators are higher-order functions that execute user-defined functions in
parallel
• A data flow is composed of any number of data sources, operators, and data sinks by
connecting their inputs and outputs
• Job description is based on directed acyclic graphs (DAG)
• Spark allows programmers to develop complex, multi-step data pipelines using directed
acyclic graph (DAG) pattern
• Since Spark is based on a DAG, it can follow the chain from child to parent to fetch any value, like a tree traversal
• The DAG supports fault tolerance
How Spark Works
• User submits jobs
• Every Spark application consists of a driver program that launches various
parallel operations on the cluster
• The driver program contains your application’s main function and defines
distributed datasets on the cluster, then applies operations to them
How Spark Works
• Driver programs access Spark through the SparkContext object, which represents a connection to a computing cluster.
• The SparkContext can be used to build RDDs (Resilient distributed
datasets) on which you can run a series of operations
• To run these operations, driver programs typically manage a number of
nodes called executors
How Spark Works
• The SparkContext (driver) contacts the Cluster Manager, which assigns cluster resources
• It then sends the application code to the assigned Executors (distributing computation, not data)
• Finally, it sends tasks to the Executors to run
Spark Context
• Main entry point to Spark functionality
• Available in the shell as the variable sc
• In standalone programs, you create your own
• import org.apache.spark.SparkContext
• import org.apache.spark.SparkContext._
• val sc = new SparkContext(master, appName, [sparkHome], [jars])
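For instance, a minimal standalone program might look like the following sketch (the application name, the local[*] master and the data are illustrative; the SparkConf-based constructor is an alternative to the positional one above):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Configure the application; "local[*]" runs locally using all CPU cores
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Use sc to build RDDs and run operations on them
    val data = sc.parallelize(1 to 100)
    println(data.reduce(_ + _))   // 5050

    sc.stop()
  }
}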
RDD - Resilient Distributed Datasets
• A distributed memory abstraction
• An immutable distributed collection of data partitioned across machines
in a cluster – provides scalability
• Immutability provides safety with parallel processing
• Distributed - stored in memory across the cluster
RDD - Resilient Distributed Datasets
• Stored in-memory - automatically rebuilt if a partition is lost
• In-memory storage makes it fast
• Facilitates two types of operations: transformations and actions
• Lazily evaluated
• Type inferred
RDDs
• Fault-tolerant collection of elements that can be operated on in parallel
• Manipulated through various parallel operators using a diverse set of
transformations (map, filter, join etc)
• Fault recovery without costly replication
• Remembers the series of transformations that built an RDD (its lineage) to re-
compute lost data
• RDD operators are higher order functions
• Turn a collection into an RDD
– val a = sc.parallelize(Array(1, 2, 3))
Program Execution
• The driver program, when starting execution, builds up a graph where nodes are RDDs and edges are transformation steps
• No execution happens at the cluster until an action is encountered
• The driver program ships the execution graph as well as the code block to the
cluster, where every worker server will get a copy
• The execution graph is a DAG
• Each DAG is an atomic unit of execution
Program Execution
• Each source node (no incoming edge) is an external data source or driver memory
• Each intermediate node is an RDD
• Each sink node (no outgoing edge) is an external data source or driver memory
• A green edge connecting to an RDD represents a transformation
• A red edge connecting to a sink node represents an action
How Spark Works?
• Spark is divided into various independent layers with distinct responsibilities
• The first layer is the interpreter – Spark uses a Scala interpreter, with some modifications
• When code is typed into the Spark console (creating RDDs and applying operators), Spark creates an operator graph
• When an action is run, the graph is submitted to the DAG Scheduler
• DAG scheduler divides operator graph into (map and reduce) stages
• A stage consists of tasks based on partitions of the input data
How Spark Works?
• The DAG scheduler pipelines operators together to optimize the graph
– Example: many map operators can be scheduled into a single stage
• The final result of the DAG scheduler is a set of stages that are passed on to the Task Scheduler
• The task scheduler launches tasks via cluster manager (Spark
Standalone/Yarn/Mesos)
• The task scheduler doesn’t know about dependencies among stages
• The Worker executes the tasks by starting a new JVM per job
• The worker knows only about the code that is passed to it
Job Scheduling
• When an action on an RDD is executed, the scheduler builds a DAG of stages from the RDD lineage graph
• A stage contains many pipelined transformations with narrow dependencies
• The boundaries of a stage are
– shuffles for wide dependencies
– already computed partitions
Job Scheduling
• The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD
• Tasks are assigned to machines based on data locality
• If a task needs a partition, which is available in the memory of a node, the
task is sent to that node
Data Shuffling
• Spark ships the code to the worker servers, where data processing happens
• But data movement cannot be completely eliminated
• Example – if the processing requires data residing in different partitions to be grouped first, then data must be shuffled among worker servers
• Transformations come in two types – narrow and wide
Data Shuffling
• Narrow transformation
– The processing logic depends only on data already residing in the same partition, so no data shuffling is needed
– Examples: filter(), sample(), map(), flatMap(), etc.
• Wide transformation
– The processing logic depends on data residing in multiple partitions, so data must be shuffled to bring it together in one place (see the sketch below)
– Examples: groupByKey(), reduceByKey(), etc.
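As a small illustration (a hypothetical word-count pipeline, with sc being the shell's SparkContext and the input path illustrative), the first transformations below are narrow and run without a shuffle, while reduceByKey is wide and triggers one:

val lines = sc.textFile("hdfs://path/to/input")        // base RDD

// Narrow: each output partition depends on a single input partition – no shuffle
val words = lines.flatMap(_.split(" "))
val longWords = words.filter(_.length > 3)

// Wide: grouping by key needs data from many partitions – shuffle
val counts = longWords.map(w => (w, 1)).reduceByKey(_ + _)

counts.take(10).foreach(println)                       // action triggers execution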
RDD Joins
• Joining two RDDs affects the amount of data shuffled
• Spark provides two ways to join data – shuffle and broadcast
• Shuffle join – data from both RDDs with the same key is redistributed to the same partition; the items of each RDD are shuffled across worker servers
• Broadcast join – one of the RDDs is broadcast and copied over to every partition
– If one RDD is significantly smaller than the other, a broadcast join reduces network traffic, because only the small RDD needs to be copied to all worker servers while the large RDD is not shuffled at all (see the sketch below)
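A rough sketch of the two approaches (the RDDs, keys and values are hypothetical; sc is the shell's SparkContext):

val orders = sc.parallelize(Seq((1, 9.99), (2, 5.00), (1, 3.50)))   // larger RDD of (userId, amount)
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))          // small RDD of (userId, name)

// Shuffle join: both RDDs are repartitioned so that matching keys meet on one node
val shuffleJoined = orders.join(users)                              // RDD[(Int, (Double, String))]

// Broadcast join: copy the small RDD to every worker and join map-side, without shuffling orders
val userMap = sc.broadcast(users.collectAsMap())
val broadcastJoined = orders.flatMap { case (id, amount) =>
  userMap.value.get(id).map(name => (id, (amount, name)))
}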
Fault Resiliency
• RDDs track the series of transformations used to build them (their lineage) to recompute lost data
• messages = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
[Lineage: HDFS File -> filter(func = startswith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]
Fault Resiliency
• RDDs maintain lineage information used to reconstruct lost partitions
• Logging lineage rather than the actual data
• No replication
• Recompute only the lost partitions of an RDD
Fault Resiliency
• Recovery may be time-consuming for RDDs with long lineage chains and wide dependencies
• It is helpful to checkpoint some RDDs to stable storage
• Decision about which data to checkpoint is left to users
Fault Resiliency
• The DAG defines deterministic transformation steps between the different partitions of data within each RDD
• Whenever a worker server crashes during the execution of a stage,
another worker server re-executes the stage from the beginning by
pulling the input data from its parent stage that has the output data
stored in local files
Fault Resiliency
• In case the result of the parent stage is not accessible (the worker server lost the file), the parent stage needs to be re-executed as well
• In a lineage of transformation steps, a failure of any step triggers a restart of execution from the last step whose output is still available
• Since the DAG itself is an atomic unit of execution, all the RDD values will
be forgotten after the DAG finishes its execution
Fault Resiliency
• Therefore, after the driver program finishes an action (which executes a DAG to completion), all the RDD values are forgotten; if the program accesses the RDD again in a subsequent statement, it has to be recomputed from its dependencies
• To reduce this repetitive processing, Spark provides a caching mechanism to keep RDDs in worker server memory (or on local disk)
• Once the execution planner finds that an RDD is already cached in memory, it uses the RDD right away without tracing back to its parent RDDs
• This way, the DAG is pruned once a cached RDD is reached
RDD Operators - Transformations
• Create a new dataset from an existing one: map, filter, distinct, union, sample, groupByKey, join, etc.
• RDD transformations create dependencies between RDDs
• Dependencies are only steps for producing results (a program)
RDD Operators - Transformations
• Each RDD in the lineage chain (chain of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD
• Spark divides RDD dependencies into stages and tasks and sends those to workers for execution
• Lazy operators
RDD Operators - Actions
• Return a value after running a computation
• Compute a result based on an RDD
• Result is returned to the driver program or saved to an external storage
system
• Typical RDD actions are count, first, collect, takeSample and foreach (see the sketch below)
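A few of these actions in use (assuming the shell's sc; the data is hypothetical and the results are shown as comments):

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.count()                                   // 5 – returned to the driver
nums.first()                                   // 5
nums.collect()                                 // Array(5, 1, 4, 2, 3)
nums.takeSample(false, 2)                      // random sample of two elements
nums.saveAsTextFile("hdfs://path/to/output")   // written to external storage (path illustrative)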
Transformations
• Set of operations on an RDD that define how its data should be transformed
• An operation such as map(), filter() or union() on an RDD that yields another RDD
• Transformations create new RDDs based on existing RDDs
• RDDs are immutable
• Lazily evaluated – data in RDDs is not processed until an action is performed
Transformations
• Why lazy execution? Because Spark can then optimize the whole series of transformations applied to an RDD
• The Spark driver remembers the transformations applied to an RDD, so a lost partition can be reconstructed on some other machine in the cluster
• This resiliency is achieved via a lineage graph
Transformations
• words – an RDD containing a reference to the lines RDD (see the code below)
• When the program executes, first the lines function is executed (load the data from a text file)
• Then the words function is executed on the resulting data (split lines into words)
• Spark is lazy, so nothing is executed unless some transformation or action is called that triggers job creation and execution (collect in this example)
• An RDD (including a transformed RDD) is not 'a set of data', but a step in a program (possibly the only step) telling Spark how to get the data and what to do with it
Transformations
• val lines = sc.textFile("...")
• val words = lines.flatMap(line => line.split(" "))
• val localwords = words.collect()
Actions
• Applies all transformations on the RDD and then performs the action to obtain the result
• Operations that return a final value to the driver program or write data to an
external storage system
• After performing action on RDD, the result is returned to the driver program or
written to the storage system
Actions
• Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output
• An action can be recognized by looking at the return value
– primitive and built-in types such as int, long, List<Object>, Array<Object>, … indicate an action (rather than another RDD)
RDD Creation
• Read from data sources – HDFS, JSON files, text files, any kind of file
• Transforming other RDDs using parallel operations - transformations and actions
• RDD keeps information about how it was derived from other RDDs
• An RDD has a set of partitions and a set of dependencies on parent RDDs
• Narrow dependency – the RDD derives from only one parent
RDD Creation
• Wide dependency – the RDD has two or more parents (e.g. joining two parents)
• A function to compute its partitions from its parents
• Metadata about its partitioning scheme and data placement (preferred location to compute each partition)
• Partitioner (defines the strategy for partitioning its data)
Shared Variables
• When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task
• These variables are copied to each machine
• No updates to the variables on the remote machine are propagated back to the
driver program
• Spark does provide two limited types of shared variables for two common usage
patterns
– broadcast variables
– accumulators
Broadcast Variables
• A broadcast variable is a read-only variable made available from the driver program (which runs the SparkContext object) to the nodes that will execute the computation
• Useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms
• The broadcast values are not shipped to the nodes more than once
Broadcast Variables
• To create a broadcast variable, call a method on SparkContext
– val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
• Spark attempts to distribute broadcast variables using efficient broadcast
algorithms to reduce communication cost
– For example, to give every node a copy of a large input dataset efficiently
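A small usage sketch (the stop-word list and data are hypothetical); tasks read the broadcast value through its value method:

val stopWords = sc.broadcast(Set("a", "an", "the", "of"))   // shipped to each node once

val words    = sc.parallelize(Seq("the", "quick", "brown", "fox"))
val filtered = words.filter(w => !stopWords.value.contains(w))

filtered.collect()                                          // Array(quick, brown, fox)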
Accumulators
• An accumulator is also a variable that is broadcast to the worker nodes
• Variables that can only be added to through an associative operation
• The addition must be an associative operation so that the global accumulated
value can be correctly computed in parallel and returned to the driver program
• Used to implement counters and sums, efficiently in parallel
Accumulators
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can add support for new types
• Only the driver program can read an accumulator’s value, not the task
• Each worker node can only access and add to its own local accumulator value
• Only the driver program can access the global value
• Accumulators are also accessed within Spark code using the value method (see the sketch below)
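A sketch using the Spark 1.x-era accumulator API that this deck reflects (newer versions offer sc.longAccumulator); the record format is hypothetical:

val badRecords = sc.accumulator(0)                  // counter visible to the driver

val lines  = sc.parallelize(Seq("1,ok", "2,ok", "broken"))
val parsed = lines.flatMap { line =>
  val parts = line.split(",")
  if (parts.length == 2) Some(parts(0).toInt)
  else { badRecords += 1; None }                    // tasks can only add to it
}

parsed.count()                                      // the action forces the updates to happen
println(badRecords.value)                           // only the driver reads the total: 1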
RDD Partitions
• An RDD is divided into a number of partitions, which are atomic pieces of information
• Partitions of an RDD can be stored on different nodes of a cluster
• RDD data is just a collection of partitions
• Logical division of data
• Derived from Hadoop MapReduce
• All input, intermediate and output data is represented as partitions
• Partitions are the basic unit of parallelism (see the sketch below)
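For example (the path and partition counts are illustrative), the number of partitions can be inspected and changed:

val rdd = sc.textFile("hdfs://path/to/input", 8)   // ask for at least 8 input partitions
println(rdd.partitions.length)                     // how many partitions were actually created

val wider = rdd.repartition(16)                    // reshuffle into 16 partitions
val fewer = wider.coalesce(4)                      // shrink without a full shuffle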
Partitioning - Immutability
• All partitions are immutable
• Each RDD has two sets of parallel operations – transformations and actions
• Every transformation generates new partitions
• Partition immutability is driven by the underlying storage, such as HDFS
• Partition immutability allows for fault recovery
Partitioning - Distribution
• Partitions derived from HDFS are distributed by default
• Partitions are also location aware
• Location awareness of partitions allow for data locality
• Computed data can also be distributed in memory using caching
Accessing Partitions
• Lets a whole partition be processed together, rather than a single row at a time
• Use the mapPartitions API of RDD
• Allows partition-wise operations that cannot be done by accessing a single row at a time (see the sketch below)
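A sketch of the idea: create one expensive object per partition instead of per row (the number formatter is just a stand-in for any costly resource such as a connection or parser):

val ids = sc.parallelize(1 to 10, numSlices = 2)

val formatted = ids.mapPartitions { iter =>
  val formatter = new java.text.DecimalFormat("0000")   // built once per partition
  iter.map(i => "ID-" + formatter.format(i))
}

formatted.collect()   // Array(ID-0001, ID-0002, ..., ID-0010)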
Partitioning of Transformed Data
• Partitioning is different for key/value pairs that are generated by a shuffle operation
• Partitioning is driven by the partitioner specified
• By default, a HashPartitioner is used
• Can use your own partitioner too
Custom Partitioner
• Partition the data according to your data structures
• Custom partitioning allows control over the number of partitions and the distribution of data when grouping or reducing is done (see the sketch below)
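A toy sketch of a custom partitioner (the keys and the two-partition rule are hypothetical):

import org.apache.spark.Partitioner

// Send "premium" keys to partition 0 and everything else to partition 1
class TierPartitioner(premium: Set[String]) extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (premium.contains(key.toString)) 0 else 1
}

val sales  = sc.parallelize(Seq(("acme", 100.0), ("smallco", 10.0), ("acme", 50.0)))
val byTier = sales.partitionBy(new TierPartitioner(Set("acme")))
val totals = byTier.reduceByKey(_ + _)   // grouping now respects the chosen layout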
Lookup Operation
• Partitioning allows faster lookups
• The lookup operation allows looking up the values for a given key
• Using the partitioner, lookup determines which partition to look in
• Then it only needs to look in that partition
• If no partitioner is set, it falls back to a filter over all partitions (see the sketch below)
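For example (the data is hypothetical):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// No partitioner known: lookup falls back to filtering every partition
pairs.lookup("a")                                        // Seq(1, 3)

// With a partitioner, lookup only scans the single partition that owns the key
val partitioned = pairs.partitionBy(new HashPartitioner(4)).cache()
partitioned.lookup("a")                                  // Seq(1, 3)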
Laziness – Parent Dependency
• Each RDD has access to its parent RDD
• The value of the parent for the first RDD is nil
• Before computing its value, an RDD always computes its parent
• This chain of computation allows for laziness
Subclassing
• Each Spark operator creates an instance of a specific subclass of RDD
• The map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc.
• The subclass allows the RDD to remember the operation performed in the transformation
RDD Transformations
• val dataRDD = sc.textFile(args(1))
• val splitRDD = dataRDD.flatMap(value => value.split(" "))
Compute Function
• A function for evaluating each partition of an RDD
• An abstract method of RDD
• Each subclass of RDD, such as MappedRDD or FilteredRDD, has to override this method
Lineage
• The transformations used to build an RDD
• RDDs are stored as a chain of objects capturing the lineage of each RDD
• val file = sc.textFile("hdfs://...")
• val sics = file.filter(_.contains("SICS"))
• val cachedSics = sics.cache()
• val ones = cachedSics.map(_ => 1)
• val count = ones.reduce(_+_)
RDD Actions
• val dataRDD = sc.textFile(args(1))
• val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
• flatMapRDD.collect()
• runJob API
– The API (on SparkContext) underlying action implementations
– Takes each partition and evaluates a function on it
– Used internally by all Spark actions (see the sketch below)
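A sketch of calling it directly (the per-partition function here simply sums the elements):

val rdd = sc.parallelize(1 to 10, numSlices = 4)

// runJob evaluates a function on each partition and returns one result per partition;
// actions such as count() and collect() are built on top of it
val partialSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)

println(partialSums.mkString(", "))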
Memory Management
• If there is not enough space in memory for a newly computed RDD partition, a partition from the least recently used RDD is evicted
• Spark provides three options for storage of persistent RDDs
– In memory storage as de-serialized Java objects
– In memory storage as serialized Java objects
– On disk storage
• When an RDD is persisted, each node stores any partitions of the RDD that it
computes in memory - allows future actions to be much faster
Memory Management
• Persist an RDD using the persist() or cache() methods
• Storage levels (example below)
– MEMORY_ONLY
– MEMORY_AND_DISK
– MEMORY_ONLY_SER
– MEMORY_AND_DISK_SER
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...
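For example (the path is illustrative); cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("hdfs://path/to/logs")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spilling to disk

errors.count()     // the first action computes and persists the partitions
errors.count()     // later actions reuse the persisted partitions

errors.unpersist() // release the storage when it is no longer needed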
Caching
• cache() internally uses the persist API
• persist sets a specific storage level for a given RDD
• The Spark context tracks persistent RDDs
• Partitions are put into memory by the block manager
Caching - Block Manager
• Handles all in-memory data in Spark
• Responsible for
– Cached data (BlockRDD)
– Shuffle data
– Broadcast data
• A partition is stored in a block with id (RDD.id, partition_index)
Working of Caching
• The partition iterator checks the storage level
• If a storage level is set, it calls cacheManager.getOrCompute(partition)
• Since the iterator is run for each RDD evaluation, caching is transparent to the user
Extending Spark API
• Extending the RDD API allows creating custom RDD structures
• Custom RDDs allow control over computation
• Possible to change partitions, locality and evaluation depending upon
requirements
Extending Spark API
• Custom operators on RDDs
– Domain-specific operators for specific RDD types
– Use the Scala implicits mechanism
– Feel and work like built-in operators (see the sketch below)
• Custom RDDs
– Extend the RDD API to create a new RDD type
– Combined with custom operators, this makes the RDD API very flexible
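A minimal sketch of the implicit-operator idea (the keepErrors operator and its rule are hypothetical):

import org.apache.spark.rdd.RDD

object TextRDDExtensions {
  // Adds a domain-specific operator to every RDD[String], so it feels built in
  implicit class RichTextRDD(val rdd: RDD[String]) extends AnyVal {
    def keepErrors(): RDD[String] = rdd.filter(_.contains("ERROR"))
  }
}

// Usage, assuming a SparkContext sc is in scope:
//   import TextRDDExtensions._
//   val errors = sc.textFile("hdfs://path/to/logs").keepErrors()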
RDD Benefits
• Data and intermediate results are stored in memory to speed up computation, and are located on appropriate nodes for optimization
• Able to perform transformation operations on an RDD many times
• Lineage information about RDD transformations is kept for failure recovery – if a failure occurs while operating on a partition, that partition is re-computed
RDD Benefits - Persistence
• Default is in memory
• Able to locate replicas on multiple nodes
• If data does not fit in memory, it spills to disk
• Better to make a checkpoint when a lineage is long or wide dependencies exist in the lineage – checkpointing is performed in the background
RDD Benefits
• Data locality works for narrow dependencies
• Intermediate results of wide dependencies are dumped to disk, like mapper output
• Comparison to DSM (Distributed Shared Memory)
– Fault tolerance is hard to implement on commodity servers
– RDDs are immutable, so taking a backup is easy
– In DSM, tasks access the same memory locations and interfere with each other's updates
References
1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury. Technical Report UCB/EECS-2011-82, July 2011
2. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale,
SOSP 2013, November 2013
3. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013
4. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013
5. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant Resource Fairness: Fair Allocation of Multiple
Resource Types, NSDI 2011, March 2011
6. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications, Stanford University, Stanford, CA, February 2011
7. Spark: Cluster Computing with Working Sets, HotCloud 2010, Boston, MA, June 2010
8. http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-
Michael-Armbrust.pdf
9. https://github.com/apache/spark/tree/master/sql