An Introduction to Apache Spark

✓ Need for Spark
✓ Introduction to Apache Spark
✓ Spark features
✓ Spark architecture
✓ What are RDDs
✓ Transformations & Actions
✓ Spark execution model
✓ Spark ecosystem

Why Spark?
Need for a general-purpose cluster computing system,
because each existing tool covers only one workload:
➢MapReduce is limited to batch processing
➢Storm is limited to real-time stream processing
➢Impala/Tez are limited to interactive processing
➢Neo4J/Giraph are limited to graph processing

Need for Spark
• Need for a powerful engine that can process
data in real time (streaming) as well as in
batch mode
• Need for a powerful engine that can respond in
under a second and perform in-memory analytics
• Apache Spark is a powerful open source engine
that provides real-time (stream), interactive,
graph, in-memory, and batch processing
with speed, ease of use, and sophisticated
analytics.

What is Apache Spark
A lightning-fast, general-purpose cluster
computing system

Introduction to Apache Spark
➢Apache Spark is a lightning-fast cluster computing
tool
➢General-purpose distributed system
➢Up to 100 times faster than MapReduce for
in-memory workloads
➢Written in Scala
➢Provides APIs in Scala, Java, and Python
➢Integrates with Hadoop and can process existing
Hadoop data

History
• Started at UC Berkeley in 2009
• Open sourced in 2010
• Donated to the Apache Software Foundation in 2013; became a
top-level project in 2014
• Became the most active project at Apache in 2015

Apache Spark features
• Speed
• Ease of use
• Low latency
• Integration with Hadoop
• Rich set of operators
• Fault tolerant
• Generalized execution model

Spark Architecture
• Follows a master/slave architecture
– Master node
– Slave node

Master node
• Manager node
• Assigns work to the slave nodes
• Manages, monitors, and maintains the
slaves, and keeps track of the work
assigned to them
• Master daemon: runs on the master node

Slave Nodes
• Worker nodes
• Do the work assigned by the master
• Slave daemon: runs on every slave node

• The user develops the application
• The user submits the application to the master
• The master divides the work
• And distributes it to the nodes of the cluster
• Each slave processes its share of the work
– In this manner Spark achieves distributed
computing and parallel processing
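
The divide-and-distribute flow above can be sketched in plain Python. This is only a toy model, not Spark code: the function names `split_work`, `do_sub_work`, and `run_job` are made up for illustration, and a local thread pool stands in for the slave nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, num_workers):
    """Master side: divide the input into one chunk per worker (round-robin)."""
    return [items[i::num_workers] for i in range(num_workers)]


def do_sub_work(chunk):
    """Worker side: process one chunk; here the 'work' is summing squares."""
    return sum(x * x for x in chunk)


def run_job(items, num_workers=4):
    """Master side: hand out the chunks, then combine the partial results."""
    chunks = split_work(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(do_sub_work, chunks))
    return sum(partial_results)


# Sum of squares of 0..9, computed as three independent pieces of sub-work.
total = run_job(list(range(10)), num_workers=3)
```

A real cluster adds scheduling, data locality, and failure handling on top of this basic split, process, and combine pattern.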

Resilient Distributed Dataset
• The core abstraction in Spark
– Resilient – lost data is recomputed
automatically (fault tolerant)
– Distributed – data is stored and processed across the nodes of the cluster
– Dataset – data can come from different data stores

• An RDD is a simple, immutable collection of
objects
• An RDD can contain any type of Scala, Java,
Python, or R object
• Each RDD is split into partitions,
which may be computed on different nodes of
the cluster
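
To make the idea of partitions concrete, here is a minimal plain-Python sketch that splits a collection into contiguous slices. The helper `partition` is hypothetical and is not how Spark itself assigns elements to partitions; it only illustrates that one logical dataset becomes several independent pieces.

```python
def partition(data, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(tuple(data[start:end]))  # tuples: each slice is immutable
        start = end
    return parts


parts = partition(list(range(1, 8)), 3)
```

Each tuple in `parts` could be handed to a different node. In PySpark, `rdd.glom().collect()` shows the real partition layout of an RDD, which may differ from this toy split.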

What is an RDD?
• RDDs are the fundamental unit of data in Spark
• Core Spark abstraction
• Enable parallel processing of a dataset
• Immutable, recomputable, fault tolerant
• Spark programs work by performing
operations on RDDs
• Transformations and actions are used to process
RDDs

RDD operations
• Two types of operations
▪ Transformation
- Creates a new RDD from an existing one
- E.g.: map, filter, flatMap, join, etc.
▪ Action
- Returns a result or writes it to storage
- E.g.: count, collect, saveAsTextFile, etc.

• Lazy evaluation
– transformations only record what has to be done;
execution does not start until an action is
triggered
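
The difference between transformations, actions, and lazy evaluation can be illustrated with a small plain-Python model. `ToyRDD` is a hypothetical class written for this sketch; it borrows the method names `map`, `filter`, `collect`, and `count` but is not Spark's implementation.

```python
class ToyRDD:
    """A toy stand-in for an RDD: immutable source data plus a recorded plan."""

    def __init__(self, source, plan=()):
        self._source = tuple(source)
        self._plan = plan  # tuple of ("map" | "filter", function) steps

    # Transformations: return a NEW ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._plan + (("filter", pred),))

    def _run(self):
        """Replay the recorded plan over the source data."""
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger execution and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records when the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has executed at this point
result = evens.collect()          # the action triggers the work
```

Because `evens` keeps only the source data and the recorded steps, its contents can be recomputed at any time, which is the same idea that lets Spark rebuild a lost partition.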

Spark context
• The Spark context is an object
• Every Spark application requires a Spark context
• Main entry point of a Spark application
• Interacts with the cluster manager
• Tells Spark how to access the cluster
• RDDs are created using the Spark context

• The developer writes the application/program
• The program needs the Spark context object, the main
entry point of a Spark application, which
interacts with the cluster manager
• Data nodes are the slaves of HDFS
• Worker nodes are the slaves of Spark
• The cluster manager interacts with the worker
nodes to obtain resources
• The executor is the distributed agent responsible
for the execution of tasks

The driver program
• The driver program runs the main() function
of the application and is the place where the
Spark context is created
• The driver program, which runs on the master
node of the Spark cluster, schedules the job
execution and negotiates with the cluster
manager

Executor
• An executor is a distributed agent responsible for
the execution of tasks
• Every Spark application has its own executor
processes
• Executors perform all the data processing
• Read data from and write data to external
sources
• Store computation results in memory
(cache) or on disk
• Interact with the storage systems

Cluster manager
• An external service responsible for acquiring
resources on the Spark cluster and allocating
them to a Spark job

Spark core
• Main Spark engine
• Kernel of Spark
• In charge of essential I/O functionality

Spark SQL
• Enables users to run SQL queries
• Can handle structured or semi-structured data
• One of the most popular SQL engines in big data

Spark streaming
• Can handle live streams with low latency
• Supports powerful interactive and analytical
applications
• Can process near real-time data from multiple
sources
• Internally converts the streams into
micro-batches, processes them in the cluster,
and pushes the results to data stores
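
The micro-batch idea can be sketched in plain Python: incoming events are grouped by a fixed batch interval, and each group is then processed as a small batch job. The function `micro_batches` and the sample events are hypothetical and do not use the Spark Streaming API.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[key] for key in sorted(batches)]


# Timestamps in seconds; a batch interval of one second.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)

# "Process" each micro-batch; here the processing is a simple count.
counts = [len(batch) for batch in batches]
```

An event is only processed once its batch is complete, so the batch interval sets a lower bound on latency. This toy version also skips intervals in which no event arrived.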

MLlib
• Scalable machine learning library
• Used for advanced analytics

GraphX
• Enables users to process graph data
• We can represent our data as a graph
• E.g.:
– In LinkedIn, degrees of connection: 1st-degree and 2nd-degree
connections
– In Facebook, friends of friends
Such requirements can be handled efficiently by the
graph engine
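
As a small illustration of the friends-of-friends query, the following plain-Python sketch finds 2nd-degree connections in an adjacency-set graph. The function `second_degree` and the sample names are made up for this example; GraphX runs this kind of traversal in a distributed way on much larger graphs.

```python
def second_degree(graph, person):
    """Return the people exactly two hops from `person` (friends of friends)."""
    first = set(graph.get(person, ()))
    second = set()
    for friend in first:
        second.update(graph.get(friend, ()))
    # Exclude direct friends and the person themselves.
    return second - first - {person}


# Undirected friendship graph stored as adjacency sets.
graph = {
    "ann": {"bob", "cara"},
    "bob": {"ann", "dan"},
    "cara": {"ann", "dan", "eve"},
    "dan": {"bob", "cara"},
    "eve": {"cara"},
}
fof = second_degree(graph, "ann")
```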

Storage system
• Spark depends on third-party storage
systems, such as:
– HDFS
– HBase
– Cassandra
– Amazon S3 and so on

Disadvantages
• No file management system of its own
• Expensive (in-memory processing needs a lot of RAM)
• Only near real-time processing (micro-batches)
