Welcome to the Session on Spark Architecture
Agenda
• World Prior to Spark
• Philosophy of Distributed Systems
• Google File System & its Architecture
• Introduction to Spark Architecture
World Prior to Spark
Exercise
Find the sum of all these multiplications. (The number pairs appeared in a figure on the original slide.)
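As a minimal sketch of the idea behind the exercise: split the multiplications across workers, compute them in parallel, and sum the results. This map-then-reduce shape is exactly what Spark generalizes. The pairs below are hypothetical, since the slide's actual numbers are not in the text.

```python
# Each worker multiplies one pair independently (map), then the
# results are summed (reduce). Input pairs are made up for illustration.
from multiprocessing import Pool

pairs = [(3, 7), (12, 5), (8, 9), (4, 6)]  # hypothetical input pairs

def multiply(pair):
    a, b = pair
    return a * b

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        products = pool.map(multiply, pairs)  # map step, runs in parallel
    total = sum(products)                     # reduce step
    print(total)                              # 21 + 60 + 72 + 24 = 177
```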
 Distributed Systems:
• A collection of individual computing devices that can communicate with each other
• The computing devices are autonomous in nature
• The independent computing devices are called nodes
• Nodes can act independently of each other
• Nodes are programmed to achieve common goals, which they realize by exchanging messages with each other (a message-passing system)
• Distribution software called middleware runs on the OS of each node
• The system should appear to its users as a single coherent system
 Properties of Distributed Systems:
• Concurrency: multiple programs run at the same time
• Shared Data: data is accessed simultaneously by multiple entities
• No Global Clock: each component has only a local notion of time
• Interdependency: the otherwise independent components depend on each other to achieve the common goal
Logical Design of a Distributed System
 Distributed Computing System Design Challenges:
• Communication: communication among processes
• Processes: management of processes/threads on clients and servers
• Synchronization: coordination among the processes is essential
• Fault Tolerance: handling failures of links, nodes, and processes
• Transparency: hiding the implementation details from the user (a single coherent system)
 Algorithmic challenges in Distributed Computing Systems:
• Synchronization/Coordination Mechanisms: the system must be able to operate concurrently
 Algorithms:
• Leader Election (see the sketch after this list)
• Mutual Exclusion
• Termination Detection
• Garbage Collection
• Fault Tolerance:
 Algorithms:
• Consensus Algorithms
• Voting and Quorum Systems
• Self-Stabilizing Systems
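As a hedged illustration of one of these algorithms, here is a toy leader election by highest node ID, the core idea behind the bully algorithm. The node IDs and liveness map are hypothetical; a real implementation would exchange election messages over the network.

```python
# Toy leader election: every live node proposes itself, highest ID wins.
def elect_leader(node_ids, alive):
    """Return the highest-ID node that is still alive, or None."""
    live_nodes = [n for n in node_ids if alive.get(n, False)]
    return max(live_nodes) if live_nodes else None

nodes = [1, 2, 3, 4, 5]
alive = {1: True, 2: True, 3: False, 4: True, 5: False}  # nodes 3 and 5 failed
print(elect_leader(nodes, alive))  # -> 4: highest-ID live node becomes leader
```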
 GFS: The Google File System is a scalable distributed file system for large, data-intensive applications.
 Motivation for GFS:
1) Exploiting commodity hardware (inexpensive Linux machines)
2) Maximizing performance per dollar
 Goals:
1) Performance
2) Scalability
3) Reliability
4) Availability
 The design of GFS is driven by:
1) Component failures being the norm, not the exception
2) Huge files
3) Files mutated mostly by appending rather than overwriting
4) Co-designing the file system API with the applications
Google File System
Cluster Architecture
 GFS Overview:
• Single Master: centralized management
• Files stored as chunks: with a fixed size of 64 MB each
• Reliability through replication: each chunk is replicated across 3 or more chunkservers
• No data caching: because the data sets are so large, clients and chunkservers do not cache file data (clients do cache metadata)
• Interface: a familiar file system interface (create, delete, open, close, read, write), plus snapshot and record-append operations
 Role of the MASTER: maintains all file metadata
• File namespace
• File-to-chunk mapping (chunks are 64 MB in GFS; Hadoop's HDFS later used 64-128 MB blocks)
• Chunk location information
• Monitoring chunkservers via heartbeat messages
• Acts as the centralized controller
 Operation Log: the metadata journal maintained by the Master
• A persistent record of critical metadata changes
• Replicated on multiple remote machines
• The Master recovers its file system state by replaying the operation log (a toy sketch follows)
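A toy sketch of the metadata a GFS-style master keeps in memory, and how critical changes are appended to an operation log before being applied. The structure and function names are illustrative, not GFS's actual code.

```python
# Master-side metadata: namespace, file-to-chunk mapping, chunk locations.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

namespace = set()          # file namespace
file_to_chunks = {}        # file path -> ordered list of chunk handles
chunk_locations = {}       # chunk handle -> list of chunkserver addresses

def log_and_apply(log, record):
    log.append(record)     # journal first (GFS replicates this to remote machines)
    op, args = record
    if op == "create":
        (path,) = args
        namespace.add(path)
        file_to_chunks[path] = []
    elif op == "add_chunk":
        path, handle, servers = args
        file_to_chunks[path].append(handle)
        chunk_locations[handle] = servers  # >= 3 replicas for reliability

operation_log = []
log_and_apply(operation_log, ("create", ("/data/web.log",)))
log_and_apply(operation_log, ("add_chunk", ("/data/web.log", "c-0001",
                                            ["cs1", "cs2", "cs3"])))
# Recovery: replay operation_log from the start to rebuild the metadata.
```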
GFS Architecture
Consistency Model
 SPARK keywords and their rough GFS analogues:
• Driver -> SparkSession <-> Master in GFS
• Cluster Manager
• Executor <-> processes running on nodes in GFS
• Worker Node <-> nodes in GFS
• DAG <-> metadata in GFS
• Partition <-> chunk in GFS
 Driver: the driver is the process that clients use to submit applications to Spark.
 Cluster Manager: the cluster manager launches executors on the worker nodes on behalf of the driver.
 SparkSession: the SparkSession object represents a connection to a Spark cluster.
 Executor: Spark executors are the processes on which Spark DAG tasks run; each executor is a JVM process.
 DAG (Directed Acyclic Graph): a DAG in Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the transformations/actions applied to those RDDs (a minimal example follows).
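A minimal PySpark sketch tying these keywords together: the SparkSession is the connection to the cluster, transformations add edges to the DAG lazily, and an action triggers execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])       # RDD: a vertex in the DAG
squares = nums.map(lambda x: x * x)          # transformation: adds an edge, no work yet
evens = squares.filter(lambda x: x % 2 == 0) # another transformation, still lazy
print(evens.collect())                       # action: DAG is executed -> [4, 16]

spark.stop()
```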
Correlation to SPARK
SPARK Architecture
 Role of the Driver:
• Takes the application processing input from the client
• Takes all the transformations/actions and creates the DAG
• Stores metadata about all RDDs and their partitions
• Plans the physical execution of the program
• Holds information about the executors
• Monitors the set of running executors
 Role of the Executor:
• Each executor reserves CPU and memory resources on a worker node in the cluster
• Executors work in parallel
• Before executors begin execution, they register themselves with the driver program
 Role of Worker Nodes:
• Worker nodes host the executor processes
• Each worker node has a fixed number of executors allotted to it
 Calculating the number of executors
Configuration: 6 nodes, each with 16 cores and 64 GB RAM.
Calculation:
Assumption: on each node, 1 core and 1 GB RAM are reserved for the operating system and Hadoop daemons, leaving 15 cores and 63 GB RAM per node.
Number of cores per executor = the number of concurrent tasks an executor can run. A common optimization number is 5, i.e. at most 5 concurrent tasks per executor.
Hence, cores per executor = 5.
Usable cores per node = 15, so executors per node = 15 / 5 = 3.
Total executors = 6 nodes x 3 = 18 (see the config sketch below).
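A sketch of turning this sizing into a configuration. The figures follow the slide's worked example (6 nodes x 16 cores x 64 GB); the memory value and the note about the ApplicationMaster are illustrative conventions, not universal recommendations.

```python
from pyspark.sql import SparkSession

# Per node: 16 cores - 1 (OS/daemons) = 15 usable; 15 / 5 cores-per-executor = 3 executors.
# Cluster: 6 nodes * 3 = 18 executors (on YARN, one is often given up to the
# ApplicationMaster, leaving 17). Memory: 63 GB / 3 ~ 21 GB per executor,
# reduced to ~19 GB here to leave headroom for off-heap/YARN overhead.
spark = (SparkSession.builder
         .appName("sizing-demo")
         .config("spark.executor.cores", "5")
         .config("spark.executor.instances", "18")
         .config("spark.executor.memory", "19g")
         .getOrCreate())
```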
 Role of the Cluster Manager:
• Launches executors on worker nodes on behalf of the driver
• Monitors the worker nodes
 SPARK Overview:
• Apache Spark is a fast, general-purpose cluster computing system.
• It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
• It supports:
o Spark SQL - for SQL and structured data processing
o MLlib - for machine learning
o GraphX - for graph processing
o Spark Streaming - for streaming data
 Key features of SPARK:
• Data Parallelism
• Fault Tolerance (a short sketch of both follows)
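A small sketch of both features: data parallelism (the RDD is split into partitions processed in parallel by the executors) and fault tolerance (Spark records each RDD's lineage and recomputes lost partitions instead of replicating the data). The partition count of 18 echoes the sizing example above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("features-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000), numSlices=18)  # 18 partitions, one per executor
total = rdd.map(lambda x: x * 2).sum()                # partitions processed in parallel
print(rdd.getNumPartitions(), total)

# toDebugString() shows the lineage Spark would replay to rebuild a
# partition lost to an executor failure.
print(rdd.toDebugString().decode())

spark.stop()
```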
References:
• Distributed Computing: Fundamentals, Simulations, and Advanced Topics - Hagit Attiya and Jennifer L. Welch
• Introduction to Distributed Systems - Prof. Rajiv Misra, IIT Patna
• Spark Documentation - Apache Spark, https://spark.apache.org/
The End
