Welcome to the Session on Spark Architecture
Agenda
• World Prior to Spark
• Philosophy of Distributed Systems
• Google File System & its Architecture
• Introduction to Spark Architecture
World Prior to Spark
Exercise
Find the sum of all these multiplications. (The number pairs appeared in a figure on the original slide.)
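As a minimal sketch of the idea behind the exercise: split the multiplications across workers, compute them in parallel, and sum the results. This map-then-reduce shape is exactly what Spark generalizes. The pairs below are hypothetical, since the slide's actual numbers are not in the text.

```python
# Each worker multiplies one pair independently (map), then the
# results are summed (reduce). Input pairs are made up for illustration.
from multiprocessing import Pool

pairs = [(3, 7), (12, 5), (8, 9), (4, 6)]  # hypothetical input pairs

def multiply(pair):
    a, b = pair
    return a * b

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        products = pool.map(multiply, pairs)  # map step, runs in parallel
    total = sum(products)                     # reduce step
    print(total)                              # 21 + 60 + 72 + 24 = 177
```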
 Distributed Systems:
• A collection of individual computing devices that can communicate with each other
• The computing devices are autonomous in nature
• The independent computing devices are called nodes
• Nodes can act independently of each other
• Nodes are programmed to achieve common goals, which they realize by exchanging messages with each other (a message-passing system)
• Distribution software called middleware runs on the OS of each node
• The system should appear to its users as a single coherent system
 Properties of Distributed Systems:
• Concurrency: multiple programs run at the same time
• Shared Data: data is accessed simultaneously by multiple entities
• No Global Clock: each component has only a local notion of time
• Interdependency: the otherwise independent components depend on each other to achieve the common goal
Logical Design of a Distributed System
 Distributed Computing System Design Challenges:
• Communication: communication among processes
• Processes: management of processes/threads on clients and servers
• Synchronization: coordination among the processes is essential
• Fault Tolerance: handling failures of links, nodes, and processes
• Transparency: hiding the implementation details from the user (a single coherent system)
 Algorithmic challenges in Distributed Computing Systems:
• Synchronization/Coordination Mechanisms: the system must be able to operate concurrently
 Algorithms:
• Leader Election (see the sketch after this list)
• Mutual Exclusion
• Termination Detection
• Garbage Collection
• Fault Tolerance:
 Algorithms:
• Consensus Algorithms
• Voting and Quorum Systems
• Self-Stabilizing Systems
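As a hedged illustration of one of these algorithms, here is a toy leader election by highest node ID, the core idea behind the bully algorithm. The node IDs and liveness map are hypothetical; a real implementation would exchange election messages over the network.

```python
# Toy leader election: every live node proposes itself, highest ID wins.
def elect_leader(node_ids, alive):
    """Return the highest-ID node that is still alive, or None."""
    live_nodes = [n for n in node_ids if alive.get(n, False)]
    return max(live_nodes) if live_nodes else None

nodes = [1, 2, 3, 4, 5]
alive = {1: True, 2: True, 3: False, 4: True, 5: False}  # nodes 3 and 5 failed
print(elect_leader(nodes, alive))  # -> 4: highest-ID live node becomes leader
```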
 GFS: The Google File System is a scalable distributed file system for large, data-intensive applications.
 Motivation for GFS:
1) Exploiting commodity hardware (inexpensive Linux machines)
2) Maximizing performance per dollar
 Goals:
1) Performance
2) Scalability
3) Reliability
4) Availability
 The design of GFS is driven by:
1) Component failures being the norm, not the exception
2) Huge files
3) Files mutated mostly by appending rather than overwriting
4) Co-designing the file system API with the applications
Google File System
Cluster Architecture
 GFS Overview:
• Single Master: centralized management
• Files stored as chunks: with a fixed size of 64 MB each
• Reliability through replication: each chunk is replicated across 3 or more chunkservers
• No data caching: because the data sets are so large, clients and chunkservers do not cache file data (clients do cache metadata)
• Interface: a familiar file system interface (create, delete, open, close, read, write), plus snapshot and record-append operations
 Role of the MASTER: maintains all file metadata
• File namespace
• File-to-chunk mapping (chunks are 64 MB in GFS; Hadoop's HDFS later used 64-128 MB blocks)
• Chunk location information
• Monitoring chunkservers via heartbeat messages
• Acts as the centralized controller
 Operation Log: the metadata journal maintained by the Master
• A persistent record of critical metadata changes
• Replicated on multiple remote machines
• The Master recovers its file system state by replaying the operation log (a toy sketch follows)
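A toy sketch of the metadata a GFS-style master keeps in memory, and how critical changes are appended to an operation log before being applied. The structure and function names are illustrative, not GFS's actual code.

```python
# Master-side metadata: namespace, file-to-chunk mapping, chunk locations.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

namespace = set()          # file namespace
file_to_chunks = {}        # file path -> ordered list of chunk handles
chunk_locations = {}       # chunk handle -> list of chunkserver addresses

def log_and_apply(log, record):
    log.append(record)     # journal first (GFS replicates this to remote machines)
    op, args = record
    if op == "create":
        (path,) = args
        namespace.add(path)
        file_to_chunks[path] = []
    elif op == "add_chunk":
        path, handle, servers = args
        file_to_chunks[path].append(handle)
        chunk_locations[handle] = servers  # >= 3 replicas for reliability

operation_log = []
log_and_apply(operation_log, ("create", ("/data/web.log",)))
log_and_apply(operation_log, ("add_chunk", ("/data/web.log", "c-0001",
                                            ["cs1", "cs2", "cs3"])))
# Recovery: replay operation_log from the start to rebuild the metadata.
```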
GFS Architecture
Consistency Model
 SPARK keywords and their rough GFS analogues:
• Driver -> SparkSession <-> Master in GFS
• Cluster Manager
• Executor <-> processes running on nodes in GFS
• Worker Node <-> nodes in GFS
• DAG <-> metadata in GFS
• Partition <-> chunk in GFS
 Driver: the driver is the process that clients use to submit applications to Spark.
 Cluster Manager: the cluster manager launches executors on the worker nodes on behalf of the driver.
 SparkSession: the SparkSession object represents a connection to a Spark cluster.
 Executor: Spark executors are the processes on which Spark DAG tasks run; each executor is a JVM process.
 DAG (Directed Acyclic Graph): a DAG in Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the transformations/actions applied to those RDDs (a minimal example follows).
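A minimal PySpark sketch tying these keywords together: the SparkSession is the connection to the cluster, transformations add edges to the DAG lazily, and an action triggers execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])       # RDD: a vertex in the DAG
squares = nums.map(lambda x: x * x)          # transformation: adds an edge, no work yet
evens = squares.filter(lambda x: x % 2 == 0) # another transformation, still lazy
print(evens.collect())                       # action: DAG is executed -> [4, 16]

spark.stop()
```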
Correlation to SPARK
SPARK Architecture
 Role of the Driver:
• Takes the application processing input from the client
• Takes all the transformations/actions and creates the DAG
• Stores metadata about all RDDs and their partitions
• Plans the physical execution of the program
• Holds information about the executors
• Monitors the set of running executors
 Role of the Executor:
• Each executor reserves CPU and memory resources on a worker node in the cluster
• Executors work in parallel
• Before executors begin execution, they register themselves with the driver program
 Role of Worker Nodes:
• Worker nodes host the executor processes
• Each worker node has a fixed number of executors allotted to it
 Calculating the number of executors
Configuration: 6 nodes, each with 16 cores and 64 GB RAM.
Calculation:
Assumption: on each node, 1 core and 1 GB RAM are reserved for the operating system and Hadoop daemons, leaving 15 cores and 63 GB RAM per node.
Number of cores per executor = the number of concurrent tasks an executor can run. A common optimization number is 5, i.e. at most 5 concurrent tasks per executor.
Hence, cores per executor = 5.
Usable cores per node = 15, so executors per node = 15 / 5 = 3.
Total executors = 6 nodes x 3 = 18 (see the config sketch below).
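A sketch of turning this sizing into a configuration. The figures follow the slide's worked example (6 nodes x 16 cores x 64 GB); the memory value and the note about the ApplicationMaster are illustrative conventions, not universal recommendations.

```python
from pyspark.sql import SparkSession

# Per node: 16 cores - 1 (OS/daemons) = 15 usable; 15 / 5 cores-per-executor = 3 executors.
# Cluster: 6 nodes * 3 = 18 executors (on YARN, one is often given up to the
# ApplicationMaster, leaving 17). Memory: 63 GB / 3 ~ 21 GB per executor,
# reduced to ~19 GB here to leave headroom for off-heap/YARN overhead.
spark = (SparkSession.builder
         .appName("sizing-demo")
         .config("spark.executor.cores", "5")
         .config("spark.executor.instances", "18")
         .config("spark.executor.memory", "19g")
         .getOrCreate())
```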
 Role of the Cluster Manager:
• Launches executors on worker nodes on behalf of the driver
• Monitors the worker nodes
 SPARK Overview:
• Apache Spark is a fast, general-purpose cluster computing system.
• It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
• It supports:
o Spark SQL - for SQL and structured data processing
o MLlib - for machine learning
o GraphX - for graph processing
o Spark Streaming - for streaming data
 Key features of SPARK:
• Data Parallelism
• Fault Tolerance (a short sketch of both follows)
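A small sketch of both features: data parallelism (the RDD is split into partitions processed in parallel by the executors) and fault tolerance (Spark records each RDD's lineage and recomputes lost partitions instead of replicating the data). The partition count of 18 echoes the sizing example above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("features-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000), numSlices=18)  # 18 partitions, one per executor
total = rdd.map(lambda x: x * 2).sum()                # partitions processed in parallel
print(rdd.getNumPartitions(), total)

# toDebugString() shows the lineage Spark would replay to rebuild a
# partition lost to an executor failure.
print(rdd.toDebugString().decode())

spark.stop()
```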
References:
• Distributed Computing: Fundamentals, Simulations, and Advanced Topics - Hagit Attiya and Jennifer L. Welch
• Introduction to Distributed Systems - Prof. Rajiv Misra, IIT Patna
• Spark Documentation - Apache Spark, https://spark.apache.org/
The End
