Welcome to the Session on Spark Architecture
Agenda
• World Prior to Spark
• Philosophy of Distributed Systems
• Google File System & its Architecture
• Introduction to Spark Architecture
World Prior to Spark ??
Exercise
Find the Sum of all these
multiplications.
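The numbers for this exercise were shown on the slide and are not reproduced here, so the pairs below are made up for illustration. A minimal sketch of how the work could be divided, assuming each worker gets one chunk of the list: every worker computes a partial sum of products, and the partial sums are added at the end. The names `partial_sum`, `split`, and `distributed_sum` are hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(pairs):
    """Sum of a * b over one chunk of (a, b) pairs."""
    return sum(a * b for a, b in pairs)


def split(pairs, n_chunks):
    """Split the list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(pairs) // n_chunks))  # ceiling division, at least 1
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]


def distributed_sum(pairs, n_workers=3):
    """Compute partial sums in parallel, then combine them."""
    chunks = split(pairs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


# Illustrative input only; not the numbers from the slide.
pairs = [(2, 3), (4, 5), (6, 7), (8, 9), (1, 10)]
total = distributed_sum(pairs)
print(total)
```

The same shape (split, compute locally, combine) is what the rest of the session builds on.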
 Distributed Systems :-
• Collection of Individual computing Devices that can communicate with each other
• Computing Devices are Autonomous in nature
• Independent Computing devices are called Nodes
• Nodes can act independently of each other
• Nodes are programmed to achieve common goals which are realized by exchanging messages
with each other ( Message Passing System)
• Has distribution software called Middleware, which runs on top of the OS of each Node
• It should appear to its users as a Single Coherent System
 Properties of Distributed Systems :-
• Concurrency : Multiple programs run together
• Shared Data : Data is accessed simultaneously by multiple entities
• No Global Clock : Each component has a local notion of time
• Interdependency : Independent components depend on each other
Logical Design of Distributed System
 Distributed Computing System Design Challenges:-
• Communication :- Communication among processes
• Processes :- Management of processes/threads on clients and servers
• Synchronization :- Coordination among the processes is essential
• Fault Tolerance :- Failures of Link/Node/Processes
• Transparency :- Hiding the Implementation policies from the user (Single Coherent System)
 Algorithmic challenges in Distributed Computing Systems:-
• Synchronization/ Coordination Mechanism :- System must be allowed to operate concurrently
 Algorithms:-
• Leader Election
• Mutual Exclusion
• Termination Detection
• Garbage Collection
• Fault Tolerance :-
 Algorithms:-
• Consensus Algorithm
• Voting and Quorum Systems
• Self Stabilizing Systems
 GFS :- Google File System is a scalable distributed file system for large data-intensive
applications
 Motivation for GFS:-
1) Exploiting Commodity Hardware – Linux Machines
2) Maximize performance per dollar
 Goals :-
1) Performance
2) Scalability
3) Reliability
4) Availability
 Design of GFS is Driven by :-
1) Component Failures
2) Huge Files
3) Mutation of Files
4) File System API
Google File System
Cluster Architecture
 GFS Overview :-
• Single Master :- Centralized Management
• Files Stored as Chunks :- With fixed size of 64 MB each
• Reliability through Replication:- Each chunk is replicated across 3 or more chunk servers
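• No Data Caching :- Clients and chunk servers do not cache file data, due to the large size of data sets
• Interface :- Familiar file system operations (create, delete, open, close, read, write) plus snapshot and record append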
 Role of MASTER :- Maintains all File Meta Data
• File Namespace
• File to Chunk Mapping :- 1 chunk = 64 MB (HDFS, by comparison, defaults to 128 MB blocks)
• Chunk Location information
• Monitor - Heartbeat
• Centralized Controller
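The file-to-chunk mapping and chunk location lookup can be sketched as follows. This is a toy model under stated assumptions, not GFS code: the chunk handles, chunk server names, file path, and the functions `chunk_index`, `chunks_for_range`, and `locate` are all hypothetical. Only the fixed 64 MB chunk size and the 3 replicas per chunk come from the overview above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size from the overview


def chunk_index(byte_offset):
    """Index of the chunk that holds the given byte offset of a file."""
    return byte_offset // CHUNK_SIZE


def chunks_for_range(offset, length):
    """Chunk indexes touched by reading `length` bytes starting at `offset`."""
    if length <= 0:
        return []
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return list(range(first, last + 1))


# Hypothetical master metadata: file -> chunk handles, handle -> replicas.
file_to_chunks = {"/logs/web.log": ["h0", "h1", "h2"]}
chunk_locations = {
    "h0": ["cs1", "cs2", "cs3"],
    "h1": ["cs2", "cs3", "cs4"],
    "h2": ["cs1", "cs3", "cs4"],
}


def locate(path, offset):
    """Return the chunk handle and replica list for a byte offset in a file."""
    handle = file_to_chunks[path][chunk_index(offset)]
    return handle, chunk_locations[handle]


print(locate("/logs/web.log", 70 * 1024 * 1024))
```

In this sketch the client asks the master only for the handle and locations; the data itself would then be read from one of the listed chunk servers.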
 Operation Log :- Record of metadata changes maintained by
Master
• Persistent record of critical metadata
changes
• Replicated on Multiple remote machines
• Master recovers its file system from
operational log
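Recovery from the log can be pictured as replaying every recorded metadata change in order. The sketch below is an illustration only: the JSON record format, the operation names (`create`, `add_chunk`, `delete`), and the functions `apply_record` and `recover` are assumptions made for this example, not the actual GFS log format.

```python
import json


def apply_record(namespace, record):
    """Apply one logged metadata change to the in-memory namespace."""
    op = record["op"]
    if op == "create":
        namespace[record["path"]] = []
    elif op == "add_chunk":
        namespace[record["path"]].append(record["chunk"])
    elif op == "delete":
        namespace.pop(record["path"], None)
    else:
        raise ValueError(f"unknown operation: {op}")


def recover(log_lines):
    """Rebuild the file -> chunk mapping by replaying the log from the start."""
    namespace = {}
    for line in log_lines:
        apply_record(namespace, json.loads(line))
    return namespace


# Hypothetical log contents, one JSON record per line, oldest first.
log = [
    json.dumps({"op": "create", "path": "/a"}),
    json.dumps({"op": "add_chunk", "path": "/a", "chunk": "c1"}),
    json.dumps({"op": "create", "path": "/b"}),
    json.dumps({"op": "add_chunk", "path": "/a", "chunk": "c2"}),
    json.dumps({"op": "delete", "path": "/b"}),
]

state = recover(log)
print(state)
```

Because the log is the persistent record, the order of records matters: replaying them in a different order could produce a different namespace.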
GFS Architecture
Consistency Model
 SPARK Keywords:
• Driver -> Spark Session <-> Master in GFS
• Cluster Manager
• Executor <-> Processes running on Nodes in GFS
• Worker Node <-> Nodes in GFS
• DAG <-> Metadata in GFS
• Partition <-> Chunk in GFS
 Driver : The Driver is the process that Clients use to submit applications to Spark
 Cluster Manager: The cluster manager launches executors on the worker
nodes on behalf of the driver.
 SparkSession: The SparkSession object represents a connection to a Spark
cluster.
 Executor: Spark Executors are the processes on which Spark DAG tasks run. Each
Executor is a JVM process
 DAG (Directed Acyclic Graph): DAG in Spark is a set of Vertices and Edges,
where vertices represent the RDDs and the edges represent the
operations (transformations) to be applied on the RDDs
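The vertex-and-edge picture can be illustrated with a toy lineage graph in plain Python. This is not Spark's API or its internal classes: the `Node` class and its methods are hypothetical and only mimic the idea that transformations record an edge to the parent dataset, while nothing is computed until a result is requested.

```python
class Node:
    """One vertex: a dataset defined by its parent vertex and an operation."""

    def __init__(self, parent=None, op=None, data=None, label="source"):
        self.parent = parent  # edge back to the dataset this one derives from
        self.op = op          # operation applied to the parent's rows
        self.data = data      # only set on the source vertex
        self.label = label

    def map(self, fn):
        """Record a map step; nothing is computed yet."""
        return Node(parent=self, op=lambda rows: [fn(r) for r in rows], label="map")

    def filter(self, pred):
        """Record a filter step; nothing is computed yet."""
        return Node(parent=self, op=lambda rows: [r for r in rows if pred(r)], label="filter")

    def lineage(self):
        """Labels from the source vertex down to this one."""
        node, labels = self, []
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return list(reversed(labels))

    def collect(self):
        """Walk the graph back to the source and apply each operation in turn."""
        if self.parent is None:
            return list(self.data)
        return self.op(self.parent.collect())


source = Node(data=[1, 2, 3, 4, 5])
result = source.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(result.lineage())
print(result.collect())
```

The example is a simple chain, which is the smallest case of a DAG; operations that combine two datasets would give a vertex more than one parent.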
Correlation to SPARK
SPARK Architecture
 Role of Driver:-
• Takes Application Processing input from Client
• Takes all Transformations /Actions and creates the DAG
• Stores metadata about all RDDs and their Partitions
• Plans the Physical execution of Program
• Contains information about Executors
• Monitors set of Executors Running
 Role of Executor:-
• Executor reserves CPU and memory resources on
worker Nodes in cluster
• Executors work in parallel
• Before Executors begin execution, they register
themselves with driver program
 Role of Worker Nodes:-
• Worker nodes host the Executor processes
• Worker Node has a fixed number of executors
allotted
 Calculation for number of Executors
Configuration:- Hardware – 6 Nodes, each Node has 16 cores and 64 GB RAM
Calculation:-
Assumption:- On each node, 1 core and 1 GB are reserved for the
Operating System and Hadoop Daemons, leaving
15 cores and 63 GB RAM per node
Number of cores per executor = Concurrent tasks an executor can run
Optimization Number : 5 -> means max 5 concurrent
tasks per executor
Hence, No of Cores / Executor = 5
Usable Cores / Node : 15
No of Executors / Node : 15 / 5 = 3
Total No of Executors = 6 * 3 = 18
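The arithmetic above can be written as a small helper. The function `executors_plan` and its parameter names are hypothetical; the default values are the ones used in the calculation. The memory-per-executor figure is not on the slide: it is simply the usable RAM divided by the executors per node, and it ignores any memory overhead the cluster manager may require, so treat it as an upper bound.

```python
def executors_plan(nodes, cores_per_node, ram_gb_per_node,
                   reserved_cores=1, reserved_ram_gb=1, cores_per_executor=5):
    """Reproduce the executor sizing arithmetic for a homogeneous cluster."""
    usable_cores = cores_per_node - reserved_cores
    usable_ram_gb = ram_gb_per_node - reserved_ram_gb
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node == 0:
        raise ValueError("not enough cores per node for one executor")
    return {
        "usable_cores_per_node": usable_cores,
        "usable_ram_gb_per_node": usable_ram_gb,
        "executors_per_node": executors_per_node,
        "total_executors": executors_per_node * nodes,
        # Derived here, not stated in the calculation: RAM ceiling per executor.
        "max_ram_gb_per_executor": usable_ram_gb // executors_per_node,
    }


plan = executors_plan(nodes=6, cores_per_node=16, ram_gb_per_node=64)
print(plan)
```

Changing `cores_per_executor` shows the trade-off directly: fewer cores per executor gives more, smaller executors on the same hardware.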
 Role of Cluster Manager:-
• Launches Executors on worker nodes on behalf of Driver
• It Monitors worker Nodes
 SPARK Overview :-
• Apache Spark is a fast and general-purpose cluster
computing system.
• It provides high-level APIs in Java, Scala, Python and
R, and an optimized engine that supports general
execution graphs
• It Supports :
o Spark SQL - For SQL and Structured Data
processing,
o MLlib – For Machine Learning
o GraphX - For Graph Processing
o Spark Streaming - For Streaming Data
 Key features of SPARK:-
• Data Parallelism
• Fault Tolerance
References:
• Distributed Computing Fundamentals book - By Jennifer Welch
• Introduction to Distributed Systems - Prof. Rajiv Mishra – IIT Patna
• Spark Documentation - Apache Spark https://spark.apache.org/
The End
