A presentation on Apache Spark Architecture, covering its features, how it works, its applications, and more.
2. CONTENT
Introduction
Features
Understanding Apache Spark Architecture
Working of Apache Spark Architecture
Applications
Conclusion
References
3. INTRODUCTION
An open-source cluster-computing framework that provides in-memory processing of large
amounts of data.
Its performance is up to 100 times faster in memory and 10 times faster on disk when
compared to Hadoop.
It provides powerful APIs that help correlate unstructured, structured and semi-structured
data, so that the data can be analysed and evaluated to make future predictions.
4. FEATURES
Fast
• 100x faster in memory.
• 10x faster on disk.
Ease of development (see the shell sketch below)
• Interactive shell.
• Less code.
• More operators.
• Write programs quickly.
Deployment flexibility
• Deployment: Mesos, YARN, Standalone.
• Storage: MapR-XD, HDFS, S3.
Unified stack
• Build applications combining different processing models.
• Batch analytics, streaming analytics and interactive analytics.
Multi-language support
• Scala
• Python
• Java
• SQL
• R
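To ground the "Ease of development" claim, here is a minimal sketch of an interactive spark-shell session; the shell provides the SparkContext `sc`, and the data is illustrative.

```scala
// Illustrative spark-shell session: `sc` (a SparkContext) is provided by
// the shell, so a distributed computation takes only a couple of lines.
val nums = sc.parallelize(1 to 1000000)   // distribute a local range
val mean = nums.map(_.toDouble).mean()    // mean() is one of many built-in operators
println(s"Mean: $mean")
```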
5. UNDERSTANDING APACHE SPARK ARCHITECTURE
Apache Spark Architecture is based on two main abstractions:
1. Resilient Distributed Dataset (RDD)
2. Directed Acyclic Graph (DAG)
Apache Spark RDDs support two types of operations:
1. Transformations
2. Actions
Fig.: Lazy Transformation Model
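As a minimal sketch of the lazy transformation model in the figure: transformations only record lineage in the DAG, and nothing runs until an action is called. This assumes a local Spark installation with spark-core on the classpath; the object name is illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyTransformationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyDemo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 100)   // create an RDD

    // Transformations are lazy: nothing is computed here, Spark only
    // records the lineage (map -> filter) in the DAG.
    val squares = numbers.map(n => n * n)
    val evens   = squares.filter(_ % 2 == 0)

    // An action forces evaluation of the whole lineage.
    val total = evens.reduce(_ + _)
    println(s"Sum of even squares: $total")

    sc.stop()
  }
}
```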
6. (Continued…)
The Apache Spark architecture has two main daemons along with a cluster manager. It is
basically a master/slave architecture. The two daemons are:
1. Master Daemon: It handles the master/driver process.
2. Worker Daemon: It handles the slave process.
Role of Driver
Drives the user's application.
Creates a JVM for the code submitted by the client.
Stores the metadata about all the Resilient Distributed Datasets (RDDs) and their partitions, as illustrated below.
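A small sketch of the driver's metadata role, runnable in spark-shell (which provides `sc`): the driver can inspect an RDD's partition count and lineage without launching any job.

```scala
// The driver tracks RDD metadata: partition counts and the lineage (DAG).
val data = sc.parallelize(1 to 1000, numSlices = 8)
println(data.getNumPartitions)          // 8 -- partition metadata lives on the driver
println(data.map(_ + 1).toDebugString)  // lineage recorded by the driver; no job runs
```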
7. (Continued…)
Role of Cluster Manager
The role of the cluster manager is to allocate resources across applications. Spark is capable of running on a large
number of clusters.
Schedules the Spark application.
Allocates the resources to the driver program to run the tasks.
Spark supports various types of cluster managers, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler.
Here, the Standalone Scheduler is Spark's own cluster manager, which makes it possible to install Spark on an empty set of
machines.
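A minimal sketch of how an application selects a cluster manager: the same code targets YARN, Mesos or the Standalone Scheduler just by changing the master URL. The host names and ports below are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MasterUrlDemo")
      // Pick one master URL depending on the cluster manager in use:
      .setMaster("local[*]")               // local threads, no cluster
      // .setMaster("spark://host:7077")   // Standalone Scheduler
      // .setMaster("mesos://host:5050")   // Apache Mesos
      // .setMaster("yarn")                // Hadoop YARN
    val sc = new SparkContext(conf)
    println(s"Running against master: ${sc.master}")
    sc.stop()
  }
}
```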
Role of Worker
Consists of executors and tasks.
Executes the tasks assigned by the cluster manager.
Role of Executor
The executor performs all the data processing.
Reads data from and writes data to external sources.
Stores the computation results in memory, in cache or on hard disk drives.
Interacts with the storage systems.
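To illustrate how executors keep results in memory, cache or on disk, here is a small sketch runnable in spark-shell; the input path is a hypothetical placeholder.

```scala
import org.apache.spark.storage.StorageLevel

// `sc` is provided by spark-shell; "data/input.txt" is a hypothetical path.
val lines  = sc.textFile("data/input.txt")
val parsed = lines.filter(_.nonEmpty)

// persist() tells executors where cached partitions should live:
// in memory first, spilling to local disk when memory runs out.
parsed.persist(StorageLevel.MEMORY_AND_DISK)
println(parsed.count())   // first action: executors compute and cache partitions
println(parsed.count())   // second action: served from the executor-side cache
```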
10. WORKING OF APACHE SPARK ARCHITECTURE
The client submits the Spark user application code. When an application code is submitted, the
driver implicitly converts the user code, which contains transformations and actions, into a logical
directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining
transformations.

After that, it converts the logical graph (DAG) into a physical execution plan with many
stages. After converting into a physical execution plan, it creates physical execution units called
tasks under each stage. The tasks are then bundled and sent to the cluster.

Now the driver talks to the cluster manager and negotiates resources. The cluster manager
launches executors on worker nodes on behalf of the driver. At this point, the driver sends the
tasks to the executors based on data placement. When the executors start, they register themselves
with the driver, so the driver has a complete view of the executors that are executing the tasks.
During the course of execution, the driver program monitors the set of executors that run the
tasks. The driver node also schedules future tasks based on data placement.
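A small end-to-end sketch of the flow just described, runnable in spark-shell: the transformations build the DAG, the shuffle introduced by reduceByKey splits the physical plan into two stages, and the action schedules the tasks on executors.

```scala
val words  = sc.parallelize(Seq("spark", "rdd", "dag", "spark", "rdd", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // shuffle => stage boundary

// toDebugString prints the lineage; indentation marks the stage split.
println(counts.toDebugString)
counts.collect().foreach(println) // action: tasks are bundled and sent to executors
```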
12. CONCLUSION
Spark can run independently, which gives it flexibility.
The architecture demonstrates its ease of use, accessibility, and ability to handle big data
tasks.
Spark has come to dominate Hadoop mainly because of its speed, and it finds
usage in many industries. It has taken Hadoop MapReduce to a completely new level with fewer
shuffles during data processing. The system's efficiency is enhanced up to 100x by in-memory
data storage and real-time processing of data.
Lazy evaluation also contributes to this speed.