Spark architecture

Gaurav Biswas
BIT Mesra
16-04-2019 1

 SPARK & ITS FEATURES
 SPARK ARCHITECTURE
 RESILIENT DISTRIBUTED DATASETS (RDDs)
 DIRECTED ACYCLIC GRAPH (DAG)
 ADVANTAGES & DRAWBACKS
 CONCLUSION

 Apache Spark: an open-source cluster-computing
framework for real-time data processing
 According to Spark Certified Experts, Spark's
performance is up to 100 times faster in memory and
10 times faster on disk when compared to Hadoop MapReduce
 The main feature of Apache Spark is its in-memory
cluster computing, which increases the processing speed
of an application

 Speed:
Spark runs up to 100 times faster than Hadoop
MapReduce for large-scale data processing
 Powerful Caching:
A simple programming layer provides powerful
caching and disk persistence capabilities.
 Deployment:
It can be deployed through Mesos, Hadoop via
YARN, or Spark’s own cluster manager
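
The caching point above can be illustrated with a toy sketch in plain Python. This is not Spark's API: the class `ToyDataset` and the helper `expensive_squares` are hypothetical names made up for this illustration. In PySpark the comparable RDD methods are `cache()` and `persist()`.

```python
class ToyDataset:
    """Toy stand-in for a dataset that is recomputed unless it is cached."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._cached = None          # in-memory copy, filled on first use
        self._use_cache = False
        self.compute_calls = 0       # how many times the data was recomputed

    def cache(self):
        """Mark the dataset to be kept in memory after its first computation."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, reusing the in-memory copy when caching is on."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive_squares():
    # Stands in for a costly job such as reading and parsing input files.
    return [n * n for n in range(5)]


uncached = ToyDataset(expensive_squares)
uncached.collect()
uncached.collect()                   # recomputed a second time

cached = ToyDataset(expensive_squares).cache()
cached.collect()
cached.collect()                     # served from memory, no recomputation
```

After this runs, `uncached.compute_calls` is 2 while `cached.compute_calls` is 1, which is the saving that caching is meant to give when the same data is used more than once.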

 Real-Time:
It offers real-time computation and low latency
because of in-memory computation
 Polyglot:
Spark provides high-level APIs in Java, Scala,
Python, and R. Spark code can be written in any
of these four languages. It also provides a shell
in Scala and Python

Figure: Apache Spark architecture

 SPARK DRIVER :-
 Separate process that executes the user application
 Creates the SparkContext to schedule job execution
and negotiate with the cluster manager
 EXECUTORS :-
 Run tasks scheduled by the driver
 Store computation results in memory, on disk,
or off-heap
 Interact with storage systems
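
The division of work between driver and executors can be sketched in plain Python. This is only a toy model under stated assumptions: a local thread pool stands in for the executors, and the function names `split_into_partitions`, `run_task`, and `driver` are hypothetical, not part of Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the input into roughly equal chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: one task processes one partition."""
    return sum(x * x for x in partition)


def driver(data, num_executors=3):
    """Schedule one task per partition and combine the partial results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


total = driver(list(range(10)))   # sum of squares of 0..9
```

In a real cluster the executors are separate processes on worker nodes and the cluster manager decides where they run; the sketch only shows the pattern of splitting the data, running tasks in parallel, and returning results to the driver.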

 CLUSTER MANAGER :-
 The SparkContext works with the cluster
manager to manage various jobs
 The driver program and SparkContext take
care of job execution within the cluster

 Apache Spark Architecture is based on two main
abstractions:
 Resilient Distributed Dataset (RDD)
 Directed Acyclic Graph (DAG)
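
As a rough illustration of the DAG idea, the sketch below records a chain of processing steps as a dependency graph and derives a valid execution order from it. It uses Python's standard `graphlib` module (Python 3.9 or later), not Spark's scheduler. The step names and the helper `is_acyclic` are hypothetical and chosen only for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
lineage = {
    "read_lines": set(),
    "split_words": {"read_lines"},
    "pair_with_one": {"split_words"},
    "sum_by_word": {"pair_with_one"},
    "collect": {"sum_by_word"},
}

# A valid execution order visits every step after its dependencies.
execution_order = list(TopologicalSorter(lineage).static_order())


def is_acyclic(graph):
    """Return True when the dependency graph contains no cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True
```

Because the graph is directed and acyclic, every step can be traced back to its inputs. That recorded lineage is what allows lost data to be recomputed from its source rather than restored from a replica.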

 RDDs support two types of operations:
 Transformations: operations applied to an
existing RDD to create a new RDD. They are
evaluated lazily.
 Actions: operations applied on an RDD to
instruct Apache Spark to run the computation
and pass the result back to the driver.
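
The difference between transformations and actions can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical name and this is not Spark's implementation. The method names `map`, `filter`, `collect`, and `count` mirror real RDD methods, but here they only record and replay steps on a local list.

```python
class ToyRDD:
    """Toy model of lazy transformations and eager actions (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source    # the original data
        self._steps = steps      # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    # Actions: run the recorded steps and hand a result to the caller.
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            # Inside a method body these names refer to the built-ins.
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def count(self):
        return len(self.collect())


calls = []


def double(x):
    calls.append(x)          # records when the function actually runs
    return x * 2


numbers = ToyRDD([1, 2, 3, 4, 5])
doubled = numbers.map(double)               # transformation: nothing runs
large = doubled.filter(lambda x: x > 4)     # transformation: nothing runs
calls_before_action = len(calls)            # still 0 at this point
result = large.collect()                    # action: triggers the work
```

Here `double` is not called until `collect()` runs, and each transformation leaves the original `numbers` object unchanged and returns a new one.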

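 ADVANTAGES:
 Integration with Hadoop
 Faster processing than Hadoop MapReduce
 Near-real-time stream processing
 DRAWBACKS:
 No file management system of its own
 No true real-time processing (streams are handled as micro-batches)
 Not cost-effective (in-memory processing needs a lot of RAM)
 Manual optimization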

 Spark makes it easy to write and run complicated data
processing jobs
 It enables computation at a very large scale
 Although Spark has several limitations, it is still trending in
the big data world
 Because of these drawbacks, other technologies are
overtaking Spark
 For example, Apache Flink offers more complete real-time
processing than Spark
 In this way, other technologies are overcoming the
drawbacks of Spark