Gaurav Biswas
BIT Mesra
16-04-2019 1
 SPARK & ITS FEATURES
 SPARK ARCHITECTURE
 RESILIENT DISTRIBUTED DATASETS (RDDs)
 DIRECTED ACYCLIC GRAPH (DAG)
 ADVANTAGES & DRAWBACKS
 CONCLUSION
 Apache Spark: an open-source cluster computing
framework for real-time data processing
 According to Spark certified experts, Spark's
performance is up to 100 times faster in memory and
10 times faster on disk than Hadoop MapReduce
 The main feature of Apache Spark is its in-memory
cluster computing, which increases the processing speed
of an application
 Speed:
Spark runs up to 100 times faster than Hadoop
MapReduce for large-scale data processing
 Powerful Caching:
A simple programming layer provides powerful
caching and disk persistence capabilities.
 Deployment:
It can be deployed through Mesos, Hadoop via
YARN, or Spark’s own cluster manager
 Real-Time:
It offers real-time computation and low latency
because of in-memory computation
 Polyglot:
Spark provides high-level APIs in Java, Scala,
Python, and R, so Spark code can be written in any
of these four languages. It also provides interactive
shells for Scala and Python
Figure: Apache Spark architecture
 SPARK DRIVER:
 Separate process that executes the user application
 Creates the SparkContext to schedule job
execution and negotiate with the cluster
manager
 EXECUTORS:
 Run tasks scheduled by the driver
 Store computation results in memory, on disk,
or off-heap
 Interact with storage systems
 CLUSTER MANAGER:
 The SparkContext works with the cluster
manager to manage the various jobs
 The driver program and the SparkContext take
care of job execution within the cluster
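The division of work between driver and executors described above can be sketched in plain Python. This is a toy model and not Spark's API: the thread pool stands in for executors granted by a cluster manager, and the names `split_into_partitions` and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks of roughly equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_job(data, task, num_executors=3):
    """Toy 'driver': schedule one task per partition and gather the results."""
    partitions = split_into_partitions(data, num_executors)
    # The pool plays the role of the executors (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # map() returns results in partition order.
        return list(executors.map(task, partitions))


# Each "executor" sums its own partition; the "driver" combines the partials.
partials = run_job(list(range(1, 11)), sum)
total = sum(partials)
print(partials, total)
```

In a real cluster the executors are separate processes on worker nodes, and the cluster manager decides how many of them an application receives.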
 Apache Spark Architecture is based on two main
abstractions:
 Resilient Distributed Dataset (RDD)
 Directed Acyclic Graph (DAG)
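To make the DAG idea concrete, the sketch below models a job's lineage as a dependency graph using only the Python standard library. It is an illustration and not Spark's scheduler: the dataset names are hypothetical, and `ancestors` is a helper written for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a dataset; its value is the set of datasets it is derived from.
# All names are hypothetical.
lineage = {
    "lines": set(),
    "words": {"lines"},
    "pairs": {"words"},
    "stopwords": set(),
    "filtered": {"pairs", "stopwords"},
    "counts": {"filtered"},
}


def ancestors(graph, node):
    """Return every dataset that `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph[node])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph[parent])
    return seen


# A valid execution order always computes parents before their children.
order = list(TopologicalSorter(lineage).static_order())

# If "pairs" were lost, only its ancestors would have to be recomputed.
needed = ancestors(lineage, "pairs")
print(order)
print(needed)
```

Because the graph has no cycles, such an order always exists, and recording the lineage is what lets a lost dataset be rebuilt from its parents.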
 RDDs support two types of operations:
 Transformations: operations applied to an
existing RDD to create a new RDD. They are
evaluated lazily.
 Actions: operations applied to an RDD that
instruct Apache Spark to run the computation
and pass the result back to the driver.
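The difference between the two kinds of operation can be sketched with a toy, single-machine class. `ToyRDD` is hypothetical and is not Spark's API; it only mimics the behaviour that transformations are recorded and nothing runs until an action is called.

```python
class ToyRDD:
    """A toy stand-in for an RDD (an assumption of this sketch, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = source  # the original data
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and do no work.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    # Actions: run the recorded steps and return a plain Python value.
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            # Inside a method, map and filter refer to the Python builtins.
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # record every invocation to observe laziness
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # no work has been done yet
result = even_squares.collect()   # the action triggers the computation
print(calls_before_action, result)
```

Each transformation returns a new object and leaves the original untouched, which mirrors the immutability of RDDs.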
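 ADVANTAGES:
 Integration with Hadoop
 Faster than Hadoop MapReduce
 Near-real-time stream processing
 DRAWBACKS:
 No file management system of its own
 No support for true real-time processing
(streams are handled in micro-batches)
 Not cost effective (in-memory processing
needs a lot of RAM)
 Manual optimization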
 Spark makes it easy to write and run complicated data
processing jobs
 It enables computation at a very large scale
 Although Spark has several limitations, it is still popular in
the big data world
 Because of these drawbacks, other technologies are
emerging as alternatives to Spark
 For example, Apache Flink offers true real-time stream
processing, which Spark does not
 In this way, other technologies are overcoming the
drawbacks of Spark