SPARK ARCHITECTURE
 PRESENTED BY:
GAURAV BISWAS
BIT MESRA
SPARK COMPONENTS
 The Spark core is complemented by a set of powerful,
higher-level libraries:
 SparkSQL
 MLlib (for machine learning)
 GraphX
 Spark Streaming
 All of them build on Spark Core's central abstraction,
the RDD (Resilient Distributed Dataset)
SparkSQL Introduction
 Part of the core distribution since Spark 1.0 (2014)
 Integrated with the Spark stack
 Supports querying data either via SQL or via the Hive
Query Language
 Originated as a port of Apache Hive to run on top of
Spark (in place of MapReduce)
 Can weave SQL queries with code transformations (see
the sketch below)
 Can expose Spark datasets over the JDBC API and allow
running SQL-like queries on Spark data using
traditional BI and visualization tools
 Bindings in Python, Scala, and Java
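A minimal sketch of weaving SQL with code transformations, using the
later SparkSession API (the file name and column names here are
hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLSketch")
  .master("local[*]")               // run locally for the sketch
  .getOrCreate()
import spark.implicits._

// Load an external dataset and register it as a SQL view.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Query via SQL ...
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

// ... then continue with ordinary code transformations on the result.
adults.groupBy($"age").count().show()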
SQL Execution Plans
 Logical and physical query plans (both can be
inspected; see the sketch below)
 Both are trees representing query evaluation;
internal nodes are operators over the data
 The logical plan is higher-level and algebraic;
the physical plan is lower-level and operational
 Logical plan operators – conceptually describe what
operation needs to be performed
 Physical plan operators – correspond to implemented
access methods
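Spark prints both plans through explain(); a minimal sketch, reusing
the hypothetical adults DataFrame from the previous sketch:

// extended = true prints the parsed, analyzed, and optimized logical
// plans, followed by the physical plan that will actually run.
adults.explain(true)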
Key Features of MLlib
 Higher-level machine learning library built on Spark Core
 Built-in data analysis workflow
 Performance gains for free as Spark itself improves
 Scalable
 Python, Scala, and Java APIs
 Broad coverage of applications & algorithms
 Rapid improvements in speed & robustness
 Easy to use
 Integrated workflow
MLlib
 MLlib is a machine learning library that provides
various algorithms designed to scale out on a cluster,
covering classification, regression, clustering,
collaborative filtering, and so on.
 Some of these algorithms also work with streaming
data, such as linear regression using ordinary least
squares or k-means clustering (with more on the way);
see the sketch below.
 Apache Mahout (a machine learning library for
Hadoop) has already turned away from MapReduce
and joined forces with Spark MLlib.
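A minimal k-means sketch with MLlib's DataFrame-based API (the sample
points are made up; assumes the SparkSession `spark` from earlier):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Four 2-D points forming two obvious clusters.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
).map(Tuple1.apply)
val data = spark.createDataFrame(points).toDF("features")

// Fit a k-means model with k = 2 and print the cluster centers.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)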
GraphX
 GraphX is an API for graphs and graph-parallel
execution.
 It is a network graph analytics engine.
 GraphX is a library that performs graph-parallel
computation and manipulates graphs.
 It extends the Spark RDD API, so it can be used to
create directed graphs with arbitrary properties
attached to each vertex and edge.
GraphX
 GraphX also provides various operators and algorithms
to manipulate graphs.
 Clustering, classification, traversal, searching, and
pathfinding are all possible in GraphX.
Spark GraphX Features
 Flexibility:
 works seamlessly with both graphs and collections
 unifies ETL (Extract, Transform & Load), exploratory analysis, and
iterative graph computation within a single system
 we can view the same data as both graphs and collections, transform
and join graphs with RDDs efficiently, and write custom iterative graph
algorithms
 Speed:
 provides performance comparable to the fastest specialized graph
processing systems
 it matches the fastest graph systems while retaining Spark's
flexibility, fault tolerance, and ease of use
Spark GraphX Features
 Growing Algorithm Library:
 We can choose from a growing library of graph
algorithms (see the sketch below)
 Some of the popular algorithms are PageRank,
connected components, label propagation, strongly
connected components, and triangle count.
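A minimal GraphX sketch (the GraphX API is Scala-only): build a tiny
directed graph and run the built-in PageRank. The vertex names and
edge labels are made up; `sc` is an existing SparkContext.

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; edges carry their own property.
val users = sc.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.join(users).collect().foreach(println)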
Spark Core
 Home of the API for Spark's backbone, the RDD
 The basic functionality of Spark lives in Spark
Core:
 memory management
 fault recovery
 interaction with the storage system
 task dispatching and I/O functionality
Resilient Distributed Dataset (RDD)
 Spark introduces the concept of an RDD: an
immutable, fault-tolerant, distributed collection of
objects that can be operated on in parallel.
 An RDD can contain any type of object and is created
by loading an external dataset or by distributing a
collection from the driver program.
RDD Operations
 RDDs support two types of operations (see the sketch
below):
 Transformations: create a new dataset from an
existing one (such as map, filter, join, union, and so
on). They are performed on an RDD and yield a new
RDD containing the result.
 Actions: force the computation to be performed
(such as reduce, count, first, collect, save, and so on)
and return a value to the driver program, or write it to
a file, after running a computation on the dataset.
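A minimal sketch of lazy transformations followed by an action
(assumes an existing SparkContext `sc`):

val nums = sc.parallelize(1 to 10)

// Transformations: nothing runs yet; they just define a new RDD.
val squaresOfEvens = nums.filter(_ % 2 == 0).map(n => n * n)

// Action: triggers the actual computation and returns to the driver.
println(squaresOfEvens.collect().mkString(", "))  // 4, 16, 36, 64, 100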
Properties of RDDs
 Immutability
 Cacheable (via persist) and recoverable through lineage
 Lazy evaluation (definition is separate from execution)
 Type inferred
 Two ways to create RDDs (see the sketch below):
 parallelizing an existing collection in your driver program,
 referencing a dataset in an external storage system,
such as a shared file system, HDFS, HBase, Cassandra, or
any data source offering a Hadoop InputFormat.
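A minimal sketch of both creation paths (the file path is
hypothetical; `sc` is an existing SparkContext):

// 1. Parallelize a collection that already exists in the driver.
val fromCollection = sc.parallelize(Seq("a", "b", "c"))

// 2. Reference a dataset in external storage (here, a file on HDFS).
val fromStorage = sc.textFile("hdfs:///data/input.txt")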
Spark Streaming
 Spark Streaming is the component of Spark used to
process real-time streaming data (see the sketch below).
 It enables high-throughput, fault-tolerant stream
processing of live data streams.
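A minimal DStream word-count sketch (the host and port are
hypothetical; `sc` is an existing SparkContext):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process the stream in one-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()             // begin receiving and processing data
ssc.awaitTermination()  // block until the stream is stopped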
END!
