SPARK ARCHITECTURE
 PRESENTED BY:
GAURAV BISWAS
BIT MESRA
SPARK COMPONENTS
 The Spark core is complemented by a set of powerful,
higher-level libraries:
 SparkSQL
 MLlib (for machine learning)
 GraphX
 Spark Streaming
 All of them build on Spark Core's central abstraction,
the RDD (Resilient Distributed Dataset)
SparkSQL Introduction
 Part of the core distribution since Spark 1.0 (2014)
 Integrated with the Spark stack
 Supports querying data either via SQL or via the Hive
Query Language
 Originated as a port of Apache Hive to run on top of
Spark (in place of MapReduce)
 Can weave SQL queries with code transformations (see
the sketch below)
 Can expose Spark datasets over the JDBC API and allow
running SQL-like queries on Spark data using
traditional BI and visualization tools
 Bindings in Python, Scala, and Java
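A minimal sketch of weaving SQL with code transformations, using the
later SparkSession API (the file name and column names here are
hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLSketch")
  .master("local[*]")               // run locally for the sketch
  .getOrCreate()
import spark.implicits._

// Load an external dataset and register it as a SQL view.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Query via SQL ...
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

// ... then continue with ordinary code transformations on the result.
adults.groupBy($"age").count().show()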
SQL Execution Plans
 Logical and physical query plans (both can be
inspected; see the sketch below)
 Both are trees representing query evaluation;
internal nodes are operators over the data
 The logical plan is higher-level and algebraic;
the physical plan is lower-level and operational
 Logical plan operators – conceptually describe what
operation needs to be performed
 Physical plan operators – correspond to implemented
access methods
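Spark prints both plans through explain(); a minimal sketch, reusing
the hypothetical adults DataFrame from the previous sketch:

// extended = true prints the parsed, analyzed, and optimized logical
// plans, followed by the physical plan that will actually run.
adults.explain(true)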
Key Features of MLlib
 Higher-level machine learning library built on Spark Core
 Built-in data analysis workflow
 Performance gains for free as Spark itself improves
 Scalable
 Python, Scala, and Java APIs
 Broad coverage of applications & algorithms
 Rapid improvements in speed & robustness
 Easy to use
 Integrated workflow
MLlib
 MLlib is a machine learning library that provides
various algorithms designed to scale out on a cluster,
covering classification, regression, clustering,
collaborative filtering, and so on.
 Some of these algorithms also work with streaming
data, such as linear regression using ordinary least
squares or k-means clustering (with more on the way);
see the sketch below.
 Apache Mahout (a machine learning library for
Hadoop) has already turned away from MapReduce
and joined forces with Spark MLlib.
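A minimal k-means sketch with MLlib's DataFrame-based API (the sample
points are made up; assumes the SparkSession `spark` from earlier):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Four 2-D points forming two obvious clusters.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
).map(Tuple1.apply)
val data = spark.createDataFrame(points).toDF("features")

// Fit a k-means model with k = 2 and print the cluster centers.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)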
GraphX
 GraphX is an API for graphs and graph-parallel
execution.
 It is a network graph analytics engine.
 GraphX is a library that performs graph-parallel
computation and manipulates graphs.
 It extends the Spark RDD API, so it can be used to
create directed graphs with arbitrary properties
attached to each vertex and edge.
GraphX
 GraphX also provides various operators and algorithms
to manipulate graphs.
 Clustering, classification, traversal, searching, and
pathfinding are all possible in GraphX.
Spark GraphX Features
 Flexibility:
 works seamlessly with both graphs and collections
 unifies ETL (Extract, Transform & Load), exploratory analysis, and
iterative graph computation within a single system
 we can view the same data as both graphs and collections, transform
and join graphs with RDDs efficiently, and write custom iterative graph
algorithms
 Speed:
 provides performance comparable to the fastest specialized graph
processing systems
 it matches the fastest graph systems while retaining Spark's
flexibility, fault tolerance, and ease of use
Spark GraphX Features
 Growing Algorithm Library:
 We can choose from a growing library of graph
algorithms (see the sketch below)
 Some of the popular algorithms are PageRank,
connected components, label propagation, strongly
connected components, and triangle count.
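A minimal GraphX sketch (the GraphX API is Scala-only): build a tiny
directed graph and run the built-in PageRank. The vertex names and
edge labels are made up; `sc` is an existing SparkContext.

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; edges carry their own property.
val users = sc.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.join(users).collect().foreach(println)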
Spark Core
 Home of the API for Spark's backbone, the RDD
 The basic functionality of Spark lives in Spark
Core:
 memory management
 fault recovery
 interaction with the storage system
 task dispatching and I/O functionality
Resilient Distributed Dataset (RDD)
 Spark introduces the concept of an RDD: an
immutable, fault-tolerant, distributed collection of
objects that can be operated on in parallel.
 An RDD can contain any type of object and is created
by loading an external dataset or by distributing a
collection from the driver program.
RDD Operations
 RDDs support two types of operations (see the sketch
below):
 Transformations: create a new dataset from an
existing one (such as map, filter, join, union, and so
on). They are performed on an RDD and yield a new
RDD containing the result.
 Actions: force the computation to be performed
(such as reduce, count, first, collect, save, and so on)
and return a value to the driver program, or write it to
a file, after running a computation on the dataset.
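A minimal sketch of lazy transformations followed by an action
(assumes an existing SparkContext `sc`):

val nums = sc.parallelize(1 to 10)

// Transformations: nothing runs yet; they just define a new RDD.
val squaresOfEvens = nums.filter(_ % 2 == 0).map(n => n * n)

// Action: triggers the actual computation and returns to the driver.
println(squaresOfEvens.collect().mkString(", "))  // 4, 16, 36, 64, 100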
Properties of RDDs
 Immutability
 Cacheable (via persist) and recoverable through lineage
 Lazy evaluation (definition is separate from execution)
 Type inferred
 Two ways to create RDDs (see the sketch below):
 parallelizing an existing collection in your driver program,
 referencing a dataset in an external storage system,
such as a shared file system, HDFS, HBase, Cassandra, or
any data source offering a Hadoop InputFormat.
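A minimal sketch of both creation paths (the file path is
hypothetical; `sc` is an existing SparkContext):

// 1. Parallelize a collection that already exists in the driver.
val fromCollection = sc.parallelize(Seq("a", "b", "c"))

// 2. Reference a dataset in external storage (here, a file on HDFS).
val fromStorage = sc.textFile("hdfs:///data/input.txt")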
Spark Streaming
 Spark Streaming is the component of Spark used to
process real-time streaming data (see the sketch below).
 It enables high-throughput, fault-tolerant stream
processing of live data streams.
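A minimal DStream word-count sketch (the host and port are
hypothetical; `sc` is an existing SparkContext):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process the stream in one-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()             // begin receiving and processing data
ssc.awaitTermination()  // block until the stream is stopped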
END!
