Real-time Processing Systems
Apache Spark
1
Apache Spark
• Apache Spark is a lightning-fast cluster computing framework designed for fast
computation
• It builds on the Hadoop MapReduce model and extends it to efficiently support
more types of computation, including interactive queries and stream processing
• Spark is not a modified version of Hadoop and does not strictly depend on
Hadoop, because it has its own cluster management
• Spark can use Hadoop in two ways – for storage and for processing. Since Spark
has its own computation engine and cluster management, it typically uses
Hadoop for storage only
2
Apache Spark
• The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of applications
• Spark is designed to cover a wide range of workloads such as batch
applications, iterative algorithms, interactive queries and streaming
• Apart from supporting all these workloads in a single system, it
reduces the management burden of maintaining separate tools
3
Features of Apache Spark
• Speed − Spark can run applications on a Hadoop cluster up to
100 times faster in memory and about 10 times faster when running on
disk. It achieves this by reducing the number of read/write operations to
disk and keeping intermediate processing data in memory
• Supports multiple languages − Spark provides built-in APIs in Java,
Scala, and Python, so you can write applications in different
languages
• Advanced analytics − Spark supports not only ‘map’ and ‘reduce’ but
also SQL queries, streaming data, machine learning (ML),
and graph algorithms
4
Components of Spark
• The different components of Spark are described below
Apache Spark Core
• Spark Core is the underlying general execution engine for the Spark platform that all
other functionality is built upon. It provides in-memory computing and the ability to
reference datasets in external storage systems
5
Components of Spark
Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a data abstraction called
SchemaRDD (the predecessor of the DataFrame API), which provides support for structured and
semi-structured data; see the sketch after this slide
Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations
on those mini-batches of data
MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework on top of Spark that takes advantage of Spark's
distributed, memory-based architecture. Spark MLlib has been reported to be about nine times as
fast as the disk-based Hadoop version of Apache Mahout
GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computations that can model user-defined graphs using the Pregel abstraction
API
6
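A minimal sketch of how Spark SQL might be used from Python, assuming the Spark 1.3+ API where the SchemaRDD abstraction is exposed as a DataFrame through SQLContext; the table name and data are illustrative only:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[2]", "sql-example")
sqlContext = SQLContext(sc)

# Build a DataFrame (the successor of SchemaRDD) from an RDD of Rows
people = sqlContext.createDataFrame(
    sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=29)]))
people.registerTempTable("people")

# Run an SQL query over the structured, in-memory data
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()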
Spark Architecture
Spark Architecture includes the following three main components:
• Data Storage
• API
• Resource Management
Data Storage:
• Spark typically uses the HDFS file system for data storage, but it works with
any Hadoop-compatible data source, including HDFS, HBase,
Cassandra, etc.
7
Spark Architecture
API:
• The API enables application developers to create Spark-based
applications using a standard API interface. Spark provides APIs for the
Scala, Java, and Python programming languages
Resource Management:
• Spark can be deployed as a stand-alone server, or it can run on a
distributed computing framework such as Mesos or YARN; a deployment
configuration sketch follows this slide
8
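For illustration, the cluster manager is selected through the master URL when the application is configured; the host name and port below are placeholders:

from pyspark import SparkConf, SparkContext

# "spark://..." selects the standalone manager; "yarn" or "mesos://host:5050"
# would select YARN or Mesos, and "local[*]" runs everything in one JVM
conf = (SparkConf()
        .setAppName("deployment-example")
        .setMaster("spark://master-host:7077"))
sc = SparkContext(conf=conf)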
Resilient Distributed Datasets
• The Resilient Distributed Dataset (RDD) is the core concept in the Spark framework
• Spark stores data in RDDs spread across different partitions
• RDDs help with rearranging computations and optimizing data
processing
• They are also fault tolerant, because an RDD knows how to recreate
and recompute its dataset
• RDDs are immutable. You can modify an RDD with a transformation,
but the transformation returns a new RDD while the original
RDD remains the same
9
Resilient Distributed Datasets
• The RDD API provides various transformations and materializations of
data, as well as control over caching and partitioning of elements
to optimize data placement
• An RDD can be created either from external storage or from another RDD,
and it stores information about its parents so that a partition can be
recomputed in case of failure; see the sketch below
10
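A minimal sketch of both creation paths, assuming a local SparkContext; the input path is a placeholder:

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-creation")

lines = sc.textFile("/tmp/input.txt")        # RDD created from external storage
words = lines.flatMap(lambda l: l.split())   # RDD created from another RDD; the parent
                                             # is remembered so lost partitions can be recomputed
print(words.count())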
Resilient Distributed Datasets
RDD supports two types of operations:
• Transformation: Transformations don't return a single value; they return a
new RDD. Nothing gets evaluated when you call a transformation function –
it just takes an RDD and returns a new RDD
• Some of the transformation functions are map, filter, flatMap, groupByKey,
reduceByKey, aggregateByKey, pipe, and coalesce
• Action: An action evaluates and returns a new value. When an
action function is called on an RDD object, all the data processing queries
are computed at that time and the result value is returned
• Some of the action operations are reduce, collect, count, first, take,
countByKey, and foreach; a short example follows this slide
11
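A small illustration of the lazy-transformation / eager-action distinction, assuming a SparkContext is available as sc (as in spark-shell or pyspark):

nums = sc.parallelize([1, 2, 3, 4, 5])

evens = nums.filter(lambda x: x % 2 == 0)   # transformation: returns a new RDD, nothing runs yet
doubled = evens.map(lambda x: x * 2)        # another lazy transformation

print(doubled.collect())   # action: triggers evaluation and returns [4, 8]
print(doubled.count())     # action: returns 2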
RDD Persistence
• One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations
• When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that
dataset. This allows future actions to be much faster
• Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it
12
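For instance, persisting an RDD that several actions reuse might look like this sketch (the path and data are illustrative, and cache() is shorthand for persisting with the default memory-only level):

from pyspark import StorageLevel

logs = sc.textFile("/tmp/app-logs")                  # placeholder input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_ONLY)             # or simply errors.cache()

print(errors.count())   # first action: computes the partitions and caches them in memory
print(errors.first())   # later actions reuse the cache; lost partitions are recomputed from lineage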
Components
13
Components
• Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in the main program (called the driver
program)
• The SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos, or YARN), which allocate
resources across applications
• Spark acquires executors on nodes in the cluster, which are processes that
run computations and store data for the application
• Next, it sends application code (defined by JAR or Python files passed to
SparkContext) to the executors, as illustrated after this slide
• Finally, SparkContext sends tasks to the executors to run
14
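As a small illustration (assuming an existing SparkContext sc and a hypothetical helper module), Python application code can be shipped to the executors through the SparkContext:

# Distribute extra application code to the executors; the file path is a placeholder
sc.addPyFile("/path/to/helpers.py")
# JAR files can be supplied similarly when the application is submitted,
# e.g. through the spark.jars configuration property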
Components
There are several useful things to note about this architecture:
• Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads
• The driver program must listen for and accept incoming connections
from its executors throughout its lifetime. As such, the driver program
must be network addressable from the worker nodes
• Because the driver schedules tasks on the cluster, it should be run
close to the worker nodes, preferably on the same local area network
15
Spark Streaming
• Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
• Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP
sockets, and can be processed using complex algorithms expressed with
high-level functions like map, reduce, join and window
• Finally, processed data can be pushed out to filesystems
16
Spark Streaming
• Spark Streaming works by dividing the live stream of data into batches
(called micro-batches) of a pre-defined interval (N seconds) and then
treating each batch of data as an RDD
• It is important to choose the time interval for Spark Streaming based
on your use case and data processing requirements
• If the value of N is too low, the micro-batches will not contain
enough data to give meaningful results during analysis
17
Spark Streaming
Figure: How Spark Streaming works
18
Spark Streaming
• Spark Streaming receives live input data streams and divides the data
into batches, which are then processed by the Spark engine to
generate the final stream of results in batches
• Spark Streaming provides a high-level abstraction called discretized
stream or DStream, which represents a continuous stream of data.
Internally, a DStream is represented as a sequence of RDDs
19
Discretized Streams (DStreams)
• A DStream represents a continuous stream of data, either the input data
stream received from a source or the processed data stream generated
by transforming the input stream
• Internally, a DStream is represented by a continuous series of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset
• Each RDD in a DStream contains data from a certain interval; a word-count
sketch follows this slide
20
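A minimal DStream word-count sketch in Python, assuming a 5-second batch interval and a plain TCP text source on localhost:9999 (both are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-word-count")
ssc = StreamingContext(sc, 5)                      # N = 5 second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # one DStream = a sequence of RDDs
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # prints each batch's (word, count) RDD

ssc.start()
ssc.awaitTermination()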
Spark runtime components
21
Figure 1: Spark runtime components in cluster deploy mode. Elements of a Spark application are in blue
boxes and an application’s tasks running inside task slots are labeled with a “T”. Unoccupied task slots
are in white boxes.
Responsibilities of the client process
component
• The client process starts the driver program
• For example, the client process can be a spark-submit script for
running applications, a spark-shell script, or a custom application
using Spark API
• The client process prepares the class path and all configuration
options for the Spark application
• It also passes application arguments, if any, to the application running
inside the driver
22
Responsibilities of the driver component
• The driver orchestrates and monitors execution of a Spark application
• There’s always one driver per Spark application
• The Spark context and scheduler are responsible for:
• Requesting memory and CPU resources from cluster managers
• Breaking application logic into stages and tasks
• Sending tasks to executors
• Collecting the results
23
Responsibilities of the driver component
24
Figure 2: Spark runtime components in client deploy mode. The driver is running inside the client’s
JVM process.
Responsibilities of the driver component
Two basic ways the driver program can be run are:
• Cluster deploy mode is depicted in figure 1. In this mode, the driver
process runs as a separate JVM process inside a cluster, and the
cluster manages its resources
• Client deploy mode is depicted in figure 2. In this mode, the driver’s
running inside the client’s JVM process and communicates with the
executors managed by the cluster
25
Responsibilities of the executors
• The executors, which are JVM processes, accept tasks from the driver,
execute those tasks, and return the results to the driver
• Each executor has several task slots (or CPU cores) for running tasks in
parallel
• Although these task slots are often referred to as CPU cores in Spark,
they are implemented as threads and don’t need to correspond to the
number of physical CPU cores on the machine; see the configuration
example after this slide
26
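As an illustration, the number of task slots per executor can be influenced through configuration; the property names below are standard Spark settings, but the values are arbitrary examples:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("slots-example")
        .set("spark.executor.cores", "4")      # roughly four task slots per executor
        .set("spark.executor.memory", "2g"))   # memory available to each executor
sc = SparkContext(conf=conf)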
Creation of the Spark context
• Once the driver’s started, it configures an instance of SparkContext
• When running a standalone Spark application by submitting a jar file,
or by using Spark API from another program, your Spark application
starts and configures the Spark context
• There can be only one Spark context per JVM
27
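Because only one Spark context can be active per JVM, a common pattern (sketched here with PySpark's getOrCreate, assuming it is available in the version in use) is to reuse an existing context rather than constructing a second one:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("driver-app")
sc = SparkContext.getOrCreate(conf)   # creates the context on first use
sc2 = SparkContext.getOrCreate()      # returns the same, already-active context
assert sc is sc2                      # only one SparkContext per JVM/driver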
High-level architecture
• Spark provides a well-defined and layered architecture where all its
layers and components are loosely coupled, and integration with
external components/libraries/extensions is performed using
well-defined contracts
28
High-level architecture
• Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These
nodes collectively represent the total capacity of the cluster with respect to CPU, memory, and data storage.
• Data storage layer: This layer provides the APIs to store and retrieve data from the persistent storage area for Spark
jobs/applications. It is used by Spark workers to dump data to persistent storage whenever the cluster
memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold
the data, are agnostic to the underlying storage layer and can persist the data in various persistent storage areas, such as
local filesystems, HDFS, or other stores such as HBase, Cassandra, MongoDB, S3, and Elasticsearch.
• Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated
applications. Spark applications can leverage cluster managers such as YARN and Mesos for the allocation and deallocation
of various physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the
APIs used to request the allocation and deallocation of available resources across the cluster.
• Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of
Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a
wide variety of applications and languages.
• Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the
Spark core APIs to support different use cases. For example, Spark SQL is one such extension, developed to
perform ad hoc queries and interactive analysis over large datasets.
29
Spark execution model – master worker view
31
Spark execution model – master worker view
• Spark is built around the concepts of Resilient Distributed Datasets
and the Directed Acyclic Graph (DAG) representing transformations and
the dependencies between them
32
Spark execution model – master worker view
• A Spark Application (often referred to as the Driver Program or Application
Master) at a high level consists of the SparkContext and user code, which
interacts with it by creating RDDs and performing a series of
transformations to achieve the final result
• These RDD transformations are then translated into a DAG and
submitted to the Scheduler to be executed on a set of worker nodes
33
Execution workflow
• User code containing RDD transformations forms a Directed Acyclic Graph,
which is then split into stages of tasks by the DAGScheduler
• Tasks run on the workers, and the results are then returned to the client;
the lineage sketch after this slide shows the DAG the scheduler works from
34
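As a rough illustration, the lineage that the DAGScheduler turns into stages can be inspected with toDebugString; the input path is a placeholder, and reduceByKey marks the shuffle boundary that forces a new stage:

pairs = (sc.textFile("/tmp/input.txt")
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b))   # shuffle here => stage boundary

# Prints the chain of parent RDDs the DAGScheduler uses to build stages
# (some PySpark versions return this string as bytes)
print(pairs.toDebugString())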
Execution workflow
37
Execution workflow
• SparkContext
• represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast
variables on that cluster
• DAGScheduler
• computes a DAG of stages for each job and submits them to the TaskScheduler
• determines preferred locations for tasks (based on cache status or shuffle file locations) and finds a minimum
schedule to run the jobs
• TaskScheduler
• responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating
stragglers
• SchedulerBackend
• backend interface for scheduling systems that allows plugging in different implementations (Mesos, YARN,
Standalone, local)
• BlockManager
• provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory,
disk, and off-heap)
38
Reference
• Data Stream Management Systems: Apache Spark Streaming
• http://freecontent.manning.com/running-spark-an-overview-of-sparks-runtime-architecture/
• https://www.packtpub.com/books/content/spark-%E2%80%93-architecture-and-first-program
• https://0x0fff.com/spark-architecture/
• http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
• https://github.com/apache/spark
• http://spark.apache.org/docs/latest/
• https://github.com/JerryLead/SparkInternals
39
THANKS !
40
  • 36. Reference • Data Stream Management Systems: Apache Spark Streaming • http://freecontent.manning.com/running-spark-an-overview-of-sparks-runtime- architecture/ • https://www.packtpub.com/books/content/spark-%E2%80%93-architecture-and- first-program • https://0x0fff.com/spark-architecture/ • http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/ • https://github.com/apache/spark • http://spark.apache.org/docs/latest/ • https://github.com/JerryLead/SparkInternals 39