APACHE SPARK
1 INTRODUCTION
Industries are using Hadoop extensively to analyze their data sets. The
reason is that Hadoop framework is based on a simple programming model
(MapReduce) and it enables a computing solution that is scalable, flexible, fault-
tolerant and cost effective. Here, the main concern is to maintain speed in
processing large datasets in terms of waiting time between queries and waiting time
to run the program. Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing. Contrary to a common belief, Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark. Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.
Apache Spark is an open source big data processing framework
built around speed, ease of use, and sophisticated analytics. It was originally
developed in 2009 in UC Berkeley's AMPLab and open sourced in 2010; it later became a top-level Apache project. Spark has several advantages compared to other big data and MapReduce technologies such as Hadoop and Storm. First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data, etc.) as
well as the source of data (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators.
And you can use it interactively to query data within the shell. In addition to Map
and Reduce operations, it supports SQL queries, streaming data, machine learning
and graph data processing. Developers can use these capabilities stand-alone or
combine them to run in a single data pipeline use case. Apache Spark provides
programmers with an application programming interface centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of data
items distributed over a cluster of machines, that is maintained in a fault-tolerant
way. It was developed in response to limitations in the MapReduce cluster
computing paradigm, which forces a particular linear dataflow structure on
distributed programs: MapReduce programs read input data from disk, map a
function across the data, reduce the results of the map, and store reduction results
on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. The availability of RDDs facilitates the implementation of both iterative algorithms, which visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated
database-style querying of data. The latency of such applications (compared to a
MapReduce implementation, as was common in Apache Hadoop stacks) may be
reduced by several orders of magnitude. Among the class of iterative algorithms are
the training algorithms for machine learning systems, which formed the initial
impetus for developing Apache Spark. Apache Spark requires a cluster manager and
a distributed storage system. For cluster management, Spark supports standalone
(native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage,
Spark can interface with a wide variety, including Hadoop Distributed File System
(HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3,
Kudu, or a custom solution can be implemented. Spark also supports a pseudo-
distributed local mode, usually used only for development or testing purposes,
where distributed storage is not required and the local file system can be used
instead; in such a scenario, Spark is run on a single machine with one executor per
CPU core.
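As a minimal sketch of this programming model, the classic word count can be written in a few lines of Scala against the RDD API; the application name, master URL, and input path below are illustrative placeholders rather than anything prescribed by Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] is used here only for illustration; on a real cluster this
    // would point at a standalone master, YARN, or Mesos.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("input.txt")          // hypothetical input path
      .flatMap(line => line.split("\\s+"))         // transformation
      .map(word => (word, 1))                      // transformation
      .reduceByKey(_ + _)                          // wide transformation (shuffle)

    counts.take(10).foreach(println)               // action: triggers the job
    sc.stop()
  }
}
```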
2 EVOLUTION OF APACHE SPARK
Spark began as one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
3 KEY FEATURES
Apache Spark has the following features.
1. Speed: Spark helps run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is
possible because it reduces the number of read/write operations to disk and stores the intermediate processing data in memory.
2. Supports multiple languages: Spark provides built-in APIs in Java, Scala,
or Python. Therefore, you can write applications in different languages.
3. Advanced Analytics: Spark supports not only Map and Reduce operations. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Spark takes MapReduce to the next level with less expensive shuffles in the data
processing. With capabilities like in-memory data storage and near real-time
processing, the performance can be several times faster than other big data
technologies. Spark also supports lazy evaluation of big data queries, which helps
with optimization of the steps in data processing workflows. It provides a higher-level API to improve developer productivity and a consistent architecture model for big data solutions. Spark holds intermediate results in memory rather than writing them to disk, which is very useful especially when you need to work on the same dataset multiple times. It's designed to be an execution engine that works both in-memory and on-disk. Spark operators perform external operations when data does not fit in memory, so Spark can be used for processing datasets larger than the aggregate memory in a cluster. Spark will attempt to store as much data in memory as possible and will then spill to disk. It can store part of a data set in memory and the remaining data on disk. You have to look at your data and use cases to assess the memory
requirements. With this in-memory data storage, Spark comes with a performance advantage. Other Spark features include:
 Supports more than just Map and Reduce functions.
 Optimizes arbitrary operator graphs.
 Lazy evaluation of big data queries which helps with the optimization of
the overall data processing workflow.
 Provides concise and consistent APIs in Scala, Java and Python.
 Offers interactive shell for Scala and Python. This is not available in Java
yet.
Spark is written in the Scala programming language and runs in a Java Virtual Machine (JVM) environment. It currently supports the following languages for developing applications using Spark (a brief interactive-shell sketch follows the list):
1. Scala
2. Java
3. Python
4. R
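For instance, in the Scala shell started with bin/spark-shell (or the Python shell started with bin/pyspark), a SparkContext named sc is already created, so data can be queried interactively; the log file in this hypothetical session is just an example.

```scala
// Inside bin/spark-shell; `sc` is pre-created by the shell.
val logs   = sc.textFile("server.log")           // hypothetical input file
val errors = logs.filter(_.contains("ERROR"))    // transformation: nothing runs yet
errors.count()                                   // action: returns the number of matching lines
```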
4 DETAILED DESCRIPTION
Apache Spark consists of several main components, including Spark Core and the upper-level libraries. Spark Core runs on different cluster managers and can access data in any Hadoop data source. In addition, many packages have been built to work with Spark Core and the upper-level libraries.
4.1. Spark Built on Hadoop
The following diagram shows three ways of how Spark can be built with
Hadoop components.
Fig.1 Spark built with Hadoop components
There are three ways of Spark deployment, as explained below (a minimal configuration sketch follows the list).
 Standalone: Spark Standalone deployment means Spark occupies the
place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
 Hadoop Yarn: Hadoop Yarn deployment means, simply, spark runs on
Yarn without any pre-installation or root access required. It helps to
integrate Spark into the Hadoop ecosystem or Hadoop stack. It allows other components to run on top of the stack.
 Spark in MapReduce (SIMR): Spark in MapReduce is used to launch
a Spark job in addition to a standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
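As a rough sketch, the deployment mode is selected through the master URL handed to Spark, either in code via SparkConf or on the command line via spark-submit --master; the host name below is illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The same application code runs unchanged; only the master URL changes.
val conf = new SparkConf()
  .setAppName("DeploymentModes")
  .setMaster("spark://master-host:7077")  // standalone Spark cluster
  // .setMaster("yarn")                   // run on Hadoop YARN
  // .setMaster("local[*]")               // local mode for development/testing

val sc = new SparkContext(conf)
```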
5 COMPONENTS OF SPARK
Fig.2 Components of spark
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL
It enables users to run SQL/HQL queries on top of Spark. Using Apache Spark SQL, we can process structured as well as semi-structured data. It also
provides an engine for Hive to run unmodified queries up to 100 times faster
on existing deployments.
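A minimal Spark SQL sketch, assuming the Spark 2.x SparkSession API; the JSON file and its columns are hypothetical example data.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .getOrCreate()

val people = spark.read.json("people.json")   // DataFrame with an inferred schema
people.createOrReplaceTempView("people")      // register as a SQL-queryable view

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```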
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD
(Resilient Distributed Datasets) transformations on those mini-batches of
data.
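A small sketch of this mini-batch model; the socket source on localhost:9999 and the 5-second batch interval are purely illustrative choices (the source could, for example, be fed with `nc -lk 9999`).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
// Each 5-second mini-batch becomes an RDD that ordinary transformations run on.
val ssc = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```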
MLlib (Machine Learning Library)
It is a scalable machine learning library that delivers both efficiency and high-quality algorithms. Apache Spark MLlib is one of the most popular choices for data scientists due to its capability of in-memory data processing, which improves the performance of iterative algorithms drastically.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It is the graph computation engine that enables users to process graph data at scale.
6 SPARK ARCHITECTURE
Apache Spark follows a master/slave architecture with two main daemons
and a cluster manager –
i. Master Daemon – (Master/Driver Process)
ii. Worker Daemon –(Slave Process)
Fig 3 Architecture
A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run their individual Java processes, and users can run them on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.
Fig 4 Execution of spark
Role of Driver in Spark Architecture
Spark Driver – Master Node of a Spark Application
It is the central point and the entry point of the Spark Shell (Scala, Python, and R).
The driver program runs the main() function of the application and is the place where the SparkContext is created. The Spark Driver contains various components –
DAGScheduler, TaskScheduler, BackendScheduler and BlockManager – responsible for translating the Spark user code into actual Spark jobs executed on the cluster.
 The driver program that runs on the master node of the spark cluster
schedules the job execution and negotiates with the cluster manager.
 It translates the RDD s into the execution graph and splits the graph into
multiple stages.
 Driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
Role of Executor in Spark Architecture
Executor is a distributed agent responsible for the execution of tasks. Every Spark application has its own executor processes. Executors usually run for the entire lifetime of a Spark application, a phenomenon known as static allocation of executors. However, users can also opt for dynamic allocation of executors, wherein they can add or remove Spark executors dynamically to match the overall workload.
 Executor performs all the data processing.
 Reads from and Writes data to external sources.
 Executor stores the computation results data in-memory, cache or on hard
disk drives.
 Interacts with the storage systems.
Role of Cluster Manager in Spark Architecture
Apache Spark is an engine for big data processing. One can run Spark in distributed mode on a cluster, where there is a master and any number of workers. The cluster manager schedules and divides the resources of the host machines that form the cluster. Its prime work is to divide resources across applications: it works as an external service for acquiring resources on the cluster and dispatches work to the cluster. Spark supports pluggable cluster management, and the cluster manager handles starting the executor processes.
7 RESILIENT DISTRIBUTED DATASETS
RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on the different nodes of the cluster. Each dataset in a Spark RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster.
Decomposing the name RDD:
 Resilient, i.e. fault-tolerant with the help of RDD lineage graph (DAG) and so
able to recompute missing or damaged partitions due to node failures.
 Distributed, since Data resides on multiple nodes.
 Dataset represents records of the data you work with. The user can load the
data set externally, for example from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
Hence, each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are also fault tolerant, i.e., they possess self-recovery in the case of failure. Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations; how data sharing works in plain MapReduce, and why it is not so efficient, is discussed in Section 7.4 below.
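For illustration (the paths are hypothetical, and an existing SparkContext named sc is assumed), an RDD can be created either from an in-memory collection or from external storage:

```scala
// From an existing in-memory collection, split into 4 partitions:
val numbers = sc.parallelize(1 to 1000, 4)

// From an external source; each line of the (hypothetical) file becomes one element:
val lines = sc.textFile("hdfs:///data/input.txt")

// Partitions are what Spark distributes across the nodes of the cluster:
println(numbers.getNumPartitions)
```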
7.1 Features of RDD in Spark
Several features of Apache Spark RDD are:
In-memory Computation
Spark RDDs have a provision of in-memory computation. It stores
intermediate results in distributed memory (RAM) instead of stable storage
(disk).
Lazy Evaluations
All transformations in Apache Spark are lazy, in that they do not
compute their results right away. Instead, they just remember the
transformations applied to some base data set.
Spark computes the transformations only when an action requires a result for the driver program.
Fault Tolerance
Spark RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure: each RDD remembers how it was created from other datasets (by transformations like map, join or groupBy) and can use that lineage to recreate itself.
Immutability
Data is safe to share across processes. It can also be created or
retrieved anytime which makes caching, sharing & replication easy. Thus, it is a
way to reach consistency in computations.
Partitioning
Partitioning is the fundamental unit of parallelism in Spark RDD.
Each partition is one logical division of data which is mutable. One can create a
partition through some transformations on existing partitions.
Persistence
Users can state which RDDs they will reuse and choose a storage
strategy for them (e.g., in-memory storage or on Disk).
7.2 Spark RDD Operations
RDD in Apache Spark supports two types of operations:
 Transformation
 Actions
Transformations
Spark RDD Transformations are functions that take an RDD as the input and
produce one or many RDDs as the output. They do not change the input RDD
(since RDDs are immutable and hence one cannot change it), but always produce
one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey(), etc.
Transformations are lazy operations on an RDD in Apache Spark: they create one or many new RDDs, which are executed only when an action occurs. Hence, a transformation creates a new dataset from an existing one. Certain transformations can be pipelined, which is an optimization method that Spark uses to improve the performance of computations. There are two kinds of transformations: narrow transformations and wide transformations.
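A brief sketch of typical transformations, assuming an existing RDD[String] named lines; the first three are narrow, while reduceByKey is wide because it requires a shuffle.

```scala
val words    = lines.flatMap(_.split(" "))     // narrow: no data movement between partitions
val nonEmpty = words.filter(_.nonEmpty)        // narrow
val pairs    = nonEmpty.map(word => (word, 1)) // narrow
val counts   = pairs.reduceByKey(_ + _)        // wide: shuffles data across partitions

// Only the lineage has been built so far; no job has actually run yet.
```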
Actions
An action in Spark returns the final result of RDD computations. It triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final result to the driver program or write
it out to the file system. The lineage graph is the dependency graph of all the RDDs that a given RDD depends on.
Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. An action is one of the ways to send a result from the executors to the driver; first(), take(), reduce(), collect() and count() are some of the actions in Spark. Using transformations, one can create an RDD from an existing one, but when we want to work with the actual dataset, we use an action. When an action occurs it does not create a new RDD, unlike a transformation. Thus, actions are RDD operations that return non-RDD values; an action stores its result either in the driver or in an external storage system, and it sets the laziness of RDDs into motion.
7.3 Lazy Evaluation
Lazy evaluation in Spark means that execution will not start until an action is triggered. In Spark, lazy evaluation comes into the picture when Spark transformations occur. Transformations are lazy in nature, meaning that when we call some operation on an RDD, it does not execute immediately; Spark only maintains a record of which operation has been called. We can think of a Spark RDD as the data that we build up through transformations. Since transformations are lazy, we can execute an operation at any time by calling an action on the data. Hence, with lazy evaluation, data is not loaded until it is necessary.
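A small sketch of lazy evaluation in practice (the file name is hypothetical and sc is an existing SparkContext): nothing is read or computed until the final action.

```scala
val events   = sc.textFile("events.log")            // nothing is read yet
val errors   = events.filter(_.contains("ERROR"))   // still nothing executed
val prefixes = errors.map(_.take(19))               // only the lineage is being recorded

// This action finally forces Spark to load the file and run the whole chain:
val firstTen = prefixes.take(10)
```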
7.4 Data Sharing Using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk IO. In most Hadoop applications, more than 90% of the time is spent doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing: it stores the state of memory as an object across jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than over the network or from disk.
Let us now try to find out how iterative and interactive operations take place in
Spark RDD.
7.5 Iterative Operations on Spark RDD
The illustration given below shows the iterative operations on Spark RDD.
It will store intermediate results in a distributed memory instead of Stable
storage (Disk) and make the system faster.
Note: If the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), then it will store those results on disk.
Fig 5: Iterative operations on Spark RDD
7.6 Interactive Operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different
queries are run on the same set of data repeatedly, this particular data can be
kept in memory for better execution times.
Fig 6 Interactive operations on Spark RDD
By default, each transformed RDD may be recomputed each time you run an
action on it. However, you may also persist an RDD in memory, in which case
Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
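A persistence sketch under the same assumptions (hypothetical file, existing SparkContext sc): the first action computes and caches the RDD, and later actions reuse the cached partitions.

```scala
import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("access.log")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if it does not fit in memory
// errors.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

val total    = errors.count()                                   // computes and caches
val fromHost = errors.filter(_.contains("10.0.0.42")).count()   // served from the cache
```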
8 SPARK CONTEXT
SparkContext is the entry gate of Apache Spark functionality. The most important step of any Spark driver application is to generate a SparkContext. It allows your Spark application to access the Spark cluster with the help of the resource manager, and it interacts with the cluster manager.
Fig 7 Spark context
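A minimal sketch of creating a SparkContext in a driver program (the application name and master URL are illustrative); once created, it is the handle through which RDDs, broadcast variables and accumulators are made.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyDriverApp")   // illustrative application name
  .setMaster("local[*]")       // illustrative master URL; a cluster manager URL in production

val sc = new SparkContext(conf)  // the entry gate: one per driver application

val data    = sc.parallelize(Seq(1, 2, 3))       // RDDs are created through the context
val factor  = sc.broadcast(10)                   // so are broadcast variables...
val counter = sc.longAccumulator("processed")    // ...and accumulators

sc.stop()  // releases executors and resources held via the cluster manager
```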
9 SPARK EXECUTION
When a client submits Spark application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, the driver program also performs certain optimizations, like pipelining transformations, and then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. The tasks are then bundled to be sent to the Spark cluster.
The driver program then talks to the cluster manager and negotiates for
resources. The cluster manager then launches executors on the worker nodes on
behalf of the driver. At this point, the driver sends tasks to the executors based on data placement. Before the executors begin execution, they register themselves with the driver program so that the driver has a holistic view of all the executors. The executors then start executing the various tasks assigned by the driver program. At any point of time when the Spark application is running, the driver program will monitor the set of executors that run. The driver program in the Spark architecture also schedules future tasks based on data placement by tracking the location of cached data. When the driver program's main() method exits or when it calls the stop() method of the SparkContext, it will terminate all the executors and release the resources from the cluster manager.
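One hedged way to peek at the plan described above is RDD.toDebugString, which prints the lineage with its shuffle (stage) boundaries; the job itself runs only when the action is submitted. The input path here is hypothetical.

```scala
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)          // shuffle dependency => a new stage in the physical plan

println(counts.toDebugString)  // shows the lineage; indentation marks stage boundaries
counts.collect()               // submitting this action turns the stages into tasks
```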
10 APACHE SPARK USE CASES
The creators of Apache Spark conducted a survey on why companies should use an in-memory computing framework like Apache Spark, and the results are overwhelming:
91% use Apache Spark because of its performance gains.
77% use Apache Spark as it is easy to use.
71% use Apache Spark due to the ease of deployment.
64% use Apache Spark to leverage advanced analytics
52% use Apache Spark for real-time streaming.
Apache Spark at Alibaba
One of the world's largest e-commerce platforms, Alibaba Taobao, runs some of the largest Apache Spark jobs in the world in order to analyse hundreds of petabytes of data on its e-commerce platform. Some of the Spark jobs, which perform feature extraction on image data, run for several weeks. Millions of merchants and users interact with Alibaba Taobao's e-commerce platform. Each of these interactions is represented as a complicated, large graph, and Apache Spark is used for fast processing of sophisticated machine learning on this data.
Apache Spark at eBay
eBay uses Apache Spark to provide targeted offers, enhance customer experience, and optimize overall performance. Apache Spark is leveraged at eBay through Hadoop YARN, which manages all the cluster resources to run the generic tasks. eBay's Spark users leverage Hadoop clusters in the range of 2000 nodes, 20,000 cores, and 100 TB of RAM through YARN.
Apache Spark at MyFitnessPal
The largest health and fitness community, MyFitnessPal, helps people achieve a healthy lifestyle through better diet and exercise. MyFitnessPal uses Apache Spark to clean the data entered by users, with the end goal of identifying high-quality food items. Using Spark, MyFitnessPal has been able to scan through the food calorie data of about 80 million users. Earlier, MyFitnessPal used Hadoop to
process 2.5TB of data and that took several days to identify any errors or missing
information in it.
Apache Spark at Yahoo for News Personalization
Yahoo uses Apache Spark for personalizing its news web pages and for
targeted advertising. It uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and to categorize the news stories to find out what kind of users would be interested in reading each category of news.
Earlier, the machine learning algorithm for news personalization required 15,000 lines of C++ code; now, with Spark and Scala, it takes just 120 lines of Scala code. The algorithm was ready for production use after just 30 minutes of training on a hundred million datasets.
Apache Spark at Conviva
The largest streaming video company, Conviva, uses Apache Spark to deliver quality of service to its customers by removing screen buffering and learning in detail about network conditions in real time. This information is stored in the video player to manage live video traffic coming from close to 4 billion video feeds every month and to ensure maximum play-through. Apache Spark is helping Conviva reduce its customer churn to a great extent by providing its customers with a smooth video viewing experience.
Apache Spark at Netflix
Netflix uses Apache Spark for real-time stream processing to provide online
recommendations to its customers. Streaming devices at Netflix send events that capture all member activities and play a vital role in personalization. Netflix processes 450 billion events per day, which flow to server-side applications and are directed to Apache Kafka.
Apache Spark at Pinterest
Pinterest is using Apache Spark to discover trends in high-value user engagement data so that it can react to developing trends in real time by getting an in-depth understanding of user behaviour on the website.
Apache Spark at TripAdvisor
TripAdvisor, a leading travel website that helps users plan a perfect trip, is using Apache Spark to speed up its personalized customer recommendations. TripAdvisor uses Apache Spark to provide advice to millions of travellers by comparing hundreds of websites to find the best hotel prices for its customers. Reading and processing hotel reviews into a readable format is also done with the help of Apache Spark.
11 DISADVANTAGES
Near real time processing
In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like map, reduce, and join, and the results of these operations are returned in batches. Thus, Spark Streaming is not true real-time processing but near real-time processing of live data: micro-batch processing takes place in Spark Streaming.
No file management system
Apache Spark does not have its own file management system, so it relies on some other platform like Hadoop or another cloud-based platform, which is one of the known Spark issues.
Expensive
The in-memory capability can become a bottleneck when we want cost-efficient processing of big data, as keeping data in memory is quite expensive: the memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark requires lots of RAM to run in-memory, so the cost of running Spark is quite high.
Although Spark has many drawbacks, it is still popular in the market as a big data solution. However, there are various technologies overtaking Spark in specific areas; for example, stream processing is often better with Flink than with Spark, as Flink offers true real-time processing.
12 APACHE SPARK VERSIONS
Version    Original release date    Latest version    Release date
0.5 2012-06-12 0.5.1 2012-10-07
0.6 2012-10-14 0.6.2 2013-02-07
0.7 2013-02-27 0.7.3 2013-07-16
0.8 2013-09-25 0.8.1 2013-12-19
0.9 2014-02-02 0.9.2 2014-07-23
1.0 2014-05-30 1.0.2 2014-08-05
1.1 2014-09-11 1.1.1 2014-11-26
1.2 2014-12-18 1.2.2 2015-04-17
1.3 2015-03-13 1.3.1 2015-04-17
1.4 2015-06-11 1.4.1 2015-07-15
1.5 2015-09-09 1.5.2 2015-11-09
1.6 2016-01-04 1.6.3 2016-11-07
2.0 2016-07-26 2.0.2 2016-11-14
2.1 2016-12-28 2.1.1 2017-05-02
2.2 2017-07-11 2.2.0 2017-07-11
13 CONCLUSION AND FUTURE SCOPE
Apache Spark is opening up various opportunities for big data analysis and making it easier for organizations to solve different kinds of big data problems. Spark is one of the hottest technologies now, not only among data engineers; the majority of data scientists also highly prefer to work with Spark. Apache Spark is a fascinating platform for data scientists, with use cases spanning exploratory and operational analytics.
Apache Spark has witnessed a continuous upward trajectory in the big data ecosystem. With IBM's recent announcement that it will educate more than one million data engineers and data scientists on Apache Spark, 2016 is unquestionably the year to learn Spark and pursue a profitable career.
More Related Content

What's hot

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated ArchitectureDatabricks
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at FacebookDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet FormatYue Chen
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 

What's hot (20)

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 

Similar to Apache spark

Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache SparkEdureka!
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
Apachespark 160612140708
Apachespark 160612140708Apachespark 160612140708
Apachespark 160612140708Srikrishna k
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureJen Stirrup
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 

Similar to Apache spark (20)

SparkPaper
SparkPaperSparkPaper
SparkPaper
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Apachespark 160612140708
Apachespark 160612140708Apachespark 160612140708
Apachespark 160612140708
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Module01
 Module01 Module01
Module01
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
Apache Spark Notes
Apache Spark NotesApache Spark Notes
Apache Spark Notes
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
Why Spark over Hadoop?
Why Spark over Hadoop?Why Spark over Hadoop?
Why Spark over Hadoop?
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 

Recently uploaded

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 

Recently uploaded (20)

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 

Apache spark

  • 1. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 1 1 INTRODUCTION Industries are using Hadoop extensively to analyze their data sets. The reason is that Hadoop framework is based on a simple programming model (MapReduce) and it enables a computing solution that is scalable, flexible, fault- tolerant and cost effective. Here, the main concern is to maintain speed in processing large datasets in terms of waiting time between queries and waiting time to run the program. Spark was introduced by Apache Software Foundation for speeding up the Hadoop computational computing software process. As against a common belief, Spark is not a modified version of Hadoop and is not, really, dependent on Hadoop because it has its own cluster management. Hadoop is just one of the ways to implement Spark. Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purpose only. Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley s AMPLab, and open sourced in 2010 as an Apache project. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm. First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data etc) as
  • 2. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 2 well as the source of data (batch v. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high level operators. And you can use it interactively to query data within the shell. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case. Apache Spark provides programmers with an application programming interface cantered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications (compared to a MapReduce implementation, as was common in Apache Hadoop stacks) may be reduced by several orders of magnitude. Among the class of iterative algorithms are
  • 3. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 3 the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark. Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety, including Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu, or a custom solution can be implemented. Spark also supports a pseudo- distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core. 2 EVOLUTION OF APACHE SPARK Spark is one of Hadoop s sub project developed in 2009 in UC Berkeley s AMPLab by Matei Zaharia. It was Open Sourced in 2010 under a BSD license. It was donated to Apache software foundation in 2013, and now Apache Spark has become a top level Apache project from Feb-2014. 3 KEY FEATURES Apache Spark has following features. 1. Speed: Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory, and 10 times faster when running on disk. This is
  • 4. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 4 possible by reducing number of read/write operations to disk. It stores the intermediate processing data in memory. 2. Supports multiple languages: Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. 3. Advanced Analytics: Spark not only supports Map and reduce . It also supports SQL queries, Streaming data, Machine learning (ML), and Graph algorithms. Spark takes MapReduce to the next level with less expensive shuffles in the data processing. With capabilities like in-memory data storage and near real-time processing, the performance can be several times faster than other big data technologies. Spark also supports lazy evaluation of big data queries, which helps with optimization of the steps in data processing workflows. It provides a higher level API to improve developer productivity and a consistent architect model for big data solutions. Spark holds intermediate results in memory rather than writing them to disk which is very useful especially when you need to work on the same dataset multiple times. It s designed to be an execution engine that works both in-memory and on-disk. Spark operators perform external operations when data does not fit in memory. Spark can be used for processing datasets that larger than the aggregate memory in a cluster. Spark will attempt to store as much as data in memory and then will spill to disk. It can store part of a data set in memory and the remaining data on the disk. You have to look at your data and use cases to assess the memory
  • 5. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 5 requirements. With this in-memory data storage, Spark comes with performance advantage. Other Spark features include:  Supports more than just Map and Reduce functions.  Optimizes arbitrary operator graphs.  Lazy evaluation of big data queries which helps with the optimization of the overall data processing workflow.  Provides concise and consistent APIs in Scala, Java and Python.  Offers interactive shell for Scala and Python. This is not available in Java yet. Spark is written in Scala Programming Language and runs on Java Virtual Machine (JVM) environment. It currently supports the following languages for developing applications using Spark: 1. Scala 2. Java 3. Python 4. R 4 DETAILED DESCRIPTION Apache Spark consists of several main components including Spark core and upper-level libraries Spark core runs on different cluster managers and can access data in any Hadoop data source. In addition, many packages have been built to work with Spark core and the upper-level libraries.
4.1 Spark Built on Hadoop

The following diagram shows three ways in which Spark can be built with Hadoop components.

Fig.1 Spark built with Hadoop components

There are three ways of deploying Spark, as explained below.
• Standalone: Spark standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN: Hadoop YARN deployment means, simply, that Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or the Hadoop stack and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR): Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.

5 COMPONENTS OF SPARK

Fig.2 Components of Spark

Apache Spark Core
Spark Core is the underlying general execution engine of the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

Spark SQL
Spark SQL enables users to run SQL/HQL queries on top of Spark. Using Apache Spark SQL, we can process structured as well as semi-structured data. It also provides an engine for Hive to run unmodified queries up to 100 times faster on existing deployments.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.

MLlib (Machine Learning Library)
MLlib is the scalable machine learning library which delivers both efficiency and high-quality algorithms. Apache Spark MLlib is one of the most popular choices for data scientists because of its in-memory data processing, which drastically improves the performance of iterative algorithms.

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It is the graph computation engine built on top of Spark that enables graph data to be processed at scale.

6 SPARK ARCHITECTURE

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
i. Master Daemon – (Master/Driver Process)
ii. Worker Daemon – (Slave Process)
Fig 3 Architecture

A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on the same machines (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.

Fig 4 Execution of Spark

Role of Driver in Spark Architecture
Spark Driver – Master Node of a Spark Application
The driver is the central point and the entry point of the Spark shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the SparkContext is created. The Spark driver contains various components – DAGScheduler, TaskScheduler, BackendScheduler and BlockManager – responsible for translating Spark user code into actual Spark jobs executed on the cluster.
• The driver program, which runs on the master node of the Spark cluster, schedules the job execution and negotiates with the cluster manager.
• It translates the RDDs into the execution graph and splits the graph into multiple stages.
• The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.

Role of Executor in Spark Architecture
An executor is a distributed agent responsible for the execution of tasks. Every Spark application has its own executor processes. Executors usually run for the entire lifetime of a Spark application, a behaviour known as "static allocation of executors". However, users can also opt for dynamic allocation of executors, in which Spark executors are added or removed dynamically to match the overall workload.
• The executor performs all the data processing.
• It reads data from and writes data to external sources.
• The executor stores the computation results in memory, in a cache, or on hard disk drives.
• It interacts with the storage systems.
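As a rough sketch of how executor resources can be configured, the snippet below sets the per-executor memory and cores through SparkConf and opts in to dynamic allocation. The application name is hypothetical, and depending on the cluster manager an external shuffle service may also need to be enabled for dynamic allocation to work.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ExecutorSizingDemo")                  // hypothetical application name
      .set("spark.executor.memory", "4g")                // memory per executor
      .set("spark.executor.cores", "2")                  // cores per executor
      .set("spark.dynamicAllocation.enabled", "true")    // add/remove executors with the workload
    val sc = new SparkContext(conf)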
Role of Cluster Manager in Spark Architecture
Apache Spark is an engine for big data processing. One can run Spark in distributed mode on a cluster, where there is a master and n number of workers. The cluster manager schedules and divides the resources of the host machines that form the cluster. Its prime work is to divide resources across applications; it works as an external service for acquiring resources on the cluster and dispatches work for the cluster. Spark supports pluggable cluster management, and the cluster manager in Spark handles starting the executor processes.

7 RESILIENT DISTRIBUTED DATASETS

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects computed on the different nodes of the cluster. Each dataset in a Spark RDD is logically partitioned across many servers so that the partitions can be computed on different nodes of the cluster.

Decomposing the name RDD:
• Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions caused by node failures.
• Distributed, since the data resides on multiple nodes.
• Dataset, representing the records of the data you work with. The user can load the dataset externally from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.

Hence, each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they possess self-recovery in the case of failure. Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations; Section 7.4 discusses how MapReduce operations take place and why they are not so efficient.

7.1 Features of RDD in Spark

Several features of Apache Spark RDDs are:

In-memory Computation
Spark RDDs have a provision for in-memory computation. They store intermediate results in distributed memory (RAM) instead of stable storage (disk).

Lazy Evaluation
All transformations in Apache Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. Spark computes the transformations only when an action requires a result for the driver program.
Fault Tolerance
Spark RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure. Using lineage, each RDD remembers how it was created from other datasets (by transformations such as map, join, or groupBy) and can recreate itself.

Immutability
Data is safe to share across processes. It can also be created or retrieved at any time, which makes caching, sharing and replication easy. Thus, immutability is a way to reach consistency in computations.

Partitioning
Partitioning is the fundamental unit of parallelism in a Spark RDD. Each partition is one logical, immutable division of the data; new partitions are created by applying transformations to existing partitions.

Persistence
Users can state which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on disk).

7.2 Spark RDD Operations

An RDD in Apache Spark supports two types of operations:
• Transformations
• Actions
Transformations
Spark RDD transformations are functions that take an RDD as the input and produce one or more RDDs as the output. They do not change the input RDD (since RDDs are immutable and hence cannot be changed), but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey(), etc. Transformations are lazy operations on an RDD: they create one or more new RDDs, which are executed only when an action occurs. Hence, a transformation creates a new dataset from an existing one. Certain transformations can be pipelined, which is an optimization method Spark uses to improve the performance of computations. There are two kinds of transformations: narrow transformations and wide transformations.

Actions
An action in Spark returns the final result of the RDD computations. It triggers execution using the lineage graph to load the data into the original RDD, carry out all the intermediate transformations, and return the final results to the driver program or write
them out to the file system. The lineage graph is the dependency graph of all the parent RDDs of an RDD. Actions are RDD operations that produce non-RDD values; they materialize a value in a Spark program. An action is one of the ways to send results from the executors to the driver. first(), take(), reduce(), collect(), and count() are some of the actions in Spark.

Using transformations, one can create an RDD from an existing one, but when we want to work with the actual dataset, we use an action. When an action occurs it does not create a new RDD, unlike a transformation. Thus, actions are RDD operations that give non-RDD values. An action stores its result either in the driver or in an external storage system, and it sets the laziness of RDDs in motion.

7.3 Lazy Evaluation

Lazy evaluation in Spark means that execution does not start until an action is triggered. In Spark, lazy evaluation comes into the picture when transformations occur. Transformations are lazy in nature: when we call some operation on an RDD, it does not execute immediately; Spark only maintains a record of the operations being called. We can think of a Spark RDD as the data that we built up through transformations. Since transformations are lazy, we can execute the operations at any time by calling an action on the data. Hence, in lazy evaluation, data is not loaded until it is necessary.
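A minimal sketch of this behaviour, assuming an existing SparkContext named sc: the transformations below only record lineage, and nothing is computed until the final action runs.

    val nums    = sc.parallelize(1 to 1000000)     // build an RDD from a local collection
    val squares = nums.map(n => n.toLong * n)      // transformation: only lineage is recorded
    val evens   = squares.filter(_ % 2 == 0)       // still lazy: no job has run yet
    val total   = evens.count()                    // action: triggers the actual computation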
7.4 Data Sharing Using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations. Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing: it keeps the state of memory as an object across jobs, and the object is shareable between those jobs. Data sharing in memory is 10 to 100 times faster than over the network or from disk. Let us now see how iterative and interactive operations take place on Spark RDDs.

7.5 Iterative Operations on Spark RDD

The illustration given below shows iterative operations on a Spark RDD. Spark stores the intermediate results in distributed memory instead of stable storage (disk), which makes the system faster.
Note: if the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), Spark stores those results on the disk.
Fig 5 Iterative operations on Spark RDD

7.6 Interactive Operations on Spark RDD

This illustration shows interactive operations on a Spark RDD. If different queries are run on the same set of data repeatedly, that particular data can be kept in memory for better execution times.

Fig 6 Interactive operations on Spark RDD

By default, each transformed RDD may be recomputed every time you run an action on it. However, you may also persist an RDD in memory, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
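A small sketch of this reuse pattern, assuming an existing SparkContext named sc and a hypothetical log file path: the first action materialises and caches the filtered RDD, and the second action reuses the cached partitions instead of re-reading the file.

    val logs   = sc.textFile("hdfs:///logs/access.log")                  // hypothetical input path
    val errors = logs.filter(line => line.contains("ERROR")).cache()     // mark for in-memory reuse
    val total  = errors.count()                                          // first action: computes and caches
    val byHost = errors.map(line => line.split(" ")(0)).countByValue()   // reuses the cached data

Where the data may not fit entirely in memory, persist(StorageLevel.MEMORY_AND_DISK) can be used instead of cache().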
8 SPARK CONTEXT

The SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create the SparkContext. It allows the Spark application to access the Spark cluster with the help of the resource manager, and it is the component that interacts with the cluster manager.

Fig. SparkContext

9 SPARK EXECUTION

When a client submits Spark user application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, the driver program also performs certain optimizations, such as pipelining transformations, and then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. The tasks are then bundled and sent to the Spark cluster.
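A minimal driver-program sketch illustrating these steps (the object name and input argument are hypothetical): the narrow transformations and the shuffle introduced by reduceByKey are what the DAGScheduler turns into stages, and the final action is what actually triggers the job.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordLengths {                                              // hypothetical application
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordLengths")
        val sc   = new SparkContext(conf)                             // entry gate to Spark functionality

        val words = sc.textFile(args(0)).flatMap(_.split("\\s+"))     // narrow transformations
        val byLen = words.map(w => (w.length, 1)).reduceByKey(_ + _)  // shuffle introduces a new stage
        byLen.collect().foreach(println)                              // action: DAG -> stages -> tasks

        sc.stop()                                                     // release executors and cluster resources
      }
    }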
The driver program then talks to the cluster manager and negotiates for resources. The cluster manager launches executors on the worker nodes on behalf of the driver. Before the executors begin execution, they register themselves with the driver program so that the driver has a holistic view of all the executors, and the driver sends them tasks based on data placement. The executors then start executing the various tasks assigned by the driver program. At any point while the Spark application is running, the driver program monitors the set of executors that are running. The driver program also schedules future tasks based on data placement by tracking the location of cached data. When the driver program's main() method exits, or when it calls the stop() method of the SparkContext, it terminates all the executors and releases the resources from the cluster manager.

10 APACHE SPARK USE CASES

The creators of Apache Spark conducted a survey on "Why should companies use an in-memory computing framework like Apache Spark?" and the results were overwhelming:
• 91% use Apache Spark because of its performance gains.
• 77% use Apache Spark because it is easy to use.
• 71% use Apache Spark due to the ease of deployment.
• 64% use Apache Spark to leverage advanced analytics.
• 52% use Apache Spark for real-time streaming.
Apache Spark at Alibaba
One of the world's largest e-commerce platforms, Alibaba Taobao, runs some of the largest Apache Spark jobs in the world in order to analyse hundreds of petabytes of data on its e-commerce platform. Some of the Spark jobs, which perform feature extraction on image data, run for several weeks. Millions of merchants and users interact with Alibaba Taobao's e-commerce platform; each of these interactions is represented as a complicated large graph, and Apache Spark is used for fast processing of sophisticated machine learning on this data.

Apache Spark at eBay
eBay uses Apache Spark to provide targeted offers, enhance the customer experience, and optimize overall performance. Apache Spark is leveraged at eBay through Hadoop YARN, which manages all the cluster resources to run generic tasks. eBay's Spark users leverage Hadoop clusters in the range of 2,000 nodes, 20,000 cores, and 100 TB of RAM through YARN.

Apache Spark at MyFitnessPal
The largest health and fitness community, MyFitnessPal, helps people achieve a healthy lifestyle through better diet and exercise. MyFitnessPal uses Apache Spark to clean the data entered by users with the end goal of identifying high-quality food items. Using Spark, MyFitnessPal has been able to scan through the food calorie data of about 80 million users. Earlier, MyFitnessPal used Hadoop to process 2.5 TB of data, and that took several days to identify any errors or missing information in it.
Apache Spark at Yahoo for News Personalization
Yahoo uses Apache Spark for personalizing its news web pages and for targeted advertising. It uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and to categorize news stories to find out what kind of users would be interested in reading each category of news. Earlier, the machine learning algorithm for news personalization required 15,000 lines of C++ code; now, with Spark and Scala, the algorithm has just 120 lines of Scala code. The algorithm was ready for production use after just 30 minutes of training on a hundred million datasets.

Apache Spark at Conviva
The largest streaming video company, Conviva, uses Apache Spark to deliver quality of service to its customers by removing screen buffering and learning in detail about network conditions in real time. This information is stored in the video player to manage live video traffic coming from close to 4 billion video feeds every month, to ensure maximum play-through. Apache Spark is helping Conviva reduce its customer churn to a great extent by providing its customers with a smooth video viewing experience.
Apache Spark at Netflix
Netflix uses Apache Spark for real-time stream processing to provide online recommendations to its customers. Streaming devices at Netflix send events which capture all member activities and play a vital role in personalization. Netflix processes 450 billion events per day, which flow to server-side applications and are directed to Apache Kafka.

Apache Spark at Pinterest
Pinterest is using Apache Spark to discover trends in high-value user engagement data so that it can react to developing trends in real time by getting an in-depth understanding of user behaviour on the website.

Apache Spark at TripAdvisor
TripAdvisor, a leading travel website that helps users plan a perfect trip, is using Apache Spark to speed up its personalized customer recommendations. TripAdvisor uses Apache Spark to provide advice to millions of travellers by comparing hundreds of websites to find the best hotel prices for its customers. Reading and processing hotel reviews into a readable format is also done with the help of Apache Spark.
11 DISADVANTAGES

Near real-time processing
In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like map, reduce, and join, and the results of these operations are returned in batches. Thus, Spark does not provide true real-time processing; it is near real-time processing of live data, because micro-batch processing takes place in Spark Streaming. (A brief sketch of this micro-batch model appears at the end of this section.)

No file management system
Apache Spark does not have its own file management system, so it relies on some other platform like Hadoop or another cloud-based platform, which is one of the known issues with Spark.

Expensive
The in-memory capability can become a bottleneck when we want cost-efficient processing of big data, because keeping data in memory is quite expensive: the memory consumption is very high and is not handled in a user-friendly manner. Apache Spark requires a lot of RAM to run in-memory, so the cost of Spark is quite high.

Although Spark has these drawbacks, it is still popular in the market for big data solutions. However, other technologies are challenging Spark; for example, stream processing is arguably better in Flink than in Spark, since Flink offers true real-time processing.
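As a sketch of the micro-batch model mentioned above (the application name and the data source are hypothetical, and the spark-streaming dependency is assumed to be on the classpath), each batch below covers a two-second interval, which is why the results are near real-time rather than truly continuous.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf  = new SparkConf().setAppName("MicroBatchDemo")   // hypothetical application name
    val ssc   = new StreamingContext(conf, Seconds(2))         // each micro-batch covers 2 seconds
    val lines = ssc.socketTextStream("localhost", 9999)        // hypothetical text source
    lines.filter(_.contains("ERROR")).count().print()          // per-batch error count
    ssc.start()                                                // start receiving and processing batches
    ssc.awaitTermination()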
12 APACHE SPARK VERSIONS

Version   Original release date   Latest version   Release date
0.5       2012-06-12              0.5.1            2012-10-07
0.6       2012-10-14              0.6.2            2013-02-07
0.7       2013-02-27              0.7.3            2013-07-16
0.8       2013-09-25              0.8.1            2013-12-19
0.9       2014-02-02              0.9.2            2014-07-23
1.0       2014-05-30              1.0.2            2014-08-05
1.1       2014-09-11              1.1.1            2014-11-26
1.2       2014-12-18              1.2.2            2015-04-17
1.3       2015-03-13              1.3.1            2015-04-17
1.4       2015-06-11              1.4.1            2015-07-15
1.5       2015-09-09              1.5.2            2015-11-09
1.6       2016-01-04              1.6.3            2016-11-07
2.0       2016-07-26              2.0.2            2016-11-14
2.1       2016-12-28              2.1.1            2017-05-02
2.2       2017-07-11              2.2.0            2017-07-11
13 CONCLUSION AND FUTURE SCOPE

Apache Spark is opening up various opportunities for big data exploration and making it easier for organisations to solve different kinds of big data problems. Spark is one of the hottest technologies today, not only among data engineers; the majority of data scientists also prefer to work with Spark. Apache Spark is a fascinating platform for data scientists, with use cases spanning exploratory and operational analytics. Apache Spark has witnessed a continuous upward trajectory in the big data ecosystem. With IBM's recent announcement that it will educate more than one million data engineers and data scientists on Apache Spark, 2016 is unquestionably the year to learn Spark and pursue a lucrative career.