APACHE SPARK
1 INTRODUCTION
Industries are using Hadoop extensively to analyze their data sets. The
reason is that Hadoop framework is based on a simple programming model
(MapReduce) and it enables a computing solution that is scalable, flexible, fault-
tolerant and cost effective. Here, the main concern is to maintain speed in
processing large datasets in terms of waiting time between queries and waiting time
to run the program. Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing. Contrary to a common belief, Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark. Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only.
Apache Spark is an open source big data processing framework
built around speed, ease of use, and sophisticated analytics. It was originally
developed in 2009 in UC Berkeley's AMPLab and open sourced in 2010; it later became a top-level Apache project. Spark has several advantages compared to other big data and MapReduce technologies such as Hadoop and Storm. First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data, etc.) as
well as the source of data (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators.
And you can use it interactively to query data within the shell. In addition to Map
and Reduce operations, it supports SQL queries, streaming data, machine learning
and graph data processing. Developers can use these capabilities stand-alone or
combine them to run in a single data pipeline use case. Apache Spark provides
programmers with an application programming interface centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of data
items distributed over a cluster of machines, that is maintained in a fault-tolerant
way. It was developed in response to limitations in the MapReduce cluster
computing paradigm, which forces a particular linear dataflow structure on
distributed programs: MapReduce programs read input data from disk, map a
function across the data, reduce the results of the map, and store reduction results
on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. The availability of RDDs facilitates the implementation of both iterative algorithms, which visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated
database-style querying of data. The latency of such applications (compared to a
MapReduce implementation, as was common in Apache Hadoop stacks) may be
reduced by several orders of magnitude. Among the class of iterative algorithms are
the training algorithms for machine learning systems, which formed the initial
impetus for developing Apache Spark. Apache Spark requires a cluster manager and
a distributed storage system. For cluster management, Spark supports standalone
(native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage,
Spark can interface with a wide variety, including Hadoop Distributed File System
(HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3,
Kudu, or a custom solution can be implemented. Spark also supports a pseudo-
distributed local mode, usually used only for development or testing purposes,
where distributed storage is not required and the local file system can be used
instead; in such a scenario, Spark is run on a single machine with one executor per
CPU core.
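As a minimal sketch of this programming model, the classic word count can be written in a few lines of Scala against the RDD API; the application name, master URL, and input path below are illustrative placeholders rather than anything prescribed by Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] is used here only for illustration; on a real cluster this
    // would point at a standalone master, YARN, or Mesos.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("input.txt")          // hypothetical input path
      .flatMap(line => line.split("\\s+"))         // transformation
      .map(word => (word, 1))                      // transformation
      .reduceByKey(_ + _)                          // wide transformation (shuffle)

    counts.take(10).foreach(println)               // action: triggers the job
    sc.stop()
  }
}
```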
2 EVOLUTION OF APACHE SPARK
Spark began as one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
3 KEY FEATURES
Apache Spark has the following features.
1. Speed: Spark helps run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is
possible because it reduces the number of read/write operations to disk and stores the intermediate processing data in memory.
2. Supports multiple languages: Spark provides built-in APIs in Java, Scala,
or Python. Therefore, you can write applications in different languages.
3. Advanced Analytics: Spark supports not only Map and Reduce operations. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Spark takes MapReduce to the next level with less expensive shuffles in the data
processing. With capabilities like in-memory data storage and near real-time
processing, the performance can be several times faster than other big data
technologies. Spark also supports lazy evaluation of big data queries, which helps
with optimization of the steps in data processing workflows. It provides a higher-level API to improve developer productivity and a consistent architecture model for big data solutions. Spark holds intermediate results in memory rather than writing them to disk, which is very useful especially when you need to work on the same dataset multiple times. It's designed to be an execution engine that works both in-memory and on-disk. Spark operators perform external operations when data does not fit in memory, so Spark can be used for processing datasets larger than the aggregate memory in a cluster. Spark will attempt to store as much data in memory as possible and will then spill to disk. It can store part of a data set in memory and the remaining data on disk. You have to look at your data and use cases to assess the memory
requirements. With this in-memory data storage, Spark comes with a performance advantage. Other Spark features include:
 Supports more than just Map and Reduce functions.
 Optimizes arbitrary operator graphs.
 Lazy evaluation of big data queries which helps with the optimization of
the overall data processing workflow.
 Provides concise and consistent APIs in Scala, Java and Python.
 Offers interactive shell for Scala and Python. This is not available in Java
yet.
Spark is written in the Scala programming language and runs in a Java Virtual Machine (JVM) environment. It currently supports the following languages for developing applications using Spark (a brief interactive-shell sketch follows the list):
1. Scala
2. Java
3. Python
4. R
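For instance, in the Scala shell started with bin/spark-shell (or the Python shell started with bin/pyspark), a SparkContext named sc is already created, so data can be queried interactively; the log file in this hypothetical session is just an example.

```scala
// Inside bin/spark-shell; `sc` is pre-created by the shell.
val logs   = sc.textFile("server.log")           // hypothetical input file
val errors = logs.filter(_.contains("ERROR"))    // transformation: nothing runs yet
errors.count()                                   // action: returns the number of matching lines
```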
4 DETAILED DESCRIPTION
Apache Spark consists of several main components, including Spark Core and the upper-level libraries. Spark Core runs on different cluster managers and can access data in any Hadoop data source. In addition, many packages have been built to work with Spark Core and the upper-level libraries.
4.1. Spark Built on Hadoop
The following diagram shows three ways of how Spark can be built with
Hadoop components.
Fig.1 Spark built with Hadoop components
There are three ways of Spark deployment, as explained below (a minimal configuration sketch follows the list).
 Standalone: Spark Standalone deployment means Spark occupies the
place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
 Hadoop Yarn: Hadoop Yarn deployment means, simply, spark runs on
Yarn without any pre-installation or root access required. It helps to
integrate Spark into the Hadoop ecosystem or Hadoop stack. It allows other components to run on top of the stack.
 Spark in MapReduce (SIMR): Spark in MapReduce is used to launch
a Spark job in addition to a standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
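As a rough sketch, the deployment mode is selected through the master URL handed to Spark, either in code via SparkConf or on the command line via spark-submit --master; the host name below is illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The same application code runs unchanged; only the master URL changes.
val conf = new SparkConf()
  .setAppName("DeploymentModes")
  .setMaster("spark://master-host:7077")  // standalone Spark cluster
  // .setMaster("yarn")                   // run on Hadoop YARN
  // .setMaster("local[*]")               // local mode for development/testing

val sc = new SparkContext(conf)
```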
5 COMPONENTS OF SPARK
Fig.2 Components of spark
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL
It enables users to run SQL/HQL queries on top of Spark. Using Apache Spark SQL, we can process structured as well as semi-structured data. It also
provides an engine for Hive to run unmodified queries up to 100 times faster
on existing deployments.
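A minimal Spark SQL sketch, assuming the Spark 2.x SparkSession API; the JSON file and its columns are hypothetical example data.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .getOrCreate()

val people = spark.read.json("people.json")   // DataFrame with an inferred schema
people.createOrReplaceTempView("people")      // register as a SQL-queryable view

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```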
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD
(Resilient Distributed Datasets) transformations on those mini-batches of
data.
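A small sketch of this mini-batch model; the socket source on localhost:9999 and the 5-second batch interval are purely illustrative choices (the source could, for example, be fed with `nc -lk 9999`).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
// Each 5-second mini-batch becomes an RDD that ordinary transformations run on.
val ssc = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```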
MLlib (Machine Learning Library)
It is a scalable machine learning library that delivers both efficiency and high-quality algorithms. Apache Spark MLlib is one of the most popular choices for data scientists due to its capability of in-memory data processing, which improves the performance of iterative algorithms drastically.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It is the graph computation engine that enables users to process graph data at scale.
6 SPARK ARCHITECTURE
Apache Spark follows a master/slave architecture with two main daemons
and a cluster manager –
i. Master Daemon – (Master/Driver Process)
ii. Worker Daemon –(Slave Process)
Fig 3 Architecture
A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run their individual Java processes, and users can run them on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.
Fig 4 Execution of spark
Role of Driver in Spark Architecture
Spark Driver – Master Node of a Spark Application
It is the central point and the entry point of the Spark Shell (Scala, Python, and R).
The driver program runs the main() function of the application and is the place where the SparkContext is created. The Spark Driver contains various components –
DAGScheduler, TaskScheduler, BackendScheduler and BlockManager – responsible for translating the Spark user code into actual Spark jobs executed on the cluster.
 The driver program that runs on the master node of the spark cluster
schedules the job execution and negotiates with the cluster manager.
 It translates the RDD s into the execution graph and splits the graph into
multiple stages.
 Driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
Role of Executor in Spark Architecture
Executor is a distributed agent responsible for the execution of tasks. Every Spark application has its own executor processes. Executors usually run for the entire lifetime of a Spark application, a phenomenon known as static allocation of executors. However, users can also opt for dynamic allocation of executors, wherein they can add or remove Spark executors dynamically to match the overall workload.
 Executor performs all the data processing.
 Reads from and Writes data to external sources.
 Executor stores the computation results data in-memory, cache or on hard
disk drives.
 Interacts with the storage systems.
Role of Cluster Manager in Spark Architecture
Apache Spark is an engine for big data processing. One can run Spark in distributed mode on a cluster, where there is a master and any number of workers. The cluster manager schedules and divides the resources of the host machines that form the cluster. Its prime work is to divide resources across applications: it works as an external service for acquiring resources on the cluster and dispatches work to the cluster. Spark supports pluggable cluster management, and the cluster manager handles starting the executor processes.
7 RESILIENT DISTRIBUTED DATASETS
RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on the different nodes of the cluster. Each dataset in a Spark RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster.
Decomposing the name RDD:
 Resilient, i.e. fault-tolerant with the help of RDD lineage graph (DAG) and so
able to recompute missing or damaged partitions due to node failures.
 Distributed, since Data resides on multiple nodes.
 Dataset represents records of the data you work with. The user can load the
data set externally, for example from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
Hence, each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are also fault tolerant, i.e., they possess self-recovery in the case of failure. Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations; how data sharing works in plain MapReduce, and why it is not so efficient, is discussed in Section 7.4 below.
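For illustration (the paths are hypothetical, and an existing SparkContext named sc is assumed), an RDD can be created either from an in-memory collection or from external storage:

```scala
// From an existing in-memory collection, split into 4 partitions:
val numbers = sc.parallelize(1 to 1000, 4)

// From an external source; each line of the (hypothetical) file becomes one element:
val lines = sc.textFile("hdfs:///data/input.txt")

// Partitions are what Spark distributes across the nodes of the cluster:
println(numbers.getNumPartitions)
```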
7.1 Features of RDD in Spark
Several features of Apache Spark RDD are:
In-memory Computation
Spark RDDs have a provision of in-memory computation. It stores
intermediate results in distributed memory (RAM) instead of stable storage
(disk).
Lazy Evaluations
All transformations in Apache Spark are lazy, in that they do not
compute their results right away. Instead, they just remember the
transformations applied to some base data set.
Spark computes the transformations only when an action requires a result for the driver program.
Fault Tolerance
Spark RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure: each RDD remembers how it was created from other datasets (by transformations like map, join or groupBy) and can use that lineage to recreate itself.
Immutability
Data is safe to share across processes. It can also be created or
retrieved anytime which makes caching, sharing & replication easy. Thus, it is a
way to reach consistency in computations.
Partitioning
Partitioning is the fundamental unit of parallelism in Spark RDD.
Each partition is one logical division of data which is mutable. One can create a
partition through some transformations on existing partitions.
Persistence
Users can state which RDDs they will reuse and choose a storage
strategy for them (e.g., in-memory storage or on Disk).
7.2 Spark RDD Operations
RDD in Apache Spark supports two types of operations:
 Transformation
 Actions
Transformations
Spark RDD Transformations are functions that take an RDD as the input and
produce one or many RDDs as the output. They do not change the input RDD
(since RDDs are immutable and hence one cannot change it), but always produce
one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey(), etc.
Transformations are lazy operations on an RDD in Apache Spark: they create one or many new RDDs, which are executed only when an action occurs. Hence, a transformation creates a new dataset from an existing one. Certain transformations can be pipelined, which is an optimization method that Spark uses to improve the performance of computations. There are two kinds of transformations: narrow transformations and wide transformations.
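A brief sketch of typical transformations, assuming an existing RDD[String] named lines; the first three are narrow, while reduceByKey is wide because it requires a shuffle.

```scala
val words    = lines.flatMap(_.split(" "))     // narrow: no data movement between partitions
val nonEmpty = words.filter(_.nonEmpty)        // narrow
val pairs    = nonEmpty.map(word => (word, 1)) // narrow
val counts   = pairs.reduceByKey(_ + _)        // wide: shuffles data across partitions

// Only the lineage has been built so far; no job has actually run yet.
```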
Actions
An action in Spark returns the final result of RDD computations. It triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final result to the driver program or write
it out to the file system. The lineage graph is the dependency graph of all the RDDs that a given RDD depends on.
Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. An action is one of the ways to send a result from the executors to the driver; first(), take(), reduce(), collect() and count() are some of the actions in Spark. Using transformations, one can create an RDD from an existing one, but when we want to work with the actual dataset, we use an action. When an action occurs it does not create a new RDD, unlike a transformation. Thus, actions are RDD operations that return non-RDD values; an action stores its result either in the driver or in an external storage system, and it sets the laziness of RDDs into motion.
7.3 Lazy Evaluation
Lazy evaluation in Spark means that execution will not start until an action is triggered. In Spark, lazy evaluation comes into the picture when Spark transformations occur. Transformations are lazy in nature, meaning that when we call some operation on an RDD, it does not execute immediately; Spark only maintains a record of which operation has been called. We can think of a Spark RDD as the data that we build up through transformations. Since transformations are lazy, we can execute an operation at any time by calling an action on the data. Hence, with lazy evaluation, data is not loaded until it is necessary.
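A small sketch of lazy evaluation in practice (the file name is hypothetical and sc is an existing SparkContext): nothing is read or computed until the final action.

```scala
val events   = sc.textFile("events.log")            // nothing is read yet
val errors   = events.filter(_.contains("ERROR"))   // still nothing executed
val prefixes = errors.map(_.take(19))               // only the lineage is being recorded

// This action finally forces Spark to load the file and run the whole chain:
val firstTen = prefixes.take(10)
```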
7.4 Data Sharing Using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk IO. In most Hadoop applications, more than 90% of the time is spent doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing: it stores the state of memory as an object across jobs, and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than over the network or from disk.
Let us now try to find out how iterative and interactive operations take place in
Spark RDD.
7.5 Iterative Operations on Spark RDD
The illustration given below shows the iterative operations on Spark RDD.
It will store intermediate results in a distributed memory instead of Stable
storage (Disk) and make the system faster.
Note: If the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), then it will store those results on disk.
Fig 5: Iterative operations on Spark RDD
7.6 Interactive Operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different
queries are run on the same set of data repeatedly, this particular data can be
kept in memory for better execution times.
Fig 6 Interactive operations on Spark RDD
By default, each transformed RDD may be recomputed each time you run an
action on it. However, you may also persist an RDD in memory, in which case
Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
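A persistence sketch under the same assumptions (hypothetical file, existing SparkContext sc): the first action computes and caches the RDD, and later actions reuse the cached partitions.

```scala
import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("access.log")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if it does not fit in memory
// errors.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

val total    = errors.count()                                   // computes and caches
val fromHost = errors.filter(_.contains("10.0.0.42")).count()   // served from the cache
```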
8 SPARK CONTEXT
SparkContext is the entry gate of Apache Spark functionality. The most important step of any Spark driver application is to generate a SparkContext. It allows your Spark application to access the Spark cluster with the help of the resource manager, and it interacts with the cluster manager.
Fig 7 Spark context
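A minimal sketch of creating a SparkContext in a driver program (the application name and master URL are illustrative); once created, it is the handle through which RDDs, broadcast variables and accumulators are made.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyDriverApp")   // illustrative application name
  .setMaster("local[*]")       // illustrative master URL; a cluster manager URL in production

val sc = new SparkContext(conf)  // the entry gate: one per driver application

val data    = sc.parallelize(Seq(1, 2, 3))       // RDDs are created through the context
val factor  = sc.broadcast(10)                   // so are broadcast variables...
val counter = sc.longAccumulator("processed")    // ...and accumulators

sc.stop()  // releases executors and resources held via the cluster manager
```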
9 SPARK EXECUTION
When a client submits Spark application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, the driver program also performs certain optimizations, like pipelining transformations, and then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. The tasks are then bundled to be sent to the Spark cluster.
The driver program then talks to the cluster manager and negotiates for
resources. The cluster manager then launches executors on the worker nodes on
behalf of the driver. At this point, the driver sends tasks to the executors based on data placement. Before the executors begin execution, they register themselves with the driver program so that the driver has a holistic view of all the executors. The executors then start executing the various tasks assigned by the driver program. At any point of time when the Spark application is running, the driver program will monitor the set of executors that run. The driver program in the Spark architecture also schedules future tasks based on data placement by tracking the location of cached data. When the driver program's main() method exits or when it calls the stop() method of the SparkContext, it will terminate all the executors and release the resources from the cluster manager.
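One hedged way to peek at the plan described above is RDD.toDebugString, which prints the lineage with its shuffle (stage) boundaries; the job itself runs only when the action is submitted. The input path here is hypothetical.

```scala
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)          // shuffle dependency => a new stage in the physical plan

println(counts.toDebugString)  // shows the lineage; indentation marks stage boundaries
counts.collect()               // submitting this action turns the stages into tasks
```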
10 APACHE SPARK USE CASES
The creators of Apache Spark conducted a survey on why companies should use an in-memory computing framework like Apache Spark, and the results are overwhelming:
91% use Apache Spark because of its performance gains.
77% use Apache Spark as it is easy to use.
71% use Apache Spark due to the ease of deployment.
64% use Apache Spark to leverage advanced analytics
52% use Apache Spark for real-time streaming.
Apache Spark at Alibaba
One of the world's largest e-commerce platforms, Alibaba Taobao, runs some of the largest Apache Spark jobs in the world in order to analyse hundreds of petabytes of data on its e-commerce platform. Some of the Spark jobs, which perform feature extraction on image data, run for several weeks. Millions of merchants and users interact with Alibaba Taobao's e-commerce platform. Each of these interactions is represented as a complicated, large graph, and Apache Spark is used for fast processing of sophisticated machine learning on this data.
Apache Spark at eBay
eBay uses Apache Spark to provide targeted offers, enhance customer experience, and optimize overall performance. Apache Spark is leveraged at eBay through Hadoop YARN, which manages all the cluster resources to run the generic tasks. eBay's Spark users leverage Hadoop clusters in the range of 2000 nodes, 20,000 cores, and 100 TB of RAM through YARN.
Apache Spark at MyFitnessPal
The largest health and fitness community, MyFitnessPal, helps people achieve a healthy lifestyle through better diet and exercise. MyFitnessPal uses Apache Spark to clean the data entered by users, with the end goal of identifying high-quality food items. Using Spark, MyFitnessPal has been able to scan through the food calorie data of about 80 million users. Earlier, MyFitnessPal used Hadoop to
process 2.5TB of data and that took several days to identify any errors or missing
information in it.
Apache Spark at Yahoo for News Personalization
Yahoo uses Apache Spark for personalizing its news web pages and for
targeted advertising. It uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and to categorize the news stories to find out what kind of users would be interested in reading each category of news.
Earlier, the machine learning algorithm for news personalization required 15,000 lines of C++ code; now, with Spark and Scala, it takes just 120 lines of Scala code. The algorithm was ready for production use after just 30 minutes of training on a hundred million datasets.
Apache Spark at Conviva
The largest streaming video company, Conviva, uses Apache Spark to deliver quality of service to its customers by removing screen buffering and learning in detail about network conditions in real time. This information is stored in the video player to manage live video traffic coming from close to 4 billion video feeds every month and to ensure maximum play-through. Apache Spark is helping Conviva reduce its customer churn to a great extent by providing its customers with a smooth video viewing experience.
Apache Spark at Netflix
Netflix uses Apache Spark for real-time stream processing to provide online
recommendations to its customers. Streaming devices at Netflix send events that capture all member activities and play a vital role in personalization. Netflix processes 450 billion events per day, which flow to server-side applications and are directed to Apache Kafka.
Apache Spark at Pinterest
Pinterest is using Apache Spark to discover trends in high-value user engagement data so that it can react to developing trends in real time by getting an in-depth understanding of user behaviour on the website.
Apache Spark at TripAdvisor
TripAdvisor, a leading travel website that helps users plan a perfect trip, is using Apache Spark to speed up its personalized customer recommendations. TripAdvisor uses Apache Spark to provide advice to millions of travellers by comparing hundreds of websites to find the best hotel prices for its customers. Reading and processing hotel reviews into a readable format is also done with the help of Apache Spark.
11 DISADVANTAGES
Near real time processing
In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like map, reduce, and join, and the results of these operations are returned in batches. Thus, Spark Streaming is not true real-time processing but near real-time processing of live data: micro-batch processing takes place in Spark Streaming.
No file management system
Apache Spark does not have its own file management system, so it relies on some other platform like Hadoop or another cloud-based platform, which is one of the known Spark issues.
Expensive
The in-memory capability can become a bottleneck when we want cost-efficient processing of big data, as keeping data in memory is quite expensive: the memory consumption is very high, and it is not handled in a user-friendly manner. Apache Spark requires lots of RAM to run in-memory, so the cost of running Spark is quite high.
Although Spark has many drawbacks, it is still popular in the market as a big data solution. However, there are various technologies overtaking Spark in specific areas; for example, stream processing is often better with Flink than with Spark, as Flink offers true real-time processing.
12 APACHE SPARK VERSIONS
Version    Original release date    Latest version    Release date
0.5 2012-06-12 0.5.1 2012-10-07
0.6 2012-10-14 0.6.2 2013-02-07
0.7 2013-02-27 0.7.3 2013-07-16
0.8 2013-09-25 0.8.1 2013-12-19
0.9 2014-02-02 0.9.2 2014-07-23
1.0 2014-05-30 1.0.2 2014-08-05
1.1 2014-09-11 1.1.1 2014-11-26
1.2 2014-12-18 1.2.2 2015-04-17
1.3 2015-03-13 1.3.1 2015-04-17
1.4 2015-06-11 1.4.1 2015-07-15
1.5 2015-09-09 1.5.2 2015-11-09
1.6 2016-01-04 1.6.3 2016-11-07
2.0 2016-07-26 2.0.2 2016-11-14
2.1 2016-12-28 2.1.1 2017-05-02
2.2 2017-07-11 2.2.0 2017-07-11
13 CONCLUSION AND FUTURE SCOPE
Apache Spark is opening up various opportunities for big data analysis and making it easier for organizations to solve different kinds of big data problems. Spark is one of the hottest technologies now, not only among data engineers; the majority of data scientists also highly prefer to work with Spark. Apache Spark is a fascinating platform for data scientists, with use cases spanning exploratory and operational analytics.
Apache Spark has witnessed a continuous upward trajectory in the big data ecosystem. With IBM's recent announcement that it will educate more than one million data engineers and data scientists on Apache Spark, 2016 is unquestionably the year to learn Spark and pursue a profitable career.
More Related Content

What's hot

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated ArchitectureDatabricks
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at FacebookDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet FormatYue Chen
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 

What's hot (20)

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 

Similar to Apache spark

Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache SparkEdureka!
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
Apachespark 160612140708
Apachespark 160612140708Apachespark 160612140708
Apachespark 160612140708Srikrishna k
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureJen Stirrup
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 

Similar to Apache spark (20)

SparkPaper
SparkPaperSparkPaper
SparkPaper
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Apachespark 160612140708
Apachespark 160612140708Apachespark 160612140708
Apachespark 160612140708
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Module01
 Module01 Module01
Module01
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
Apache Spark Notes
Apache Spark NotesApache Spark Notes
Apache Spark Notes
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
Why Spark over Hadoop?
Why Spark over Hadoop?Why Spark over Hadoop?
Why Spark over Hadoop?
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 

Recently uploaded

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 

Recently uploaded (20)

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 

Apache spark

  • 1. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 1 1 INTRODUCTION Industries are using Hadoop extensively to analyze their data sets. The reason is that Hadoop framework is based on a simple programming model (MapReduce) and it enables a computing solution that is scalable, flexible, fault- tolerant and cost effective. Here, the main concern is to maintain speed in processing large datasets in terms of waiting time between queries and waiting time to run the program. Spark was introduced by Apache Software Foundation for speeding up the Hadoop computational computing software process. As against a common belief, Spark is not a modified version of Hadoop and is not, really, dependent on Hadoop because it has its own cluster management. Hadoop is just one of the ways to implement Spark. Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purpose only. Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley s AMPLab, and open sourced in 2010 as an Apache project. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm. First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data etc) as
  • 2. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 2 well as the source of data (batch v. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high level operators. And you can use it interactively to query data within the shell. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case. Apache Spark provides programmers with an application programming interface cantered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications (compared to a MapReduce implementation, as was common in Apache Hadoop stacks) may be reduced by several orders of magnitude. Among the class of iterative algorithms are
  • 3. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 3 the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark. Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety, including Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu, or a custom solution can be implemented. Spark also supports a pseudo- distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core. 2 EVOLUTION OF APACHE SPARK Spark is one of Hadoop s sub project developed in 2009 in UC Berkeley s AMPLab by Matei Zaharia. It was Open Sourced in 2010 under a BSD license. It was donated to Apache software foundation in 2013, and now Apache Spark has become a top level Apache project from Feb-2014. 3 KEY FEATURES Apache Spark has following features. 1. Speed: Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory, and 10 times faster when running on disk. This is
  • 4. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 4 possible by reducing number of read/write operations to disk. It stores the intermediate processing data in memory. 2. Supports multiple languages: Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. 3. Advanced Analytics: Spark not only supports Map and reduce . It also supports SQL queries, Streaming data, Machine learning (ML), and Graph algorithms. Spark takes MapReduce to the next level with less expensive shuffles in the data processing. With capabilities like in-memory data storage and near real-time processing, the performance can be several times faster than other big data technologies. Spark also supports lazy evaluation of big data queries, which helps with optimization of the steps in data processing workflows. It provides a higher level API to improve developer productivity and a consistent architect model for big data solutions. Spark holds intermediate results in memory rather than writing them to disk which is very useful especially when you need to work on the same dataset multiple times. It s designed to be an execution engine that works both in-memory and on-disk. Spark operators perform external operations when data does not fit in memory. Spark can be used for processing datasets that larger than the aggregate memory in a cluster. Spark will attempt to store as much as data in memory and then will spill to disk. It can store part of a data set in memory and the remaining data on the disk. You have to look at your data and use cases to assess the memory
  • 5. APACHE SPARK DEPARTMENT OF COMPUTER SCIENCE & APPLICATIONS, SJCET, PALAI Page: 5 requirements. With this in-memory data storage, Spark comes with performance advantage. Other Spark features include:  Supports more than just Map and Reduce functions.  Optimizes arbitrary operator graphs.  Lazy evaluation of big data queries which helps with the optimization of the overall data processing workflow.  Provides concise and consistent APIs in Scala, Java and Python.  Offers interactive shell for Scala and Python. This is not available in Java yet. Spark is written in Scala Programming Language and runs on Java Virtual Machine (JVM) environment. It currently supports the following languages for developing applications using Spark: 1. Scala 2. Java 3. Python 4. R 4 DETAILED DESCRIPTION Apache Spark consists of several main components including Spark core and upper-level libraries Spark core runs on different cluster managers and can access data in any Hadoop data source. In addition, many packages have been built to work with Spark core and the upper-level libraries.
4.1 Spark Built on Hadoop

The following diagram shows three ways in which Spark can be built with Hadoop components.

Fig.1 Spark built with Hadoop components

There are three ways of deploying Spark, as explained below.
• Standalone: Spark standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN: Hadoop YARN deployment means, simply, that Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or the Hadoop stack and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR): Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.

5 COMPONENTS OF SPARK

Fig.2 Components of Spark

Apache Spark Core
Spark Core is the underlying general execution engine of the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

Spark SQL
Spark SQL enables users to run SQL/HQL queries on top of Spark. Using Apache Spark SQL, we can process structured as well as semi-structured data. It also provides an engine for Hive to run unmodified queries up to 100 times faster on existing deployments.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.

MLlib (Machine Learning Library)
MLlib is the scalable machine learning library which delivers both efficiency and high-quality algorithms. Apache Spark MLlib is one of the most popular choices for data scientists because of its in-memory data processing, which drastically improves the performance of iterative algorithms.

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It is the graph computation engine built on top of Spark that enables graph data to be processed at scale.

6 SPARK ARCHITECTURE

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
i. Master Daemon – (Master/Driver Process)
ii. Worker Daemon – (Slave Process)
Fig 3 Architecture

A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on the same machines (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.

Fig 4 Execution of Spark

Role of Driver in Spark Architecture
Spark Driver – Master Node of a Spark Application
The driver is the central point and the entry point of the Spark shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the SparkContext is created. The Spark driver contains various components – DAGScheduler, TaskScheduler, BackendScheduler and BlockManager – responsible for translating Spark user code into actual Spark jobs executed on the cluster.
• The driver program, which runs on the master node of the Spark cluster, schedules the job execution and negotiates with the cluster manager.
• It translates the RDDs into the execution graph and splits the graph into multiple stages.
• The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.

Role of Executor in Spark Architecture
An executor is a distributed agent responsible for the execution of tasks. Every Spark application has its own executor processes. Executors usually run for the entire lifetime of a Spark application, a behaviour known as "static allocation of executors". However, users can also opt for dynamic allocation of executors, in which Spark executors are added or removed dynamically to match the overall workload.
• The executor performs all the data processing.
• It reads data from and writes data to external sources.
• The executor stores the computation results in memory, in a cache, or on hard disk drives.
• It interacts with the storage systems.
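As a rough sketch of how executor resources can be configured, the snippet below sets the per-executor memory and cores through SparkConf and opts in to dynamic allocation. The application name is hypothetical, and depending on the cluster manager an external shuffle service may also need to be enabled for dynamic allocation to work.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ExecutorSizingDemo")                  // hypothetical application name
      .set("spark.executor.memory", "4g")                // memory per executor
      .set("spark.executor.cores", "2")                  // cores per executor
      .set("spark.dynamicAllocation.enabled", "true")    // add/remove executors with the workload
    val sc = new SparkContext(conf)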
Role of Cluster Manager in Spark Architecture
Apache Spark is an engine for big data processing. One can run Spark in distributed mode on a cluster, where there is a master and n number of workers. The cluster manager schedules and divides the resources of the host machines that form the cluster. Its prime work is to divide resources across applications; it works as an external service for acquiring resources on the cluster and dispatches work for the cluster. Spark supports pluggable cluster management, and the cluster manager in Spark handles starting the executor processes.

7 RESILIENT DISTRIBUTED DATASETS

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an immutable collection of objects computed on the different nodes of the cluster. Each dataset in a Spark RDD is logically partitioned across many servers so that the partitions can be computed on different nodes of the cluster.

Decomposing the name RDD:
• Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions caused by node failures.
• Distributed, since the data resides on multiple nodes.
• Dataset, representing the records of the data you work with. The user can load the dataset externally from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.

Hence, each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they possess self-recovery in the case of failure. Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations; Section 7.4 discusses how MapReduce operations take place and why they are not so efficient.

7.1 Features of RDD in Spark

Several features of Apache Spark RDDs are:

In-memory Computation
Spark RDDs have a provision for in-memory computation. They store intermediate results in distributed memory (RAM) instead of stable storage (disk).

Lazy Evaluation
All transformations in Apache Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. Spark computes the transformations only when an action requires a result for the driver program.
Fault Tolerance
Spark RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure. Using lineage, each RDD remembers how it was created from other datasets (by transformations such as map, join, or groupBy) and can recreate itself.

Immutability
Data is safe to share across processes. It can also be created or retrieved at any time, which makes caching, sharing and replication easy. Thus, immutability is a way to reach consistency in computations.

Partitioning
Partitioning is the fundamental unit of parallelism in a Spark RDD. Each partition is one logical, immutable division of the data; new partitions are created by applying transformations to existing partitions.

Persistence
Users can state which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on disk).

7.2 Spark RDD Operations

An RDD in Apache Spark supports two types of operations:
• Transformations
• Actions
Transformations
Spark RDD transformations are functions that take an RDD as the input and produce one or more RDDs as the output. They do not change the input RDD (since RDDs are immutable and hence cannot be changed), but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey(), etc. Transformations are lazy operations on an RDD: they create one or more new RDDs, which are executed only when an action occurs. Hence, a transformation creates a new dataset from an existing one. Certain transformations can be pipelined, which is an optimization method Spark uses to improve the performance of computations. There are two kinds of transformations: narrow transformations and wide transformations.

Actions
An action in Spark returns the final result of the RDD computations. It triggers execution using the lineage graph to load the data into the original RDD, carry out all the intermediate transformations, and return the final results to the driver program or write
them out to the file system. The lineage graph is the dependency graph of all the parent RDDs of an RDD. Actions are RDD operations that produce non-RDD values; they materialize a value in a Spark program. An action is one of the ways to send results from the executors to the driver. first(), take(), reduce(), collect(), and count() are some of the actions in Spark.

Using transformations, one can create an RDD from an existing one, but when we want to work with the actual dataset, we use an action. When an action occurs it does not create a new RDD, unlike a transformation. Thus, actions are RDD operations that give non-RDD values. An action stores its result either in the driver or in an external storage system, and it sets the laziness of RDDs in motion.

7.3 Lazy Evaluation

Lazy evaluation in Spark means that execution does not start until an action is triggered. In Spark, lazy evaluation comes into the picture when transformations occur. Transformations are lazy in nature: when we call some operation on an RDD, it does not execute immediately; Spark only maintains a record of the operations being called. We can think of a Spark RDD as the data that we built up through transformations. Since transformations are lazy, we can execute the operations at any time by calling an action on the data. Hence, in lazy evaluation, data is not loaded until it is necessary.
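A minimal sketch of this behaviour, assuming an existing SparkContext named sc: the transformations below only record lineage, and nothing is computed until the final action runs.

    val nums    = sc.parallelize(1 to 1000000)     // build an RDD from a local collection
    val squares = nums.map(n => n.toLong * n)      // transformation: only lineage is recorded
    val evens   = squares.filter(_ % 2 == 0)       // still lazy: no job has run yet
    val total   = evens.count()                    // action: triggers the actual computation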
7.4 Data Sharing Using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations. Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing: it keeps the state of memory as an object across jobs, and the object is shareable between those jobs. Data sharing in memory is 10 to 100 times faster than over the network or from disk. Let us now see how iterative and interactive operations take place on Spark RDDs.

7.5 Iterative Operations on Spark RDD

The illustration given below shows iterative operations on a Spark RDD. Spark stores the intermediate results in distributed memory instead of stable storage (disk), which makes the system faster.
Note: if the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), Spark stores those results on the disk.
Fig 5 Iterative operations on Spark RDD

7.6 Interactive Operations on Spark RDD

This illustration shows interactive operations on a Spark RDD. If different queries are run on the same set of data repeatedly, that particular data can be kept in memory for better execution times.

Fig 6 Interactive operations on Spark RDD

By default, each transformed RDD may be recomputed every time you run an action on it. However, you may also persist an RDD in memory, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
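A small sketch of this reuse pattern, assuming an existing SparkContext named sc and a hypothetical log file path: the first action materialises and caches the filtered RDD, and the second action reuses the cached partitions instead of re-reading the file.

    val logs   = sc.textFile("hdfs:///logs/access.log")                  // hypothetical input path
    val errors = logs.filter(line => line.contains("ERROR")).cache()     // mark for in-memory reuse
    val total  = errors.count()                                          // first action: computes and caches
    val byHost = errors.map(line => line.split(" ")(0)).countByValue()   // reuses the cached data

Where the data may not fit entirely in memory, persist(StorageLevel.MEMORY_AND_DISK) can be used instead of cache().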
8 SPARK CONTEXT

The SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create the SparkContext. It allows the Spark application to access the Spark cluster with the help of the resource manager, and it is the component that interacts with the cluster manager.

Fig. SparkContext

9 SPARK EXECUTION

When a client submits Spark user application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, the driver program also performs certain optimizations, such as pipelining transformations, and then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. The tasks are then bundled and sent to the Spark cluster.
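A minimal driver-program sketch illustrating these steps (the object name and input argument are hypothetical): the narrow transformations and the shuffle introduced by reduceByKey are what the DAGScheduler turns into stages, and the final action is what actually triggers the job.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordLengths {                                              // hypothetical application
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordLengths")
        val sc   = new SparkContext(conf)                             // entry gate to Spark functionality

        val words = sc.textFile(args(0)).flatMap(_.split("\\s+"))     // narrow transformations
        val byLen = words.map(w => (w.length, 1)).reduceByKey(_ + _)  // shuffle introduces a new stage
        byLen.collect().foreach(println)                              // action: DAG -> stages -> tasks

        sc.stop()                                                     // release executors and cluster resources
      }
    }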
The driver program then talks to the cluster manager and negotiates for resources. The cluster manager launches executors on the worker nodes on behalf of the driver. Before the executors begin execution, they register themselves with the driver program so that the driver has a holistic view of all the executors, and the driver sends them tasks based on data placement. The executors then start executing the various tasks assigned by the driver program. At any point while the Spark application is running, the driver program monitors the set of executors that are running. The driver program also schedules future tasks based on data placement by tracking the location of cached data. When the driver program's main() method exits, or when it calls the stop() method of the SparkContext, it terminates all the executors and releases the resources from the cluster manager.

10 APACHE SPARK USE CASES

The creators of Apache Spark conducted a survey on "Why should companies use an in-memory computing framework like Apache Spark?" and the results were overwhelming:
• 91% use Apache Spark because of its performance gains.
• 77% use Apache Spark because it is easy to use.
• 71% use Apache Spark due to the ease of deployment.
• 64% use Apache Spark to leverage advanced analytics.
• 52% use Apache Spark for real-time streaming.
Apache Spark at Alibaba
One of the world's largest e-commerce platforms, Alibaba Taobao, runs some of the largest Apache Spark jobs in the world in order to analyse hundreds of petabytes of data on its e-commerce platform. Some of the Spark jobs, which perform feature extraction on image data, run for several weeks. Millions of merchants and users interact with Alibaba Taobao's e-commerce platform; each of these interactions is represented as a complicated large graph, and Apache Spark is used for fast processing of sophisticated machine learning on this data.

Apache Spark at eBay
eBay uses Apache Spark to provide targeted offers, enhance the customer experience, and optimize overall performance. Apache Spark is leveraged at eBay through Hadoop YARN, which manages all the cluster resources to run generic tasks. eBay's Spark users leverage Hadoop clusters in the range of 2,000 nodes, 20,000 cores, and 100 TB of RAM through YARN.

Apache Spark at MyFitnessPal
The largest health and fitness community, MyFitnessPal, helps people achieve a healthy lifestyle through better diet and exercise. MyFitnessPal uses Apache Spark to clean the data entered by users with the end goal of identifying high-quality food items. Using Spark, MyFitnessPal has been able to scan through the food calorie data of about 80 million users. Earlier, MyFitnessPal used Hadoop to process 2.5 TB of data, and that took several days to identify any errors or missing information in it.
Apache Spark at Yahoo for News Personalization
Yahoo uses Apache Spark for personalizing its news web pages and for targeted advertising. It uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and to categorize news stories to find out what kind of users would be interested in reading each category of news. Earlier, the machine learning algorithm for news personalization required 15,000 lines of C++ code; now, with Spark and Scala, the algorithm has just 120 lines of Scala code. The algorithm was ready for production use after just 30 minutes of training on a hundred million datasets.

Apache Spark at Conviva
The largest streaming video company, Conviva, uses Apache Spark to deliver quality of service to its customers by removing screen buffering and learning in detail about network conditions in real time. This information is stored in the video player to manage live video traffic coming from close to 4 billion video feeds every month, to ensure maximum play-through. Apache Spark is helping Conviva reduce its customer churn to a great extent by providing its customers with a smooth video viewing experience.
Apache Spark at Netflix
Netflix uses Apache Spark for real-time stream processing to provide online recommendations to its customers. Streaming devices at Netflix send events which capture all member activities and play a vital role in personalization. Netflix processes 450 billion events per day, which flow to server-side applications and are directed to Apache Kafka.

Apache Spark at Pinterest
Pinterest is using Apache Spark to discover trends in high-value user engagement data so that it can react to developing trends in real time by getting an in-depth understanding of user behaviour on the website.

Apache Spark at TripAdvisor
TripAdvisor, a leading travel website that helps users plan a perfect trip, is using Apache Spark to speed up its personalized customer recommendations. TripAdvisor uses Apache Spark to provide advice to millions of travellers by comparing hundreds of websites to find the best hotel prices for its customers. Reading and processing hotel reviews into a readable format is also done with the help of Apache Spark.
11 DISADVANTAGES

Near real-time processing
In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like map, reduce, and join, and the results of these operations are returned in batches. Thus, Spark does not provide true real-time processing; it is near real-time processing of live data, because micro-batch processing takes place in Spark Streaming. (A brief sketch of this micro-batch model appears at the end of this section.)

No file management system
Apache Spark does not have its own file management system, so it relies on some other platform like Hadoop or another cloud-based platform, which is one of the known issues with Spark.

Expensive
The in-memory capability can become a bottleneck when we want cost-efficient processing of big data, because keeping data in memory is quite expensive: the memory consumption is very high and is not handled in a user-friendly manner. Apache Spark requires a lot of RAM to run in-memory, so the cost of Spark is quite high.

Although Spark has these drawbacks, it is still popular in the market for big data solutions. However, other technologies are challenging Spark; for example, stream processing is arguably better in Flink than in Spark, since Flink offers true real-time processing.
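As a sketch of the micro-batch model mentioned above (the application name and the data source are hypothetical, and the spark-streaming dependency is assumed to be on the classpath), each batch below covers a two-second interval, which is why the results are near real-time rather than truly continuous.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf  = new SparkConf().setAppName("MicroBatchDemo")   // hypothetical application name
    val ssc   = new StreamingContext(conf, Seconds(2))         // each micro-batch covers 2 seconds
    val lines = ssc.socketTextStream("localhost", 9999)        // hypothetical text source
    lines.filter(_.contains("ERROR")).count().print()          // per-batch error count
    ssc.start()                                                // start receiving and processing batches
    ssc.awaitTermination()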
12 APACHE SPARK VERSIONS

Version   Original release date   Latest version   Release date
0.5       2012-06-12              0.5.1            2012-10-07
0.6       2012-10-14              0.6.2            2013-02-07
0.7       2013-02-27              0.7.3            2013-07-16
0.8       2013-09-25              0.8.1            2013-12-19
0.9       2014-02-02              0.9.2            2014-07-23
1.0       2014-05-30              1.0.2            2014-08-05
1.1       2014-09-11              1.1.1            2014-11-26
1.2       2014-12-18              1.2.2            2015-04-17
1.3       2015-03-13              1.3.1            2015-04-17
1.4       2015-06-11              1.4.1            2015-07-15
1.5       2015-09-09              1.5.2            2015-11-09
1.6       2016-01-04              1.6.3            2016-11-07
2.0       2016-07-26              2.0.2            2016-11-14
2.1       2016-12-28              2.1.1            2017-05-02
2.2       2017-07-11              2.2.0            2017-07-11
13 CONCLUSION AND FUTURE SCOPE

Apache Spark is opening up various opportunities for big data exploration and making it easier for organisations to solve different kinds of big data problems. Spark is one of the hottest technologies today, not only among data engineers; the majority of data scientists also prefer to work with Spark. Apache Spark is a fascinating platform for data scientists, with use cases spanning exploratory and operational analytics. Apache Spark has witnessed a continuous upward trajectory in the big data ecosystem. With IBM's recent announcement that it will educate more than one million data engineers and data scientists on Apache Spark, 2016 is unquestionably the year to learn Spark and pursue a lucrative career.