Naveen P.N
Trainer
Module 01 – Apache Spark Introduction
NPN Training – Training is the essence of success and we are committed to it
www.npntraining.com
Topics for the Module
History of Apache Spark
Batch VS Real-Time Processing
Limitation of MapReduce
Introduction to Apache Spark
Features of Apache Spark
Data Sharing in MapReduce
Data Sharing in Apache Spark
Understanding Spark Deployment
After completing the module, you will be able to understand:
Overview of Spark Eco-System
Overview of Spark Architecture
Understanding Spark Modes
Hadoop VS Spark
Eco-System of Hadoop VS Spark
Spark Use-cases
Introduction to RDD
RDD Traits
History of Spark
2009 – Started at UC Berkeley AMPLab by Matei Zaharia.
2010 – Open sourced under a BSD license.
2013 – The project was donated to the Apache Foundation and the license was changed to Apache 2.0.
2014 – Became an Apache Top-Level Project. Used by the engineering team at Databricks to set a world record in large-scale sorting.
Present – Exists as a next-generation real-time and batch processing framework.
Batch VS Real-Time Processing
The points below compare batch and real-time analytics in enterprise use cases:
Real-Time Processing
1. Data is processed instantaneously, upon data entry or command receipt.
2. Execution must meet stringent response-time constraints.
Example: Fraud detection.
Batch Processing
1. A large group of data/transactions is processed in a single run.
2. Jobs run without any manual intervention.
3. The entire data set is pre-selected and fed in using command-line parameters and scripts.
4. It is used to execute multiple operations, handle heavy data loads, generate reports, and run offline data workflows.
Example: A regular weekly report required for decision making.
Limitation of MapReduce
The limitations of MapReduce in Hadoop are listed below:
Unsuitable for real-time processing
Being batch oriented, it takes minutes to execute jobs, depending on the amount of data and the number of nodes in the cluster.
Unsuitable for trivial operations
For operations like Filter and Join, you might need to rewrite the jobs, which becomes complex because of the key-value pattern.
Unfit for large data on the network
Although it works on the data locality principle, it cannot efficiently process jobs that require a lot of data to be shuffled over the network.
Limitation of MapReduce Contd…
Unfit for graph processing
Graph processing requires a library such as Apache Giraph, which adds complexity on top of MapReduce.
Unfit for iterative execution
Being stateless in execution, MapReduce does not fit use cases like K-means that need iterative execution.
Unsuitable for OLTP (Online Transaction Processing)
OLTP requires a large number of short transactions, which a batch-oriented framework cannot serve.
Introduction to Spark
Apache Spark is a lightning-fast cluster computing framework, designed for fast computation.
Spark can use Hadoop in two ways:
1. for storage
2. for processing
Since Spark has its own cluster management, it typically uses Hadoop for storage only.
Contrary to common belief, Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
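To make the idea concrete, here is a minimal Scala sketch of a batch Spark application – the classic word count. The application name and the file path "input.txt" are placeholders; any text file reachable by the cluster works.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally with all available cores; "WordCount" is the application name.
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split(" "))   // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts per word

    counts.take(10).foreach(println)      // action: triggers the computation
    sc.stop()
  }
}
```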
Features of Apache Spark
Apache Spark has the following features:
Speed – Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk and storing intermediate processing data in memory.
Supports multiple languages – Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
Advanced analytics – Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Ideal Solution for Big Data Analytics
Spark handles batch, streaming, and interactive workloads in a single framework:
Batch – The collection and storage of data, for processing at a scheduled time once a sufficient amount of data has accumulated.
Streaming – Continual processing of data as it streams in.
Interactive – Ad-hoc, exploratory queries executed interactively over the data.
Data Sharing in MapReduce
[Diagram] HDFS → Mapper → Shuffle & Sort → Reducer → HDFS, repeated for each job. Every stage boundary is an IO operation, so each MapReduce job reads its input from disk and writes its output back to disk.
Data Sharing in Apache Spark
In MapReduce, many IO operations take place while processing the data, so it is not good for data-intensive iterative algorithms.
[Diagram] In Spark, data is read from HDFS once (one-time processing) and held in distributed memory; query1, query2, and query3 then run against that in-memory data, 10–100x faster than going through the network and disk.
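As a sketch of the diagram above: the example below assumes a spark-shell session (where `sc` is predefined) and a hypothetical HDFS path. `cache()` keeps the filtered RDD in distributed memory, so the second and third queries avoid re-reading HDFS.

```scala
// One HDFS read, then repeated in-memory queries.
// "hdfs://namenode:8020/logs" is a placeholder path.
val logs = sc.textFile("hdfs://namenode:8020/logs")
  .filter(_.contains("ERROR"))
  .cache()                          // keep the filtered data in distributed memory

val query1 = logs.count()                                  // reads HDFS, fills the cache
val query2 = logs.filter(_.contains("timeout")).count()    // served from memory
val query3 = logs.map(_.length).sum()                      // served from memory
```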
Spark Built on Hadoop
There are three ways Spark can be built with Hadoop components:
[Diagram] Standalone – Spark running directly on top of HDFS; Hadoop 2.x (YARN) – Spark running on YARN over HDFS; Hadoop V1 (SIMR) – Spark running inside MapReduce over HDFS.
Spark Deployment
There are three ways of deploying Spark:
Standalone – In a Spark Standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), with space allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN – Spark simply runs on YARN, without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) – SIMR is used to launch Spark jobs in addition to a standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
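One way to see these deployment options from the application side is through the master URL in `SparkConf`. This is a sketch only: the host name and port are placeholders, and running on YARN additionally requires `HADOOP_CONF_DIR` to point at the cluster configuration.

```scala
import org.apache.spark.SparkConf

// The master URL chooses where the job runs; "master:7077" is a placeholder host.
val localConf      = new SparkConf().setAppName("demo").setMaster("local[*]")            // single machine
val standaloneConf = new SparkConf().setAppName("demo").setMaster("spark://master:7077") // Spark Standalone
val yarnConf       = new SparkConf().setAppName("demo").setMaster("yarn")                // Hadoop YARN
```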
Spark Ecosystem
[Diagram] The Spark stack, from top to bottom:
Programming – Scala, Python, R, Java, and other tools
Library – Spark SQL, MLlib, GraphX, Spark Streaming
Engine – Apache Spark Core Engine
Management – YARN, Mesos, or the built-in Spark Scheduler
Storage – Local filesystem, HDFS, S3, RDBMS, NoSQL
Spark Ecosystem
The components built on top of the Apache Spark Core Engine:
Spark SQL (SQL) – Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
Spark Streaming (Streaming) – Enables analytical and interactive apps for live streaming data.
MLlib (Machine Learning) – A machine learning library built on top of Spark, with support for many machine learning algorithms at speeds up to 100 times faster than MapReduce.
GraphX (Graph Computation) – A graph computation engine (similar to Giraph).
SparkR (R on Spark) – A package for the R language that enables R users to leverage Spark's power from the R shell.
BlinkDB (Approximate SQL) – An approximate query engine that runs over the Core Spark Engine.
Related layers include DataFrames and ML Pipelines, and the stack can be managed by Apache Mesos, YARN, or the Standalone Scheduler.
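As a taste of the library layer, here is a small hedged Spark SQL sketch. It uses `SparkSession`, the Spark 2.x entry point (newer than the SQLContext era shown in some diagrams); "people.json" is a placeholder file with one JSON object per line.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()

// "people.json" is a placeholder input file.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Run a SQL query directly against the registered view.
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```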
Spark Architecture
[Diagram] A Driver Program holding a SparkContext talks to the Cluster Management on the Master Node (Apache Mesos, YARN, or the Standalone Scheduler), which assigns work to Executors on the Worker Nodes; each Executor runs Tasks and holds a Cache.
Spark Architecture – Contd…
Driver Program
The main executable program, from which Spark operations are performed.
Controls and coordinates all operations.
The Driver program is the main class.
Executes parallel operations on a cluster.
Defines RDDs.
Each driver program execution is a “Job”.
Spark Architecture – Contd…
SparkContext
The Driver accesses Spark functionality through a SparkContext object, which:
Represents a connection to the computing cluster.
Is used to build RDDs.
Works with the cluster manager.
Manages executors running on Worker nodes.
Splits jobs into parallel “tasks” and executes them on worker nodes.
Partitions RDDs and distributes them over the cluster.
Collects results and presents them to the Driver program.
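A minimal sketch of the Driver side: the program below builds a `SparkContext`, uses it to create a partitioned RDD, and runs a parallel action. The application name, core count, and partition count are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ContextDemo").setMaster("local[4]")
val sc = new SparkContext(conf)              // the Driver's connection to the cluster

// SparkContext builds RDDs and controls how they are partitioned.
val numbers = sc.parallelize(1 to 1000, 8)   // 8 partitions -> up to 8 parallel tasks
println(numbers.getNumPartitions)            // 8
println(numbers.sum())                       // results are collected back to the Driver
sc.stop()
```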
Spark Modes
Batch mode
A program is scheduled for execution through the scheduler; it runs at periodic intervals and processes data.
Interactive mode
An interactive shell is used to execute Spark commands one by one. The shell acts as the Driver program and provides the SparkContext, and it can run tasks on a cluster.
Streaming mode
An always-running program continuously processes data as it arrives.
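For streaming mode, a minimal sketch along the lines of the standard Spark Streaming network word count: it listens on a local socket (host and port are placeholders) and prints word counts every ten seconds. At least two local threads are needed – one for the receiver, one for processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

// "localhost:9999" is a placeholder; feed it with e.g. `nc -lk 9999`.
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()                                          // print counts for each batch

ssc.start()             // the always-running program begins here
ssc.awaitTermination()
```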
Hadoop VS Spark
Hadoop | Spark
Stores data on disk. | Stores data in memory (RAM).
Commodity hardware can be utilized. | Needs high-end systems with more RAM.
Uses replication to achieve fault tolerance. | Uses a different data storage model (e.g., the RDD) to achieve fault tolerance.
Processing speed is lower due to disk reads/writes. | Up to 100x faster than Hadoop.
Supports only Java. | Supports Java, Python, and Scala; ease of programming is high.
Everything is just Map & Reduce. | Supports Map, Reduce, SQL, streaming, etc.
Data should be in HDFS. | Data can be in HDFS, Cassandra, or HBase; runs on Hadoop, the cloud, Mesos, or standalone.
Ecosystem of Hadoop VS Spark
Batch Processing
Spark batch can be used in place of Hadoop MapReduce.
Structured Data Analysis
Spark SQL can be used instead of Hive QL.
Machine Learning Analysis
MLlib can be used for clustering, recommendation, and classification (see the sketch after this list).
Interactive SQL Analysis
Spark SQL can be used in place of Impala and Hive.
Real-Time Streaming Data Analysis
Spark Streaming can be used in place of a specialized library such as Storm.
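As an illustration of the machine-learning point above, a small hedged MLlib sketch: clustering a few toy 2-D points with K-means. It assumes a spark-shell session where `sc` is predefined.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Four toy points forming two obvious clusters.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
))

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, up to 20 iterations
model.clusterCenters.foreach(println)     // roughly (0.5, 0.5) and (8.5, 8.5)
```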
Spark Use Cases
Companies like NTT Data, Yahoo, Groupon, NASA, Nokia, and more are using Spark to create applications for different use cases, such as:
Stream processing of network machine data.
Performing Big Data analytics for subscriber personalization and profiling in the telecommunications domain.
Building data intelligence and eCommerce solutions in the retail industry.
Introduction to RDD
RDD: Resilient Distributed Dataset.
RDDs are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel. An RDD is a collection of data, shared across a cluster of machines, with built-in fault tolerance.
[Diagram] A chain of calculations, each producing a new RDD: Calculation1 → RDD → Calculation2 → RDD → Calculation3 → RDD.
Spark is built around RDDs: you create, transform, analyse, and store RDDs in Spark.
The Dataset contains a collection of elements of any type: strings, lines, rows, objects, collections.
Data Sharing in Apache Spark
An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
In Spark, datasets are represented as a list of entries, where the list is broken up into many partitions that are each stored on a different machine. Each partition holds a unique subset of the entries in the list.
Spark calls the datasets that it stores “Resilient Distributed Datasets” (RDDs).
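To see the partitioning described above, a small spark-shell sketch (`sc` predefined): `glom()` gathers each partition into an array, so the disjoint subsets become visible.

```scala
// Split six entries across three partitions.
val data = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), 3)

// glom() turns each partition into an array, exposing the subsets.
data.glom().collect().foreach(part => println(part.mkString(", ")))
// Typical output (one line per partition):
//   a, b
//   c, d
//   e, f
```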
What are RDD Traits
The traits of an RDD are:
In Memory – Data can be as large as needed and can stay in memory for as long as it is needed.
Immutable – Read-only data; it can only be transformed into a new RDD.
Lazily Evaluated – Computed only when actions are performed; until then, the RDD is just a definition without data (see the sketch below).
Typed – RDD data is typed, e.g. Int, String.
Parallel – Data processing is done in parallel on each node.
Partitioned – Data in an RDD is split into partitions and distributed across the nodes in the cluster.
Cached – Data can be kept in RAM or on disk.
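The Immutable, Lazily Evaluated, and Cached traits can be seen in a few lines of spark-shell code (`sc` predefined; "data.txt" is a placeholder path):

```scala
// A transformation only defines a new RDD; nothing executes yet.
val upper = sc.textFile("data.txt").map(_.toUpperCase)

upper.cache()            // also lazy: merely marks the RDD for in-memory storage

val n = upper.count()    // first action: the file is read and the map runs
val head = upper.first() // second action: served from the cached data
```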
Agenda for Next Class
 Hadoop VM Installation
 Exploring Hadoop Cluster Modes
 Exploring Hadoop Configuration
 Hadoop Commands – Hands-on
 Executing MapReduce Programs