Naveen P.N
Trainer
Module 01 – Apache Spark Introduction
NPN Training – Training is the essence of success and we are committed to it
www.npntraining.com
Topics for the Module
History of Apache Spark
Batch VS Real-Time Processing
Limitation of MapReduce
Introduction to Apache Spark
Features of Apache Spark
Data Sharing in MapReduce
Data Sharing in Apache Spark
Understanding Spark Deployment
After completing the module, you will be able to understand:
Overview of Spark Eco-System
Overview of Spark Architecture
Understanding Spark Modes
Hadoop VS Spark
Eco-System of Hadoop VS Spark
Spark Use-cases
Introduction to RDD
RDD Traits
History of Spark
2009 – Started at UC Berkeley AMPLab by Matei Zaharia.
2010 – Open sourced under a BSD license.
2013 – The project was donated to the Apache Foundation and the license was changed to Apache 2.0.
2014 – Became an Apache Top-Level Project. Used by the engineering team at Databricks to set a world record in large-scale sorting.
Present – Exists as a next-generation real-time and batch processing framework.
Batch VS Real-Time Processing
The points below compare batch and real-time analytics in enterprise use cases:
Real-Time Processing
1. Data is processed instantaneously, upon data entry or command receipt.
2. Execution must meet stringent response-time constraints.
Example: Fraud detection.
Batch Processing
1. A large group of data/transactions is processed in a single run.
2. Jobs run without any manual intervention.
3. The entire data set is pre-selected and fed in using command-line parameters and scripts.
4. It is used to execute multiple operations, handle heavy data loads, generate reports, and run offline data workflows.
Example: A regular weekly report required for decision making.
Limitation of MapReduce
The limitations of MapReduce in Hadoop are listed below:
Unsuitable for real-time processing
Being batch oriented, it takes minutes to execute jobs, depending on the amount of data and the number of nodes in the cluster.
Unsuitable for trivial operations
For operations like Filter and Join, you might need to rewrite the jobs, which becomes complex because of the key-value pattern.
Unfit for large data on the network
Although it works on the data locality principle, it cannot efficiently process jobs that require a lot of data to be shuffled over the network.
Limitation of MapReduce Contd…
Unfit for graph processing
Graph processing requires a library such as Apache Giraph, which adds complexity on top of MapReduce.
Unfit for iterative execution
Being stateless in execution, MapReduce does not fit use cases like K-means that need iterative execution.
Unsuitable for OLTP (Online Transaction Processing)
OLTP requires a large number of short transactions, which a batch-oriented framework cannot serve.
Introduction to Spark
Apache Spark is a lightning-fast cluster computing framework, designed for fast computation.
Spark can use Hadoop in two ways:
1. for storage
2. for processing
Since Spark has its own cluster management, it typically uses Hadoop for storage only.
Contrary to common belief, Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
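To make the idea concrete, here is a minimal Scala sketch of a batch Spark application – the classic word count. The application name and the file path "input.txt" are placeholders; any text file reachable by the cluster works.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally with all available cores; "WordCount" is the application name.
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split(" "))   // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts per word

    counts.take(10).foreach(println)      // action: triggers the computation
    sc.stop()
  }
}
```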
Features of Apache Spark
Apache Spark has the following features:
Speed – Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk and storing intermediate processing data in memory.
Supports multiple languages – Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
Advanced analytics – Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Ideal Solution for Big Data Analytics
Spark handles batch, streaming, and interactive workloads in a single framework:
Batch – The collection and storage of data, for processing at a scheduled time once a sufficient amount of data has accumulated.
Streaming – Continual processing of data as it streams in.
Interactive – Ad-hoc, exploratory queries executed interactively over the data.
Data Sharing in MapReduce
[Diagram] HDFS → Mapper → Shuffle & Sort → Reducer → HDFS, repeated for each job. Every stage boundary is an IO operation, so each MapReduce job reads its input from disk and writes its output back to disk.
Data Sharing in Apache Spark
In MapReduce, many IO operations take place while processing the data, so it is not good for data-intensive iterative algorithms.
[Diagram] In Spark, data is read from HDFS once (one-time processing) and held in distributed memory; query1, query2, and query3 then run against that in-memory data, 10–100x faster than going through the network and disk.
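As a sketch of the diagram above: the example below assumes a spark-shell session (where `sc` is predefined) and a hypothetical HDFS path. `cache()` keeps the filtered RDD in distributed memory, so the second and third queries avoid re-reading HDFS.

```scala
// One HDFS read, then repeated in-memory queries.
// "hdfs://namenode:8020/logs" is a placeholder path.
val logs = sc.textFile("hdfs://namenode:8020/logs")
  .filter(_.contains("ERROR"))
  .cache()                          // keep the filtered data in distributed memory

val query1 = logs.count()                                  // reads HDFS, fills the cache
val query2 = logs.filter(_.contains("timeout")).count()    // served from memory
val query3 = logs.map(_.length).sum()                      // served from memory
```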
Spark Built on Hadoop
There are three ways Spark can be built with Hadoop components:
[Diagram] Standalone – Spark running directly on top of HDFS; Hadoop 2.x (YARN) – Spark running on YARN over HDFS; Hadoop V1 (SIMR) – Spark running inside MapReduce over HDFS.
Spark Deployment
There are three ways of deploying Spark:
Standalone – In a Spark Standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), with space allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN – Spark simply runs on YARN, without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) – SIMR is used to launch Spark jobs in addition to a standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
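One way to see these deployment options from the application side is through the master URL in `SparkConf`. This is a sketch only: the host name and port are placeholders, and running on YARN additionally requires `HADOOP_CONF_DIR` to point at the cluster configuration.

```scala
import org.apache.spark.SparkConf

// The master URL chooses where the job runs; "master:7077" is a placeholder host.
val localConf      = new SparkConf().setAppName("demo").setMaster("local[*]")            // single machine
val standaloneConf = new SparkConf().setAppName("demo").setMaster("spark://master:7077") // Spark Standalone
val yarnConf       = new SparkConf().setAppName("demo").setMaster("yarn")                // Hadoop YARN
```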
Spark Ecosystem
[Diagram] The Spark stack, from top to bottom:
Programming – Scala, Python, R, Java, and other tools
Library – Spark SQL, MLlib, GraphX, Spark Streaming
Engine – Apache Spark Core Engine
Management – YARN, Mesos, or the built-in Spark Scheduler
Storage – Local filesystem, HDFS, S3, RDBMS, NoSQL
Spark Ecosystem
The components built on top of the Apache Spark Core Engine:
Spark SQL (SQL) – Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
Spark Streaming (Streaming) – Enables analytical and interactive apps for live streaming data.
MLlib (Machine Learning) – A machine learning library built on top of Spark, with support for many machine learning algorithms at speeds up to 100 times faster than MapReduce.
GraphX (Graph Computation) – A graph computation engine (similar to Giraph).
SparkR (R on Spark) – A package for the R language that enables R users to leverage Spark's power from the R shell.
BlinkDB (Approximate SQL) – An approximate query engine that runs over the Core Spark Engine.
Related layers include DataFrames and ML Pipelines, and the stack can be managed by Apache Mesos, YARN, or the Standalone Scheduler.
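As a taste of the library layer, here is a small hedged Spark SQL sketch. It uses `SparkSession`, the Spark 2.x entry point (newer than the SQLContext era shown in some diagrams); "people.json" is a placeholder file with one JSON object per line.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()

// "people.json" is a placeholder input file.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Run a SQL query directly against the registered view.
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```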
Spark Architecture
[Diagram] A Driver Program holding a SparkContext talks to the Cluster Management on the Master Node (Apache Mesos, YARN, or the Standalone Scheduler), which assigns work to Executors on the Worker Nodes; each Executor runs Tasks and holds a Cache.
Spark Architecture – Contd…
Driver Program
The main executable program, from which Spark operations are performed.
Controls and coordinates all operations.
The Driver program is the main class.
Executes parallel operations on a cluster.
Defines RDDs.
Each driver program execution is a “Job”.
Spark Architecture – Contd…
SparkContext
The Driver accesses Spark functionality through a SparkContext object, which:
Represents a connection to the computing cluster.
Is used to build RDDs.
Works with the cluster manager.
Manages executors running on Worker nodes.
Splits jobs into parallel “tasks” and executes them on worker nodes.
Partitions RDDs and distributes them over the cluster.
Collects results and presents them to the Driver program.
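A minimal sketch of the Driver side: the program below builds a `SparkContext`, uses it to create a partitioned RDD, and runs a parallel action. The application name, core count, and partition count are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ContextDemo").setMaster("local[4]")
val sc = new SparkContext(conf)              // the Driver's connection to the cluster

// SparkContext builds RDDs and controls how they are partitioned.
val numbers = sc.parallelize(1 to 1000, 8)   // 8 partitions -> up to 8 parallel tasks
println(numbers.getNumPartitions)            // 8
println(numbers.sum())                       // results are collected back to the Driver
sc.stop()
```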
Spark Modes
Batch mode
A program is scheduled for execution through the scheduler; it runs at periodic intervals and processes data.
Interactive mode
An interactive shell is used to execute Spark commands one by one. The shell acts as the Driver program and provides the SparkContext, and it can run tasks on a cluster.
Streaming mode
An always-running program continuously processes data as it arrives.
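For streaming mode, a minimal sketch along the lines of the standard Spark Streaming network word count: it listens on a local socket (host and port are placeholders) and prints word counts every ten seconds. At least two local threads are needed – one for the receiver, one for processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

// "localhost:9999" is a placeholder; feed it with e.g. `nc -lk 9999`.
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()                                          // print counts for each batch

ssc.start()             // the always-running program begins here
ssc.awaitTermination()
```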
Hadoop VS Spark
Hadoop | Spark
Stores data on disk. | Stores data in memory (RAM).
Commodity hardware can be utilized. | Needs high-end systems with more RAM.
Uses replication to achieve fault tolerance. | Uses a different data storage model (e.g., the RDD) to achieve fault tolerance.
Processing speed is lower due to disk reads/writes. | Up to 100x faster than Hadoop.
Supports only Java. | Supports Java, Python, and Scala; ease of programming is high.
Everything is just Map & Reduce. | Supports Map, Reduce, SQL, streaming, etc.
Data should be in HDFS. | Data can be in HDFS, Cassandra, or HBase; runs on Hadoop, the cloud, Mesos, or standalone.
Ecosystem of Hadoop VS Spark
Batch Processing
Spark batch can be used in place of Hadoop MapReduce.
Structured Data Analysis
Spark SQL can be used instead of Hive QL.
Machine Learning Analysis
MLlib can be used for clustering, recommendation, and classification (see the sketch after this list).
Interactive SQL Analysis
Spark SQL can be used in place of Impala and Hive.
Real-Time Streaming Data Analysis
Spark Streaming can be used in place of a specialized library such as Storm.
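As an illustration of the machine-learning point above, a small hedged MLlib sketch: clustering a few toy 2-D points with K-means. It assumes a spark-shell session where `sc` is predefined.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Four toy points forming two obvious clusters.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
))

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, up to 20 iterations
model.clusterCenters.foreach(println)     // roughly (0.5, 0.5) and (8.5, 8.5)
```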
Spark Use Cases
Companies like NTT Data, Yahoo, Groupon, NASA, Nokia, and more are using Spark to create applications for different use cases, such as:
Stream processing of network machine data.
Performing Big Data analytics for subscriber personalization and profiling in the telecommunications domain.
Building data intelligence and eCommerce solutions in the retail industry.
Introduction to RDD
RDD: Resilient Distributed Dataset.
RDDs are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel. An RDD is a collection of data, shared across a cluster of machines, with built-in fault tolerance.
[Diagram] A chain of calculations, each producing a new RDD: Calculation1 → RDD → Calculation2 → RDD → Calculation3 → RDD.
Spark is built around RDDs: you create, transform, analyse, and store RDDs in Spark.
The Dataset contains a collection of elements of any type: strings, lines, rows, objects, collections.
Data Sharing in Apache Spark
An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
In Spark, datasets are represented as a list of entries, where the list is broken up into many partitions that are each stored on a different machine. Each partition holds a unique subset of the entries in the list.
Spark calls the datasets that it stores “Resilient Distributed Datasets” (RDDs).
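To see the partitioning described above, a small spark-shell sketch (`sc` predefined): `glom()` gathers each partition into an array, so the disjoint subsets become visible.

```scala
// Split six entries across three partitions.
val data = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), 3)

// glom() turns each partition into an array, exposing the subsets.
data.glom().collect().foreach(part => println(part.mkString(", ")))
// Typical output (one line per partition):
//   a, b
//   c, d
//   e, f
```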
What are RDD Traits
The traits of an RDD are:
In Memory – Data can be as large as needed and can stay in memory for as long as it is needed.
Immutable – Read-only data; it can only be transformed into a new RDD.
Lazily Evaluated – Computed only when actions are performed; until then, the RDD is just a definition without data (see the sketch below).
Typed – RDD data is typed, e.g. Int, String.
Parallel – Data processing is done in parallel on each node.
Partitioned – Data in an RDD is split into partitions and distributed across the nodes in the cluster.
Cached – Data can be kept in RAM or on disk.
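The Immutable, Lazily Evaluated, and Cached traits can be seen in a few lines of spark-shell code (`sc` predefined; "data.txt" is a placeholder path):

```scala
// A transformation only defines a new RDD; nothing executes yet.
val upper = sc.textFile("data.txt").map(_.toUpperCase)

upper.cache()            // also lazy: merely marks the RDD for in-memory storage

val n = upper.count()    // first action: the file is read and the map runs
val head = upper.first() // second action: served from the cached data
```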
Agenda for Next Class
 Hadoop VM Installation
 Exploring Hadoop Cluster Modes
 Exploring Hadoop Configuration
 Hadoop Commands – Hands-on
 Executing MapReduce Programs