Apache Spark on HDInsight Training

Build COMPETENCY
across your TEAM
Apache Spark
On HDInsight
Chandrashekhar Deshpande

Module 1
Introduction to Apache Spark

Apache Spark Overview
• An engine for processing big data in a fast (faster than MapReduce), easy and highly scalable way
• An open source, parallel, in-memory cluster computing framework
• A solution for loading, processing and analyzing large-scale data end to end
• Iterative and interactive: APIs for Scala, Java, Python and R, plus a command-line interface
• Stream processing (real-time streams and DStreams)
• Unifies batch processing, streaming and machine learning on big data
• Widely used, for example by Amazon, eBay and Yahoo
• Works well with Apache Kafka, ZeroMQ, Cassandra etc.
• A powerful platform for implementing Lambda and Kappa architectures

Spark Evolution
• The most recent Spark release is 2.3
• We will work with 2.1.1

Spark - Benefits
Performance
• Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests).
• Can be used for batch and real-time data processing.
Developer Productivity
• Easy-to-use APIs for processing large datasets.
• Includes 100+ operators for transforming data.
Ecosystem
• Built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.
• Runs on top of the Apache YARN resource manager.
Unified Engine
• Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.
• A single application can combine all types of processing.

Spark is fast
Spark is the current (2014) Sort Benchmark winner.
3x faster than 2013 winner (Hadoop).
tinyurl.com/spark-sort

… especially for iterative applications
Figure: logistic regression on a 100-node cluster with 100 GB of data, comparing Hadoop and Spark 0.9 (bar chart, vertical axis from 0 to 140).

Tuples of MR Vs. RDDs of Spark
Figure: tuples in MapReduce compared with RDDs in Spark.

Spark in-memory
• Spark performs all intermediate steps in memory.
• Faster execution, with fewer reads and writes to secondary storage.
• Memory intensive.
• The in-memory objects are called Resilient Distributed Datasets (RDDs).
• RDDs are partitioned in-memory objects spread across multiple worker machines; lost partitions are rebuilt from lineage, and replicas are kept only when a replicated storage level is chosen.

A unified Framework
A unified, open source, parallel data processing framework for Big Data Analytics
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation
• Spark Core Engine
• Cluster managers: YARN, Mesos, Standalone Scheduler

Advantages of Unified Platform
Spark Streaming
Machine
Learning
Spark SQL

Spark Framework
• Programming interfaces: Scala, Java, Python, R
• Libraries: SQL, MLlib, GraphX, Streaming
• Engine: Spark Core
• Platforms: YARN, Mesos, Standalone Scheduler
• Storage: Local, HDFS, HBase, RDBMS, NoSQL, AWS S3, Azure Blob Storage / Data Lake Store

Spark Modes
• Batch mode: A program run periodically by a scheduler to process data.
• Interactive mode: Execute Spark commands through the Spark interactive shell.
• The shell provides a default Spark context and acts as the driver program. The Spark context runs tasks on the cluster.
• Stream mode: Process streaming data in real time.

Spark Scalability
Single Cluster, Stand-alone, Single Box
• All components (driver, executors) run within the same JVM.
• Partitions data across multiple cores.
• Runs in single-threaded mode.
Managed Cluster
• Can scale from 2 to 1000 nodes.
• Can use different cluster managers such as YARN or Mesos.
• Partitions data across all nodes.

Area of applicability
• Data Integration and ETL
• Interactive Analytics
• High performance batch and micro-batch computations
• Advanced and complex analytics
• Machine learning
• Real time stream processing including IoT
• Examples:
• Market trends and patterns
• Predicting sales
• Credit card fraud detection
• Network intrusion detection
• Advertisement targeting
• Customer 360 analysis

Spark – Use cases
• Data Integration and ETL: Cleansing and combining data from diverse sources. Users: Palantir (data analytics platform).
• Interactive analytics: Gain insight from massive data sets in ad hoc investigations or regularly planned dashboards. Users: Goldman Sachs (analytics platform), Huawei (query platform in the telecom sector).
• High performance batch computation: Run complex algorithms against large scale data. Users: Novartis (genomic research), MyFitnessPal (processing food data).
• Machine Learning: Predict outcomes to make decisions based on input data. Users: Alibaba (marketplace analysis), Spotify (music recommendation).
• Real-time stream processing: Capturing and processing data continuously with low latency and high reliability. Users: Netflix (recommendation engine), British Gas (connected homes).

Module 2
Spark Architecture and Basics

Spark Architecture
• Spark Master: Manages a number of applications. In HDInsight, it also manages resources at the cluster level.
• Spark Driver: One per application; manages the workflow of that application.
• Spark Context: Created by the driver; keeps track of RDDs and their metadata. Provides the API for using Spark's features.
• Worker Node: Reads data from and writes data to HDFS/storage.

Spark Execution components
• Driver Program
• The main initiating program, in which Spark operations are defined. It is executed on the master node.
• It controls and coordinates all operations.
• It defines RDDs.
• Each driver program execution is a ‘Job’.

Spark Execution components
• Spark Context
• Provides access to Spark functionality.
• Represents the connection to the computing cluster.
• Builds, partitions and distributes RDDs across the cluster.
• Works with the cluster manager.
• Splits a job into parallel tasks and executes them on worker nodes.
• Collects and accumulates results and presents them to the driver program.

Resilient Distributed Datasets
• Most Spark operations work with RDDs: we create, transform, analyze and store RDDs.
• RDDs are fast to access because they are stored in memory.
• Partitioned and distributed: the data is divided into parts, and each part resides on a node of the cluster.
• The data sets are made up of strings, rows, objects or collections.
• They are immutable.
• To change one, apply a transformation, which creates a new RDD.
• They can be cached and persisted.
• Actions produce summarized results.
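The partitioning and immutability of RDDs can be pictured with a small JDK-only sketch. It is an analogy and does not use the Spark API; the class name PartitionSketch, the helpers partition and doubled, and the sample numbers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// JDK-only analogy, not Spark code: a list is split into read-only partitions
// that are transformed independently, much as executors work on RDD partitions.
class PartitionSketch {

    // Split the data into roughly equal, unmodifiable partitions.
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be at least 1");
        }
        List<List<T>> parts = new ArrayList<>();
        int size = (data.size() + numPartitions - 1) / numPartitions; // ceiling division
        for (int start = 0; start < data.size(); start += size) {
            int end = Math.min(start + size, data.size());
            parts.add(Collections.unmodifiableList(new ArrayList<>(data.subList(start, end))));
        }
        return parts;
    }

    // A "transformation" never edits a partition in place; it builds new partitions.
    static List<List<Integer>> doubled(List<List<Integer>> parts) {
        return parts.parallelStream()
                .map(p -> p.stream().map(x -> x * 2).collect(Collectors.toList()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> source = partition(Arrays.asList(1, 2, 3, 4, 5, 6, 7), 3);
        List<List<Integer>> result = doubled(source);
        System.out.println("source partitions: " + source);
        System.out.println("result partitions: " + result);
    }
}
```

The source partitions are untouched after the transformation, which is the property that lets Spark recompute a lost partition from its inputs.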

Demo 1
TestSpark010_FirstProgram.java
Aim: This program basically demonstrates...
1. Configuring Spark context.
2. Using Resilient Distributed Datasets.
3. Introduction to Mappings, Filtering, Actions etc.
4. Creating String RDD from CSV text file.
5. Applying simple Map to convert strings to Upper case
6. Browser monitoring of Spark Context.

Spark architecture
Master
Node
Worker
Nodes

Module 3
Working with RDDs and Paired RDDs in Spark

Loading data.
• RDD Loading sources…
• Text files
• JSON files
• Sequence files of HDFS
• Parallelize on collection
• Java Collection
• Python Lists
• R data frames
• RDBMS/NoSQL
• Use the direct API
• Bring the data in first as a collection (DAO classes) and then create an RDD.
• Very large data sets
• Load the data into HDFS outside Spark, for example with Apache Sqoop, and then create RDDs from it.
• Spark aligns its own partitioning with the partitioning of Hadoop.

Storing data
• RDDs can be stored in a variety of data sinks.
• Text files
• JSON
• Sequence Files
• Collection
• RDBMS/NoSQL
• For persistence
• Spark Utilities
• Language specific support
Spark's power lies in processing data in a distributed manner.
Although Spark provides APIs for loading and sinking data,
that is not where its real power lies. External tools
are recommended for simplicity and performance.

Lazy Evaluation
• Spark does not load or transform data until an action is encountered.
• Step 1: Load the file content into an RDD.
• Step 2: Apply a filter.
• Step 3: Count the number of records (only now are steps 1 to 3 executed).
• This holds even in interactive mode.
• Lazy evaluation helps Spark optimize operations and manage resources better.
• It makes troubleshooting harder: a problem in loading is detected only when the action executes.
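The three steps above can be imitated with java.util.stream, whose pipelines are also lazy. This is a JDK-only analogy rather than Spark code; the class LazyEvaluationSketch, its log field and the "ERROR" prefix criterion are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// JDK-only analogy, not Spark code: nothing in the pipeline runs until the
// terminal operation, which plays the role of a Spark action.
class LazyEvaluationSketch {

    // Records which steps actually ran, in order.
    static final List<String> log = new ArrayList<>();

    static long countErrors(List<String> lines) {
        log.clear();
        // Steps 1 and 2: describe the load and the filter. Nothing executes yet.
        Stream<String> errors = lines.stream()
                .peek(line -> log.add("load " + line))
                .filter(line -> line.startsWith("ERROR"));
        log.add("pipeline defined");
        // Step 3: the terminal operation triggers steps 1 to 3 together.
        return errors.count();
    }

    public static void main(String[] args) {
        long n = countErrors(Arrays.asList("ERROR disk", "INFO ok", "ERROR net"));
        System.out.println("errors = " + n);
        System.out.println("execution order = " + log);
    }
}
```

The log shows "pipeline defined" before any "load" entry, which mirrors why a loading problem in Spark only surfaces when the action runs.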

Transformations
• Recall: RDDs are immutable.
• Transformation: An operation on an RDD that creates a new RDD.
• Examples: map, flatMap, filter etc.
• Operate on one element at a time.
• Evaluated lazily.
• Distributed across multiple nodes; each executor in the cluster runs them independently on its local part of the RDD.
• Each executor creates its own subset of the resulting RDD.

Transformations: Maps
JavaRDD<String> mutateRDD1 = autoAllData.map(function);
• Corresponds to the map step of Hadoop MapReduce.
• Element-level computation or transformation.
• The result RDD has the same number of elements as the source RDD.
• The result type may be different.
• In Java, it accepts a lambda expression or an anonymous class.
• Scala/Python: an inline function or a function reference is allowed.
• Use cases:
• Data standardization, for example of names
• Data type conversion, e.g. from String to a custom object
• Data computation such as tax calculations
• Adding new attributes, such as a calculated grade
• Data checking, cleansing etc.
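Two of the use cases listed, standardization and type conversion, are sketched below with JDK streams. This is an analogy, not the Spark API: with Spark, lambdas of the same shape would be passed to map on a JavaRDD. The Car class, the two-column CSV layout and the sample lines are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// JDK-only analogy, not Spark code: map produces exactly one output element
// per input element, and the output type may differ from the input type.
class MapSketch {

    // Hypothetical record type used to show a String-to-object conversion.
    static final class Car {
        final String make;
        final double price;

        Car(String make, double price) {
            this.make = make;
            this.price = price;
        }
    }

    // Data standardization: same element type, same element count.
    static List<String> upperCase(List<String> lines) {
        return lines.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // Data type conversion: assumes each line is "make,price".
    static List<Car> parse(List<String> csvLines) {
        return csvLines.stream()
                .map(line -> line.split(","))
                .map(cols -> new Car(cols[0].trim(), Double.parseDouble(cols[1].trim())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("audi, 100.0", "bmw, 200.0");
        System.out.println(upperCase(lines));
        System.out.println(parse(lines).get(1).make + " costs " + parse(lines).get(1).price);
    }
}
```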

Transformations: Filters
JavaRDD<String> mutateRDD1 = autoAllData.filter(function);
• Selects the elements of an RDD that pass a given criterion.
• The resulting RDD is usually smaller than the original because some records are eliminated.
• filter() takes a function that returns a Boolean value.
• In Java, it accepts a lambda expression (predicate) or an anonymous class.
• Scala/Python: an inline function or a function reference is allowed.
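A Boolean-returning criterion can be sketched with a JDK Predicate. This is an analogy only: Spark's Java filter takes Spark's own Function interface returning Boolean rather than java.util.function.Predicate, although a lambda of the same shape fits both. The class FilterSketch, the constant IS_TOYOTA and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// JDK-only analogy, not Spark code: filter keeps the elements for which the
// criterion returns true and drops the rest.
class FilterSketch {

    // Hypothetical criterion: keep lines that mention "toyota" in any letter case.
    static final Predicate<String> IS_TOYOTA = line -> line.toLowerCase().contains("toyota");

    static List<String> keep(List<String> lines, Predicate<String> criterion) {
        return lines.stream().filter(criterion).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("Toyota,sedan", "Audi,suv", "toyota,hatchback");
        List<String> kept = keep(all, IS_TOYOTA);
        System.out.println(kept.size() + " of " + all.size() + " kept: " + kept);
    }
}
```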

Actions
• Act on the entire RDD to reduce it to a precise, consolidated result.
• Examples: max/min, summarization
• Spark runs all the lazily deferred processing when it encounters an action.
• Simple actions
• collect(): Converts the RDD into a collection.
• count(): Counts the number of elements in the RDD.
• first(): Returns the first element (a String for a String RDD).
• take(n): Returns the first ‘n’ elements as a List.
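The simple actions listed above have close counterparts on ordinary Java collections, which the following JDK-only sketch uses to show what each one returns. It is not Spark code; the class ActionSketch and its sample data are hypothetical, and a real action would pull results from the cluster into the driver.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// JDK-only analogy, not Spark code: each method reduces the whole data set
// to a small result, as a Spark action does.
class ActionSketch {

    // Counterpart of count(): number of elements.
    static long count(List<Integer> data) {
        return data.stream().count();
    }

    // Counterpart of first(): the first element (fails on empty data).
    static int first(List<Integer> data) {
        return data.get(0);
    }

    // Counterpart of take(n): the first n elements as a List.
    static List<Integer> take(List<Integer> data, int n) {
        return data.stream().limit(n).collect(Collectors.toList());
    }

    // Summarization examples: maximum and sum.
    static int max(List<Integer> data) {
        return Collections.max(data);
    }

    static int sum(List<Integer> data) {
        return data.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(4, 8, 15, 16, 23, 42);
        System.out.println("count=" + count(data) + " first=" + first(data)
                + " take(2)=" + take(data, 2) + " max=" + max(data) + " sum=" + sum(data));
    }
}
```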

Module 4
Apache Spark on Azure HDInsight

Apache Spark on HDInsight
Azure Storage
Azure Data Lake Store
Hive and HBase
Azure Data Factory
Event Hub
Spark
Apache Kafka
Apache Flume
Orchestration

Spark Support on HDInsight
• SLA: 99.9% uptime.
• Ease of creating a cluster: Possible using the Azure Portal, Azure PowerShell or the HDInsight .NET SDK.
• Ease of use: Jupyter and Zeppelin notebooks are provided for interactive data processing and visualization.
• REST APIs: Spark clusters in HDInsight include Livy, a REST API based Spark job server for remotely submitting and monitoring jobs.
• Azure Data Lake: The Azure Data Lake Store can be used as primary storage (HDInsight 3.5 onwards) or as additional storage.
• Integration with Azure services: Provides connectors for Azure Event Hub and Kafka.
• R Server: R Server can be set up to run R computations.
• Concurrent queries: Supports concurrent queries, so that multiple queries from one user, or from various users and applications, share the same cluster resources.
• SSD cache: Data can be cached either in memory or on SSD for better performance.
• BI tools integration: Connectors are available for Power BI and Tableau for data analytics.
• Machine learning libraries: 200 preloaded Anaconda libraries for machine learning, data analysis and visualization.
• Scalability: The cloud's prominent feature.

Azure Data Lake Store
• Enterprise-wide hyper-scale repository for big data analytic workloads.
• Can capture data of any size, type, and ingestion speed in one single place
for operational and exploratory analytics.
• Hadoop accesses Data Lake Store through Web-HDFS REST API.
• Is tuned for performance for data analytics scenarios.
• Supports all enterprise-grade capabilities—security, manageability,
scalability, reliability, and availability.
• Stores variety of data in native format without transformation. Can handle
structured, semi-structured, and unstructured data.

Azure Storage Vs. Azure Data Lake
• Purpose. Data Lake Store: hyper-scale repository for big data analytics workloads. Blob Storage: general purpose scalable object store.
• Use cases. Data Lake Store: batch, interactive and streaming analytics and machine learning data such as log files, IoT data, click streams and large datasets. Blob Storage: any type of text or binary data, such as application back ends, backup data, media storage for streaming and general purpose data.
• Structure. Data Lake Store: folders containing data in files. Blob Storage: containers containing data in blobs.
• Namespace. Data Lake Store: hierarchical file system. Blob Storage: object store with a flat namespace.
• Authentication. Data Lake Store: Azure AD identity. Blob Storage: account access keys and shared access signature keys.
• Analytics performance. Data Lake Store: optimized for parallel analytical workloads. Blob Storage: not optimized for analytics workloads.
• Size limit. Data Lake Store: no limit on file size or number of files. Blob Storage: specific limits as mentioned in the documentation.
• Geo-redundancy. Data Lake Store: locally redundant. Blob Storage: locally redundant, geo-redundant, or read-access geo-redundant.

Azure Data Factory
• A cloud data integration service, to compose data storage,
movement, and processing services into automated data
pipelines.
• Can handle ETL and complex hybrid ETL.
• Allows you to create data-driven workflows in the cloud for
orchestrating and automating data movement and data
transformation.

Azure Event Hub
• A highly scalable ingestion system.
• Can ingest millions of events per second, enabling an application to process
and analyze the massive amounts of data produced by your connected
devices and applications.
• Acts as an event ingestor, an intermediary between event
publishers and event consumers.
• Decouples the production of event streams from the mechanism that consumes them.
• Enables behavior tracking in mobile apps, traffic information from web
farms, in-game event capture in console games, or telemetry collected from
industrial machines, connected vehicles, or other devices.

Demo 3.1 : Spark cluster on HDInsight

Creating a HDInsight Spark Cluster

HDInsight Spark Resource Manager
The Resource Manager enables you to control the number of cores and amount of memory allocated to Spark cluster components and notebooks.
Increasing the resources allocated to the Thrift Server can potentially improve performance with BI tools.

Resizing a HDInsight Spark Cluster

Notebooks: Jupyter and Zeppelin

HDInsight Spark: Jupyter Notebooks
A Jupyter notebook showing a Spark program in Python 2

HDInsight Spark: Zeppelin Notebooks
A Zeppelin notebook must be connected to a Spark cluster to run.
A notebook ‘paragraph’ can be executed by clicking its run icon.
Like Jupyter, Zeppelin enables interactive charts and graphs to be easily included in a notebook.
You can control the visualization using the “Settings” drop-down menu.
A number of charts and graphs are already built into the Zeppelin notebook.

The Visualization tools: PowerBI

Module 5
Apache Spark SQL and Data Frames

Overview
Spark SQL
• A library built on Spark to support SQL-like operations.
• Hides RDDs behind a simpler API.
• Traditional RDBMS developers can easily transition to Big Data.
• Works with structured data that has a schema.
• Seamlessly mixes SQL queries with Spark programs.
• Supports JDBC.
• Works with both RDBMS and NoSQL sources.

Data Frames
Spark Session
• Plays the role for Data Frames that SparkContext plays for RDDs.
• Provides Data Frames and temp tables.
Data Frames
• RDDs are for Spark Core while Data Frames are for Spark SQL.
• Built upon RDDs
• It’s a distributed collection of data organized as Rows and Columns.
• Has a schema with column names and column types.
• Interoperability with…
• Collections, CSV, databases, Hive/NoSQL tables, JSON, RDDs etc.

Operations on Data Frames
• filter: like a ‘where’ clause.
• join: like joins in SQL.
• groupBy: groups rows for consolidation.
• agg: computes aggregations such as sum and average.
• Mapping and reducing are allowed.
• Operations can be nested.
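The way filter, groupBy and agg chain together can be sketched on an in-memory list with JDK collectors. This is an analogy, not the Spark SQL API; the Row class with dept and salary columns, and the method avgSalaryByDept, are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// JDK-only analogy, not Spark code: a where-style filter followed by a
// group-by and an average, the pattern that Data Frame operations express.
class DataFrameOpsSketch {

    // Hypothetical row with two columns.
    static final class Row {
        final String dept;
        final double salary;

        Row(String dept, double salary) {
            this.dept = dept;
            this.salary = salary;
        }
    }

    // filter (where salary >= minSalary), then groupBy dept, then agg(avg(salary)).
    static Map<String, Double> avgSalaryByDept(List<Row> rows, double minSalary) {
        return rows.stream()
                .filter((Row r) -> r.salary >= minSalary)
                .collect(Collectors.groupingBy(
                        (Row r) -> r.dept,
                        TreeMap::new,
                        Collectors.averagingDouble((Row r) -> r.salary)));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("eng", 100.0), new Row("eng", 200.0),
                new Row("ops", 50.0), new Row("ops", 10.0));
        System.out.println(avgSalaryByDept(rows, 20.0));
    }
}
```

Nesting the filter inside the grouped aggregation in one expression is the same style of composition that the slide calls operations nesting.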

Spark SQL
• Spark SQL :
• Not intended for interactive/exploratory analysis.
• Spark SQL reuses the Hive frontend and meta-store.
• Gives full compatibility with existing Hive data, queries, and UDFs.
• Spark SQL includes a cost-based optimizer, columnar storage and code generation to make
queries fast.
• It scales to thousands of nodes and multi-hour queries using the Spark engine. Performance
is its biggest advantage.
• Provides full mid-query fault tolerance.

Module 6
Apache Spark Streaming

What is Streaming?
Figure: a streaming engine runs standing queries (query logic) as an application at runtime. Input adapters receive events from sources (devices, sensors, web servers); output adapters deliver results to targets (pagers and monitoring devices, KPI dashboards).

Why Spark Streaming?
• One of the real strengths of Spark
• Analytics is typically performed on data at rest:
• Databases, flat files, historical data, survey data etc.
• Real-time analytics is performed on data the moment it is generated:
• Complex event processing, fraud detection, click stream processing etc.
• What can Spark Streaming do?
• Look at data the moment it arrives from the source
• Transform, summarize, analyze
• Perform machine learning
• Predict in real time

Spark Streaming and MicroBatch Processing

Spark Streaming
Credit card fraud detection with high scalability and parallelism.
Spam filtering
Network intrusion detection
Real time social media analytics
Click Stream analytics
Stock market analysis
Advertising analytics

Spark Streaming architecture
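Figure: the master node hosts the driver program with the Spark context and stream context, which work with the cluster manager. One worker node runs an executor with a long-running receiver task attached to the input source; the other worker nodes run executors with tasks and a cache.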

Spark Streaming architecture
The master node runs the driver program with the Spark context.
A streaming context is created from the Spark context.
One of the worker nodes is assigned a long-running task that listens to a source.
The receiver keeps receiving data from the input source.
The receiver propagates the data to the worker nodes.
The normal tasks on the worker nodes act upon the data.

The DStream
• Discretized stream
• Created from the stream context
• A micro-batch window is set up for the DStream (normally in seconds)
• The micro-batch window is a small time slice (around 3 sec) in which the real-time data generated is accumulated as a batch and wrapped in an RDD; the DStream is the resulting sequence of RDDs.
• The DStream allows all RDD operations.
• Common data can be shared through global variables.
The DStream windowing functions
• They compute across multiple batches of a DStream.
• All RDD functions can be applied to the data accumulated from the last X
batches.
• Ex: Accumulate the last 3 batches together, or average some value over the last 5
batches.
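Accumulating the last few batches can be sketched as a sliding sum over per-batch values. This is a JDK-only illustration, not the Spark Streaming API; the class WindowSketch, the method windowedSums and its parameters (window length and slide, both counted in batches) are hypothetical, and as a simplification it only emits windows that are completely filled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// JDK-only illustration, not Spark code: a window covers the last
// windowLength batches and moves forward by slide batches each time.
class WindowSketch {

    static List<Integer> windowedSums(List<Integer> batchValues, int windowLength, int slide) {
        if (windowLength < 1 || slide < 1) {
            throw new IllegalArgumentException("windowLength and slide must be positive");
        }
        List<Integer> sums = new ArrayList<>();
        // 'end' is the exclusive index of the newest batch inside the window.
        for (int end = windowLength; end <= batchValues.size(); end += slide) {
            int sum = 0;
            for (int i = end - windowLength; i < end; i++) {
                sum += batchValues.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Integer> eventsPerBatch = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println("last 3 batches, sliding by 1: " + windowedSums(eventsPerBatch, 3, 1));
        System.out.println("last 3 batches, sliding by 2: " + windowedSums(eventsPerBatch, 3, 2));
    }
}
```

Dividing each sum by the window length would give the "average over the last X batches" variant mentioned above.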

Windowing in Spark
Figure: a timeline of seconds 1 to 12, with the first 8 sec window and the second 8 sec window marked.

Windowing Sample Code
Labels on the sample code: Callback Operation, Window Length, Sliding Interval, Window Operation

“Exactly once”, “At Least Once” Guarantees

Streaming with Azure Event Hub
Pipeline: Azure Event Hub to HDInsight Spark Streaming to Power BI

Module 7
Apache Spark : Analytics and Machine Learning

Types of Analytics
Descriptive Analytics:
• Defining the problem statement. What exactly happened.
Exploratory Data Analytics:
• Why something is happening.
Inferential Analytics:
• Understand the population from a sample. Take a sample and extrapolate to the whole population.
Predictive Analytics:
• Forecast what will happen.
Causal Analytics:
• Variables are related. Understanding the effect of a change in one variable on another variable.
Deep Analytics:
• Uses multi-source data, combining some or all of the above analytics.

Data Analytics is needed everywhere
• Recommendation engines
• Smart meter monitoring
• Equipment monitoring
• Advertising analysis
• Life sciences research
• Fraud detection
• Healthcare outcomes
• Weather forecasting for business planning
• Oil & Gas exploration
• Social network analysis
• Churn analysis
• Traffic flow optimization
• IT infrastructure & Web App optimization
• Legal discovery and document archiving
• Intelligence Gathering
• Location-based tracking & services
• Pricing Analysis
• Personalized Insurance

Machine Learning in Spark
• Makes ML easy
• Standard, common interface for different ML algorithms
• Contains algorithms and utilities
• It has two machine learning libraries:
• spark.mllib: the original API built on RDDs. May be deprecated soon.
• spark.ml: the newer, higher-level API built on Data Frames.
• Spark's machine learning algorithms use these data types…
• Local Vector
• Labeled Point
• All data submitted to ML must be converted to these data types.
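The conversion step can be pictured as turning each raw record into a numeric label plus a numeric feature vector. The sketch below is JDK-only and does not use Spark's own vector and labeled point classes; the class LabeledRow, the method fromCsv and the assumption that the first CSV column holds the label are all hypothetical.

```java
import java.util.Arrays;

// JDK-only illustration, not Spark code: a stand-in for a labeled point,
// holding one label and a local vector of features.
class LabeledPointSketch {

    static final class LabeledRow {
        final double label;
        final double[] features;

        LabeledRow(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Assumes the first column is the label and all other columns are numeric features.
    static LabeledRow fromCsv(String line) {
        String[] cols = line.split(",");
        double label = Double.parseDouble(cols[0].trim());
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new LabeledRow(label, features);
    }

    public static void main(String[] args) {
        LabeledRow row = fromCsv("1.0, 5.1, 3.5");
        System.out.println("label=" + row.label + " features=" + Arrays.toString(row.features));
    }
}
```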

Other algorithms supported
• Decision Tree
• Dimensionality Reduction
• Random Forest
• Linear Regression
• Naïve Bayes Classification
• K-Means Clustering
• Recommendation Engines
• …. And many more

Q & A
Contact: chandrashekhardeshpande@synergetics-india.com,
maheshshinde@synergetics-india.com

References
• Online reference:
• http://spark.apache.org/docs/latest/index.html
• http://spark.apache.org/docs/latest/programming-guide.html
• http://spark.apache.org/docs/latest/api/java/index.html
