Build COMPETENCY
across your TEAM
Apache Spark
On HDInsight
Chandrashekhar Deshpande
Module 1
Introduction to Apache Spark
Apache Spark Overview
• An engine for processing big data faster than MapReduce (MR), in an easy and highly scalable way
• An open-source, parallel, in-memory cluster computing framework
• A solution for loading, processing, and analyzing large-scale data end to end
• Iterative and interactive: APIs in Scala, Java, Python, and R, plus a command-line interface
• Stream processing (real-time streams and DStreams)
• Unifies batch processing, streaming, and machine learning for big data
• Widely used by companies such as Amazon, eBay, and Yahoo
• Works well with Apache Kafka, ZeroMQ, Cassandra, etc.
• A powerful platform for implementing Lambda and Kappa architectures
Spark Evolution
• The latest Spark release at the time of this training is 2.3
• This training uses 2.1.1
Spark - Benefits
Performance
Using in-memory computing, Spark is
considerably faster than Hadoop (100x in
some tests).
Can be used for batch and real-time data
processing.
Developer Productivity
Easy-to-use APIs for processing large
datasets.
Includes 100+ operators for transforming.
Ecosystem
Spark has built-in support for many data
sources such as HDFS, RDBMS, S3, Apache
Hive, Cassandra and MongoDB.
Runs on top of the Apache YARN resource
manager.
Unified Engine
Integrated framework includes higher-level
libraries for interactive SQL queries,
processing streaming data, machine learning
and graph processing.
A single application can combine all types of
processing
Spark is fast
Spark is the current (2014) Sort Benchmark winner.
3x faster than 2013 winner (Hadoop).
tinyurl.com/spark-sort
… especially for iterative applications
Logistic regression on a 100-node cluster with 100 GB of data
[Figure: bar chart "Logistic Regression" comparing Hadoop and Spark 0.9]
Tuples of MR Vs. RDDs of Spark
[Figure: tuples in MapReduce compared with RDDs in Spark]
Spark in-memory
• Spark performs all intermediate steps in memory.
• Execution is faster because there are fewer reads and writes to secondary storage.
• Memory intensive.
• The in-memory objects are called Resilient Distributed Datasets (RDDs).
• RDDs are partitioned in-memory objects distributed across multiple worker machines. Lost partitions are recomputed from lineage; replication is optional and controlled by the storage level.
A unified Framework
A unified, open-source, parallel data processing framework for big data analytics
Spark Core Engine
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation
Cluster managers: YARN, Mesos, Standalone Scheduler
Advantages of Unified Platform
[Figure: Spark Streaming, Machine Learning, and Spark SQL on one platform]
Spark in Lambda Architecture
Spark Framework
Programming interfaces: Scala, Java, Python, R
Libraries: SQL, MLlib, GraphX, Streaming
Engine: Spark Core
Platforms: YARN, Mesos, Standalone Scheduler
Storage: Local, HDFS, HBase, RDBMS, NoSQL, AWS S3, Azure Blob Storage / Data Lake Store
Spark Modes
• Batch mode: A scheduled program that a scheduler runs periodically to process data.
• Interactive mode: Execute Spark commands through the Spark interactive shell.
• The shell provides a default Spark Context and acts as the driver program. The Spark Context runs tasks on the cluster.
• Stream mode: Process streaming data in real time.
Spark Scalability
Single box (local, stand-alone)
• All components (driver, executors) run within the same JVM.
• Partitions data across multiple cores.
• Runs as a single process.
Managed cluster
• Can scale from 2 to 1000 nodes.
• Can use different cluster managers such as YARN and Mesos.
• Partitions data across all nodes.
Area of applicability
• Data Integration and ETL
• Interactive Analytics
• High performance batch and micro-batch computations
• Advanced and complex analytics
• Machine learning
• Real time stream processing including IoT
• Examples:
• Market trends and patterns
• Sales prediction
• Credit card fraud detection
• Network intrusion detection
• Advertisement targeting
• Customer 360 analysis
Spark – Use cases
Use case | Description | Users
Data integration and ETL | Cleansing and combining data from diverse sources | Palantir: data analytics platform
Interactive analytics | Gain insight from massive data sets in ad hoc investigations or regularly planned dashboards | Goldman Sachs: analytics platform; Huawei: query platform in the telecom sector
High-performance batch computation | Run complex algorithms against large-scale data | Novartis: genomic research; MyFitnessPal: process food data
Machine learning | Predict outcomes to make decisions based on input data | Alibaba: marketplace analysis; Spotify: music recommendation
Real-time stream processing | Capturing and processing data continuously with low latency and high reliability | Netflix: recommendation engine; British Gas: connected homes
Module 2
Spark Architecture and Basics
Spark Architecture
• Spark Master: Manages multiple applications. In HDInsight, it also manages resources at the cluster level.
• Spark Driver: One per application; manages the application's workflow.
• Spark Context: Created by the driver. Keeps track of RDDs and their metadata, and provides the API for using Spark's features.
• Worker Node: Reads data from and writes data to HDFS/storage.
Spark Execution components
• Driver Program
• The main initiating program, in which Spark operations are defined. It runs on the master node.
• It controls and coordinates all operations.
• It defines RDDs.
• Each driver program execution is an application; each action within it triggers a 'Job'.
Spark Execution components
• Spark Context
• Provides access to Spark functionalities.
• Represents connection to the computing cluster.
• Builds, partitions, and distributes RDDs across the cluster.
• Works with the cluster manager.
• Splits a job into parallel tasks and executes them on worker nodes.
• Collects and accumulates results and presents them to the driver program.
Resilient Distributed Datasets
• Most Spark operations work with RDDs: we create, transform, analyze, and store RDDs.
• RDDs are fast to access because they are stored in memory.
• Partitioned and distributed: the data is divided into parts, and each part resides on a node in the cluster.
• The datasets can consist of strings, rows, objects, or collections.
• They are immutable.
• To change one, apply a transformation, which creates a new RDD.
• They can be cached and persisted.
• Actions produce summarized results.
Demo 1
TestSpark010_FirstProgram.java
Aim: This program basically demonstrates...
1. Configuring Spark context.
2. Using Resilient Distributed Datasets.
3. Introduction to Mappings, Filtering, Actions etc.
4. Creating String RDD from CSV text file.
5. Applying simple Map to convert strings to Upper case
6. Browser monitoring of Spark Context.
Spark architecture
[Figure: a master node connected to worker nodes]
Module 3
Working with RDDs and Paired RDDs in Spark
Loading data.
• RDD Loading sources…
• Text files
• JSON files
• Sequence files of HDFS
• Parallelize on collection
• Java Collection
• Python Lists
• R data frames
• RDBMS/NoSQL
• Use direct API
• Bring the data in first as a collection (DAO classes) and then create the RDD.
• Very large data sets
• Load the data into HDFS outside Spark (for example with Apache Sqoop) and then create RDDs from it.
• Spark aligns its own partitioning with Hadoop's partitioning.
Storing data
• RDDs can be stored in a variety of data sinks.
• Text files
• JSON
• Sequence Files
• Collection
• RDBMS/NoSQL
• For persistence
• Spark Utilities
• Language specific support
Spark's power lies in processing data in a distributed manner.
Although Spark provides APIs for loading and sinking data,
that is not where its real power lies. External tools
are recommended for simplicity and performance.
Lazy Evaluation
• Spark does not load or transform data until an action is encountered.
• Step 1: Load a file's content into an RDD.
• Step 2: Apply a filter.
• Step 3: Count the number of records (only now are Steps 1 to 3 executed).
• This holds even in interactive mode.
• Lazy evaluation helps Spark optimize operations and manage resources better.
• It makes troubleshooting harder: a problem in loading is detected only when the action executes.
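The three steps above can be tried without a cluster. The following is a minimal sketch in plain Java, not the Spark API: it uses java.util.stream, whose intermediate operations are also lazy, as a stand-in for RDD transformations. The class name, the sample records, and the counter are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Plain-Java analogy for lazy evaluation (not the Spark API).
class LazyEvaluationSketch {

    // Counts how many records the filter has actually inspected.
    static final AtomicInteger inspected = new AtomicInteger();

    // Steps 1 and 2: declare the source and the filter. Nothing runs yet.
    static Stream<String> buildPipeline(List<String> lines) {
        return lines.stream()
                .filter(line -> {
                    inspected.incrementAndGet();
                    return line.contains("sedan");
                });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "toyota,sedan", "ford,truck", "honda,sedan");

        Stream<String> pipeline = buildPipeline(lines);
        System.out.println("Inspected before action: " + inspected.get());

        // Step 3: the terminal operation (the "action") triggers the work.
        long matches = pipeline.count();
        System.out.println("Matches: " + matches);
        System.out.println("Inspected after action: " + inspected.get());
    }
}
```

The counter stays at zero until count() runs, which mirrors why a loading problem in Spark surfaces only when the action executes.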
Transformations
• Recall: RDDs are immutable.
• Transformation: Operation on RDD to create a new RDD.
• Examples: map, flatMap, filter, etc.
• Operate on one element at a time.
• Evaluated lazily.
• Distributed across multiple nodes and executed independently by the executors in the cluster, each on its local part of the RDD.
• Each executor creates its own subset of the resulting RDD.
Transformations: Maps
JavaRDD<String> mutateRDD1 = autoAllData.map(function);
• Similar to the map step of Hadoop MapReduce.
• Element-level computation or transformation.
• The result RDD has the same number of elements as the source RDD.
• The result type may be different.
• In Java it accepts a lambda expression or an anonymous class.
• Scala/Python: an inline function or a function reference is allowed.
• Use cases:
• Data standardization, for example of names.
• Data type conversion, e.g. from String to a custom object.
• Data computation such as tax calculations.
• Adding new attributes, such as a calculated grade.
• Data checking, cleansing, etc.
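As an illustration of the standardization and type-conversion use cases, here is a minimal sketch in plain Java (java.util.stream rather than the Spark API). The Auto class, its fields, and the two-column CSV layout are hypothetical; the point is only that each input element produces exactly one output element, possibly of a different type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of an element-level map (not the Spark API).
class MapSketch {

    // Hypothetical target type; the field names are assumptions.
    static class Auto {
        final String make;
        final double price;

        Auto(String make, double price) {
            this.make = make;
            this.price = price;
        }
    }

    // One line in, one Auto out: standardizes the name and converts the type.
    static Auto parse(String csvLine) {
        String[] parts = csvLine.split(",");
        return new Auto(parts[0].trim().toUpperCase(),
                Double.parseDouble(parts[1].trim()));
    }

    // Applies parse to every element; the result has the same size as the input.
    static List<Auto> mapAll(List<String> lines) {
        return lines.stream().map(MapSketch::parse).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Auto> autos = mapAll(Arrays.asList("toyota, 21000", "ford, 32500.5"));
        for (Auto a : autos) {
            System.out.println(a.make + " " + a.price);
        }
    }
}
```

In Spark's Java API, a function of the same shape would be passed to map on a JavaRDD, as in the statement at the top of this slide.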
Transformations: Filters
JavaRDD<String> mutateRDD1 = autoAllData.filter(function);
• Selects the elements of an RDD that pass a given criterion.
• The result RDD is the same size as or smaller than the original, because records that fail the criterion are eliminated.
• filter() takes a function that returns a Boolean value.
• In Java it accepts a lambda expression (a predicate) or an anonymous class.
• Scala/Python: an inline function or a function reference is allowed.
Actions
• Act on the entire RDD to reduce it to a precise, consolidated result.
• Max/min, summarization
• Spark runs all pending, lazily defined processing when it encounters an action.
• Simple actions
• The collect() operation: Converts the RDD into a collection.
• The count() operation: Counts the number of elements in the RDD.
• The first() operation: Returns the first record (a String for a String RDD).
• The take(n) operation: Returns the first 'n' elements as a list.
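The behavior of these simple actions can be mimicked on an ordinary list. The sketch below is plain Java, not the Spark API; the method names echo the actions above, while the class name and the "longest record" reduction are illustrative assumptions standing in for a max/min style summarization.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java counterparts of simple actions (not the Spark API).
class ActionsSketch {

    // Like count(): the number of elements.
    static long count(List<String> data) {
        return data.size();
    }

    // Like first(): the first element; assumes the data is not empty.
    static String first(List<String> data) {
        return data.get(0);
    }

    // Like take(n): at most the first n elements.
    static List<String> take(List<String> data, int n) {
        return data.stream().limit(n).collect(Collectors.toList());
    }

    // A summarizing reduction: the longest record, or "" for empty data.
    static String longest(List<String> data) {
        return data.stream().reduce("", (a, b) -> b.length() > a.length() ? b : a);
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("ford", "toyota", "kia");
        System.out.println(count(data));
        System.out.println(first(data));
        System.out.println(take(data, 2));
        System.out.println(longest(data));
    }
}
```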
Module 4
Apache Spark on Azure HDInsight
Apache Spark on HDInsight
[Figure: Spark at the center, connected to Azure Storage, Azure Data Lake Store, Hive and HBase, Azure Data Factory, Event Hub, Apache Kafka, Apache Flume, and orchestration]
Spark Support on HDInsight
Feature | Description
SLA | 99.9% uptime
Ease of creating a cluster | Possible using the Azure Portal, Azure PowerShell, or the HDInsight .NET SDK.
Ease of use | For interactive data processing and visualization, Jupyter and Zeppelin notebooks are provided.
REST APIs | Spark clusters in HDInsight include Livy, a REST API-based Spark job server for remotely submitting and monitoring jobs.
Azure Data Lake | Azure Data Lake Store can be used as primary storage (HDInsight 3.5 onwards) or as additional storage.
Integration with Azure services | Provides connectors for Azure Event Hub and Kafka.
R Server | R Server can be set up to run R computations.
Concurrent queries | Supports concurrent queries. This enables multiple queries from one user, or multiple queries from various users and applications, to share the same cluster resources.
SSD cache | Data can be cached either in memory or on SSD for better performance.
BI tools integration | Connectors are available for Power BI and Tableau for data analytics.
Machine learning libraries | 200 preloaded Anaconda libraries for machine learning, data analysis, and visualization.
Scalability | The cloud's prominent feature.
Azure Storage for HDInsight
Azure Data Lake Store
• Enterprise-wide hyper-scale repository for big data analytic workloads.
• Can capture data of any size, type, and ingestion speed in one single place
for operational and exploratory analytics.
• Hadoop accesses Data Lake Store through a WebHDFS-compatible REST API.
• Tuned for performance in data analytics scenarios.
• Supports enterprise-grade capabilities: security, manageability, scalability, reliability, and availability.
• Stores a variety of data in its native format without transformation. Can handle structured, semi-structured, and unstructured data.
Azure Storage Vs. Azure Data Lake
Aspect | Azure Data Lake Store | Azure Blob Storage
Purpose | Hyper-scale repository for big data analytics workloads | General-purpose scalable object store
Use cases | Batch, interactive, and streaming analytics and machine learning data such as log files, IoT data, click streams, and large datasets | Any type of text or binary data, such as application back-end data, backup data, media storage for streaming, and general-purpose data
Key concepts | Folders containing data in files | Containers containing data in blobs
Structure | Hierarchical file system | Object store with a flat namespace
Authentication | Azure AD identity | Account access keys, shared access signature keys
Analytics performance | Optimized for parallel analytical workloads | Not optimized for analytics workloads
Size limits | No limit on file size or number of files | Specific limits as mentioned in the documentation
Geo-redundancy | Locally redundant | Locally redundant, globally redundant, read-access globally redundant
Azure Data Factory
• A cloud data integration service, to compose data storage,
movement, and processing services into automated data
pipelines.
• Can handle ETL and complex hybrid ETL.
• Allows you to create data-driven workflows in the cloud for
orchestrating and automating data movement and data
transformation.
Azure Event Hub
• A highly scalable ingestion system
• Can ingest millions of events per second, enabling an application to process
and analyze the massive amounts of data produced by your connected
devices and applications.
• Acts as an event ingestor, an intermediary between event publishers and event consumers.
• Decouples the production of event streams from the consumption mechanism.
• Enables behavior tracking in mobile apps, traffic information from web
farms, in-game event capture in console games, or telemetry collected from
industrial machines, connected vehicles, or other devices.
Event Hub
Demo 3.1 : Spark cluster on HDInsight
Creating an HDInsight Spark Cluster
HDInsight Spark Dashboard
HDInsight Spark Resource Manager
The Resource Manager
enables you to control the
number of cores and amount
of memory allocated to Spark
cluster components and
notebooks.
Increasing the resources
allocated to the Thrift Server
can potentially improve the
performance with BI Tools
Resizing an HDInsight Spark Cluster
Notebooks: Jupyter and Zeppelin
HDInsight Spark: Jupyter Notebooks
A Jupyter notebook showing a Spark program in Python 2
HDInsight Spark: Zeppelin Notebooks
Zeppelin notebook must be connected to a
Spark cluster to run
A notebook ‘paragraph’ can be executed
by clicking this icon
Like Jupyter, Zeppelin enables interactive
charts and graphs to be easily included in
a notebook.
You can control the visualization using the
“Settings” drop-down menu
A number of charts and graphs are already
built into the Zeppelin notebook
Visualization tools: Power BI
Module 5
Apache Spark SQL and Data Frames
Overview
Spark SQL
• A library built on Spark to support SQL-like operations.
• Hides RDDs behind a simpler API.
• Traditional RDBMS developers can easily transition to big data.
• Works with structured data that has a schema.
• Seamlessly mixes SQL queries with Spark programs.
• Supports JDBC.
• Works with RDBMS and NoSQL sources.
Data Frames
Spark Session
• The entry point for Spark SQL, as SparkContext is for RDDs.
• Provides Data Frames and temp tables.
Data Frames
• RDDs are for Spark Core while Data Frames are for Spark SQL.
• Built upon RDDs
• It’s a distributed collection of data organized as Rows and Columns.
• Has a schema with column names and column types.
• Interoperability with…
• Collections, CSV, databases, Hive/NoSQL tables, JSON, RDDs, etc.
Operations on Data Frames
• filter: Like a 'where' clause.
• join: Like joins in SQL.
• groupBy: Groups rows for consolidation.
• agg: Computes aggregations such as sum and average.
• Allows mapping and reducing.
• Operations can be nested.
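To make the filter, groupBy, and agg combination concrete, here is a minimal sketch in plain Java collections, not the Spark Data Frame API. It computes the equivalent of SELECT region, SUM(amount) FROM sales WHERE amount >= ? GROUP BY region. The Sale class, its fields, and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy for filter + groupBy + agg (not the Spark API).
class DataFrameOpsSketch {

    // Hypothetical row type with two columns.
    static class Sale {
        final String region;
        final double amount;

        Sale(String region, double amount) {
            this.region = region;
            this.amount = amount;
        }
    }

    // filter (the 'where' clause), then group by region, then aggregate with sum.
    static Map<String, Double> totalByRegion(List<Sale> rows, double minAmount) {
        return rows.stream()
                .filter((Sale r) -> r.amount >= minAmount)
                .collect(Collectors.groupingBy(
                        (Sale r) -> r.region,
                        TreeMap::new,
                        Collectors.summingDouble((Sale r) -> r.amount)));
    }

    public static void main(String[] args) {
        List<Sale> rows = Arrays.asList(
                new Sale("east", 10.0), new Sale("west", 5.0),
                new Sale("east", 2.5), new Sale("west", 0.5));
        System.out.println(totalByRegion(rows, 1.0));
    }
}
```

The nesting of operations mentioned above shows up here as the chained calls: the output of the filter feeds the grouping, which in turn feeds the aggregation.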
Spark SQL
• Spark SQL :
• Not intended for interactive/exploratory analysis.
• Spark SQL reuses the Hive frontend and meta-store.
• Gives full compatibility with existing Hive data, queries, and UDFs.
• Spark SQL includes a cost-based optimizer, columnar storage and code generation to make
queries fast.
• It scales to thousands of nodes and multi hour queries using the Spark engine. Performance
is its biggest advantage.
• Provides full mid-query fault tolerance.
Module 6
Apache Spark Streaming
What is Streaming?
[Figure: a streaming engine. Sources (devices, sensors, web servers) feed input adapters; standing queries (query logic) run inside the streaming engine; output adapters deliver results to targets (pagers and monitoring devices, KPI dashboards). Example application at runtime: ATM security.]
Why Spark Streaming?
• One of the real strengths of Spark
• Typically, analytics is performed on data at rest:
• Databases, flat files, historical data, survey data, etc.
• Real-time analytics is performed on data the moment it is generated:
• Complex event processing, fraud detection, clickstream processing, etc.
• What can Spark Streaming do?
• Look at data the moment it arrives from the source
• Transform, summarize, analyze
• Perform machine learning
• Predict in real time
Spark Streaming and MicroBatch Processing
Spark Streaming
Credit card fraud detection with high scalability and parallelism.
Spam filtering
Network intrusion detection
Real time social media analytics
Click Stream analytics
Stock market analysis
Advertising analytics
Spark Streaming architecture
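[Figure: the master node runs the driver program with the Spark Context and the Stream Context and talks to the cluster manager. On one worker node, an executor runs a long task, the receiver, which is attached to the input source. On the other worker nodes, executors run tasks and hold a cache.]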
Spark Streaming architecture
A master node runs the driver program with the Spark Context.
A streaming context is created from the Spark Context.
One of the worker nodes is assigned a long-running task: listening to a source.
The receiver keeps receiving data from the input source.
The receiver propagates the data to the worker nodes.
The normal tasks on the worker nodes act upon the data.
The DStream
• Discretized stream
• Created from the streaming context
• A micro-batch window is set up for the DStream (normally in seconds)
• The micro-batch window is a small time slice (around 3 sec) in which the real-time data generated is accumulated as a batch and wrapped in an RDD. The continuous sequence of these RDDs is the DStream.
• The DStream allows all RDD operations.
• Common data can be shared through global variables.
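The idea of a micro-batch window can be shown with a few timestamps. The sketch below is plain Java, not the Spark Streaming API: it assigns each event to a batch according to a fixed batch interval. The class name, the millisecond timestamps, and the 3-second interval used in main are assumptions chosen to match the "around 3 sec" figure above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of micro-batching (not the Spark Streaming API).
class MicroBatchSketch {

    // Groups non-negative event timestamps (milliseconds) into consecutive
    // batches of batchMillis each; the key is the batch index.
    // Intervals without events simply have no entry in this sketch.
    static Map<Long, List<Long>> toBatches(List<Long> eventTimesMillis, long batchMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimesMillis) {
            long batchIndex = t / batchMillis;
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(0L, 1000L, 2999L, 3000L, 7500L);
        // A 3-second batch interval.
        System.out.println(toBatches(events, 3000L));
    }
}
```

Each value in the resulting map plays the role of one batch; in Spark Streaming each such batch would be an RDD within the DStream.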
The Dstream windowing functions
• They compute across multiple batches of a DStream.
• All RDD functions can be applied to the data accumulated from the last X batches.
• Ex: Accumulate the last 3 batches together; average of some value over the last 5 batches.
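A windowed average over the last few batches can be sketched as follows. This is plain Java, not the Spark Streaming API: the per-batch values, the class name, and the choice to measure the window length and sliding interval in batches rather than seconds are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of a sliding window over batches (not the Spark API).
class WindowSketch {

    // batchValues holds one value per batch, oldest first. Returns the average
    // over each full window of windowLength batches, moving forward by
    // slideInterval batches between windows.
    static List<Double> windowAverages(List<Double> batchValues,
                                       int windowLength, int slideInterval) {
        List<Double> averages = new ArrayList<>();
        for (int end = windowLength; end <= batchValues.size(); end += slideInterval) {
            double sum = 0;
            for (int i = end - windowLength; i < end; i++) {
                sum += batchValues.get(i);
            }
            averages.add(sum / windowLength);
        }
        return averages;
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
        // Overlapping windows: last 3 batches, recomputed after every batch.
        System.out.println(windowAverages(values, 3, 1));
        // Non-overlapping windows: slide by the full window length.
        System.out.println(windowAverages(values, 3, 3));
    }
}
```

When the sliding interval equals the window length, the windows do not overlap; a smaller interval makes consecutive windows share batches.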
Windowing in Spark
[Figure: a timeline numbered 1 to 12, showing a first 8-second window followed by a second 8-second window]
Windowing Sample Code
[Figure: sample code with callouts for the window operation, callback operation, window length, and sliding interval]
“Exactly once”, “At Least Once” Guarantees
Streaming with Azure Event Hub
[Figure: Azure Event Hub, then HDInsight Spark Streaming, then Power BI]
Module 7
Apache Spark : Analytics and Machine Learning
Types of Analytics
Descriptive Analytics:
• Defines the problem statement: what exactly happened.
Exploratory Data Analytics:
• Why something is happening.
Inferential Analytics:
• Understand a population from a sample: take a sample and extrapolate to the whole population.
Predictive Analytics:
• Forecast what will happen.
Causal Analytics:
• Variables are related: understand the effect of a change in one variable on another variable.
Deep Analytics:
• Uses multiple data sources, combining some or all of the above analytics.
Data analytics is needed everywhere:
• Recommendation engines
• Smart meter monitoring
• Equipment monitoring
• Advertising analysis
• Life sciences research
• Fraud detection
• Healthcare outcomes
• Weather forecasting for business planning
• Oil & gas exploration
• Social network analysis
• Churn analysis
• Traffic flow optimization
• IT infrastructure & web app optimization
• Legal discovery and document archiving
• Intelligence gathering
• Location-based tracking & services
• Pricing analysis
• Personalized insurance
Machine Learning in Spark
• Makes ML easy
• A standard, common interface for different ML algorithms.
• Contains algorithms and utilities.
• It has two machine learning libraries:
• spark.mllib: The original API, built on RDDs. In maintenance mode since Spark 2.0 and may be deprecated.
• spark.ml: The newer, higher-level API, built on Data Frames.
• Spark's machine learning algorithms use these data types:
• Local Vector
• Labeled Point
• All data submitted to ML must be converted to these data types.
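The conversion step can be pictured as turning each raw record into a label plus a numeric feature vector. The sketch below is plain Java and deliberately does not use Spark's own vector and labeled point classes; the Point class, the CSV layout, and the convention that the first column is the label are all assumptions for illustration.

```java
import java.util.Arrays;

// Plain-Java picture of a labeled point (not Spark's own classes).
class LabeledPointSketch {

    // Hypothetical holder: one numeric label plus a dense feature vector.
    static class Point {
        final double label;
        final double[] features;

        Point(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Assumes the first column is the label and the rest are numeric features.
    static Point parse(String csvLine) {
        String[] parts = csvLine.split(",");
        double[] features = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            features[i - 1] = Double.parseDouble(parts[i].trim());
        }
        return new Point(Double.parseDouble(parts[0].trim()), features);
    }

    public static void main(String[] args) {
        Point p = parse("1.0, 0.5, 2.0");
        System.out.println(p.label + " " + Arrays.toString(p.features));
    }
}
```

Non-numeric columns would first need to be encoded as numbers before they can take part in such a vector.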
Other algorithms supported
• Decision Tree
• Dimensionality Reduction
• Random Forest
• Linear Regression
• Naïve Bayes Classification
• K-Means Clustering
• Recommendation Engines
• …. And many more
Q & A
Contact: chandrashekhardeshpande@synergetics-india.com,
maheshshinde@synergetics-india.com
References
• Online reference:
• http://spark.apache.org/docs/latest/index.html
• http://spark.apache.org/docs/latest/programming-guide.html
• http://spark.apache.org/docs/latest/api/java/index.html
Thank You

More Related Content

What's hot

Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
inside-BigData.com
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
prajods
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Data Science
Data ScienceData Science
Data Science
Ahmet Bulut
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
Cepoi Eugen
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
Shravan (Sean) Pabba
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Samy Dindane
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark Summit
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualization
hadoopsphere
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
Abdullah Çetin ÇAVDAR
 

What's hot (20)

Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Data Science
Data ScienceData Science
Data Science
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualization
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 

Similar to Apache Spark on HDinsight Training

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache spark
Apache sparkApache spark
Apache spark
Sameer Mahajan
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Spark
SparkSpark
Apache spark
Apache sparkApache spark
Apache spark
Prashant Pranay
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
Navid Kalaei
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 

Apache Spark on HDInsight Training

  • 1. Build COMPETENCY across your TEAM Apache Spark On HDInsight Chandrashekhar Deshpande
  • 3. Apache Spark Overview • An engine for processing big data in a faster (than MapReduce), easier and extremely scalable way • An open-source, parallel, in-memory cluster computing framework • A solution for loading, processing and analyzing large-scale data end to end • Iterative and interactive: APIs for Scala, Java, Python and R, plus a command-line interface • Stream processing (real-time streams and DStreams) • Unifies batch processing, streaming and machine learning for big data • Widely used by companies such as Amazon, eBay and Yahoo • Works well with Apache Kafka, ZeroMQ, Cassandra etc. • A powerful platform for implementing Lambda and Kappa architectures
  • 4. Spark Evolution • The most recent release of Spark is 2.3 • We will work with version 2.1.1
  • 5. Spark - Benefits • Performance: Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests). Can be used for batch and real-time data processing. • Developer Productivity: Easy-to-use APIs for processing large datasets. Includes 100+ operators for transforming data. • Ecosystem: Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB. Runs on top of the Apache YARN resource manager. • Unified Engine: The integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing. A single application can combine all types of processing.
  • 6. Spark is fast • Spark is the current (2014) Sort Benchmark winner. • 3x faster than the 2013 winner (Hadoop). • tinyurl.com/spark-sort
  • 7. … especially for iterative applications • Logistic regression on a 100-node cluster with 100 GB of data [Chart: Logistic Regression, Hadoop vs. Spark 0.9]
  • 8. Tuples of MapReduce vs. RDDs of Spark [Diagram: tuples in MapReduce; RDDs in Spark]
  • 9. Spark in-memory • Spark performs intermediate steps in memory. • Faster execution, with fewer reads and writes to secondary storage. • Memory intensive. • The in-memory objects are called Resilient Distributed Datasets (RDDs). • RDDs are partitioned in-memory objects spread across multiple worker machines; replicas are kept only when a replicated storage level is chosen, and lost partitions are otherwise recomputed from lineage.
  • 10. A unified framework • A unified, open-source, parallel data processing framework for big data analytics • Libraries: Spark SQL (interactive queries), Spark Streaming (stream processing), Spark MLlib (machine learning), GraphX (graph computation) • Engine: Spark Core • Cluster managers: YARN, Mesos, Standalone Scheduler
  • 11. Advantages of Unified Platform [Diagram: Spark Streaming, Machine Learning, Spark SQL]
  • 12. Spark in Lambda Architecture
  • 13. Spark Framework • Programming interfaces: Scala, Java, Python, R • Libraries: SQL, MLlib, GraphX, Streaming • Engine: Spark Core • Platforms: YARN, Mesos, Standalone Scheduler • Storage: Local, HDFS, HBase, RDBMS, NoSQL, AWS S3, Azure Blob Storage / Data Lake Store
  • 14. Spark Modes • Batch mode: A program run periodically by a scheduler to process data. • Interactive mode: Execute Spark commands through the Spark interactive shell. • The shell provides a default Spark context and works as the driver program. The Spark context runs tasks on the cluster. • Stream mode: Process streaming data in real time.
  • 15. Spark Scalability • Single box (local, stand-alone): All components (driver, executors) run within the same JVM. Partitions data across multiple cores. Runs in single-threaded mode. • Managed cluster: Can scale from 2 to 1,000 nodes. Can use different cluster managers such as YARN, Mesos etc. Partitions data across all nodes.
  • 16. Areas of applicability • Data integration and ETL • Interactive analytics • High-performance batch and micro-batch computations • Advanced and complex analytics • Machine learning • Real-time stream processing, including IoT • Examples: • Market trends and patterns • Predicting sales • Credit card fraud detection • Network intrusion detection • Advertisement targeting • Customer 360 analysis
  • 17. Spark – Use cases • Data integration and ETL: Cleansing and combining data from diverse sources. Users: Palantir (data analytics platform). • Interactive analytics: Gaining insight from massive data sets in ad hoc investigations or regularly planned dashboards. Users: Goldman Sachs (analytics platform), Huawei (query platform in the telecom sector). • High-performance batch computation: Running complex algorithms against large-scale data. Users: Novartis (genomic research), MyFitnessPal (processing food data). • Machine learning: Predicting outcomes to make decisions based on input data. Users: Alibaba (marketplace analysis), Spotify (music recommendation). • Real-time stream processing: Capturing and processing data continuously with low latency and high reliability. Users: Netflix (recommendation engine), British Gas (connected homes).
  • 19. Spark Architecture • Spark Master: Manages a number of applications. In HDInsight, it also manages resources at the cluster level. • Spark Driver: One per application; manages the workflow of that application. • Spark Context: Created by the driver; keeps track of RDDs and their metadata. Provides the API for using the various features of Spark. • Worker Node: Reads and writes data from and to HDFS/storage.
  • 20. Spark Execution components • Driver Program • The main initiating program in which Spark operations are defined. It is executed on the master node. • It controls and coordinates all operations. • It defines RDDs. • Each driver program execution is a ‘Job’.
  • 21. Spark Execution components • Spark Context • Provides access to Spark functionality. • Represents the connection to the computing cluster. • Builds, partitions and distributes RDDs across the cluster. • Works with the cluster manager. • Splits a job into parallel tasks and executes them on worker nodes. • Collects and accumulates results and presents them to the driver program.
  • 22. Resilient Distributed Datasets • Operations in Spark mostly work on RDDs. We create, transform, analyze and store RDDs in Spark operations. • RDDs are fast to access because they are stored in memory. • Partitioned and distributed: an RDD is divided into parts, and each part resides on a node of the cluster. • The data sets can consist of strings, rows, objects or collections. • They are immutable. • To change one, apply a transformation and create a new RDD. • They can be cached and persisted. • Actions produce summarized results.
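
The partitioning and immutability points on the slide above can be modelled in plain Python. This is only an illustrative sketch, not Spark code: the helper names `partition` and `transform` are hypothetical, the sample data is made up, and tuples stand in for the partitions that Spark would place on different worker nodes.

```python
from typing import Callable, Sequence, Tuple

Partitions = Tuple[tuple, ...]


def partition(data: Sequence, num_parts: int) -> Partitions:
    """Split data into num_parts immutable chunks (round-robin)."""
    return tuple(tuple(data[i::num_parts]) for i in range(num_parts))


def transform(parts: Partitions, fn: Callable) -> Partitions:
    """Apply fn to every element, partition by partition, returning NEW partitions."""
    return tuple(tuple(fn(x) for x in part) for part in parts)


cars = ["audi", "bmw", "honda", "toyota", "volvo"]   # made-up sample data
source = partition(cars, 2)            # two partitions, as if on two workers
upper = transform(source, str.upper)   # a new dataset; source is left untouched
```

Because tuples cannot be modified in place, the only way to "change" `source` is to derive a new value from it, which mirrors how a transformation yields a new RDD.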
  • 23. Demo 1 TestSpark010_FirstProgram.java Aim: This program basically demonstrates... 1. Configuring Spark context. 2. Using Resilient Distributed Datasets. 3. Introduction to Mappings, Filtering, Actions etc. 4. Creating String RDD from CSV text file. 5. Applying simple Map to convert strings to Upper case 6. Browser monitoring of Spark Context.
  • 25. Module 3 Working with RDD’s and Paired RDDs in Spark
  • 26. Loading data • RDD loading sources: • Text files • JSON files • Sequence files of HDFS • Parallelizing a collection (Java collections, Python lists, R data frames) • RDBMS/NoSQL: use a direct API, or first bring the data in as a collection (DAO classes) and then create an RDD. • Very large data sets: load the data into HDFS outside Spark (for example with Apache Sqoop) and then create RDDs from it. • Spark aligns its own partitioning with the partitioning of Hadoop.
  • 27. Storing data • RDDs can be stored in a variety of data sinks: • Text files • JSON • Sequence files • Collections • RDBMS/NoSQL • For persistence: • Spark utilities • Language-specific support • Spark's power lies in processing data in a distributed manner. Although Spark provides APIs for loading and sinking data, this is not where its real power lies. External tools are recommended for simplicity and performance.
  • 28. Lazy Evaluation • Spark will not load or transform data until an action is encountered. • Step 1: Load a file's content into an RDD. • Step 2: Apply a filter. • Step 3: Count the number of records (only now are Steps 1 to 3 executed). • This is true even in interactive mode. • Lazy evaluation helps Spark optimize operations and manage resources better. • It makes troubleshooting harder: any problem in loading is detected only when the action executes.
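
The three steps on the slide above can be imitated with a small plain-Python model. This is a sketch under stated assumptions rather than Spark's implementation: the class `LazyDataset`, the loader `load_lines` and the sample lines are all hypothetical, and the class only records the requested steps until an action (`count`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a plan and runs it only on an action."""

    def __init__(self, loader, steps=()):
        self._loader = loader   # zero-argument function that returns an iterator
        self._steps = steps     # recorded transformations, not yet executed

    def filter(self, predicate):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._loader, self._steps + (("filter", predicate),))

    def map(self, fn):
        # Transformation: also only recorded.
        return LazyDataset(self._loader, self._steps + (("map", fn),))

    def _run(self):
        records = self._loader()   # loading happens here, not at construction
        for kind, fn in self._steps:
            records = filter(fn, records) if kind == "filter" else map(fn, records)
        return records

    def count(self):
        # Action: triggers loading and every recorded transformation.
        return sum(1 for _ in self._run())


load_calls = []   # lets us observe when loading really happens


def load_lines():
    load_calls.append("loaded")
    return iter(["make,price", "audi,30000", "bmw,35000", "audi,28000"])


lines = LazyDataset(load_lines)                               # Step 1: nothing loaded yet
audis = lines.filter(lambda line: line.startswith("audi"))   # Step 2: still nothing loaded
calls_before_action = len(load_calls)
audi_count = audis.count()                                    # Step 3: steps 1 to 3 run now
```

If `load_lines` raised an error (say, for a missing file), the failure would surface only at `audis.count()`, which is the troubleshooting difficulty the slide mentions.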
  • 29. Transformations • Recall: RDDs are immutable. • Transformation: An operation on an RDD that creates a new RDD. • Examples: map, flatMap, filter etc. • Operate on one element at a time. • Evaluated lazily. • Distributed across multiple nodes: each executor in the cluster runs the transformation independently on its local part of the RDD. • Each executor creates its own subset of the resulting RDD.
  • 30. Transformations: Maps JavaRDD<String> mutateRDD1 = autoAllData.map(function); • Similar to the map step of Hadoop MapReduce. • Element-level computation or transformation. • The result RDD has the same number of elements as the source RDD. • The result type may be different. • In Java, it accepts lambda expressions or anonymous classes. • Scala/Python: inline functions and function references are allowed. • Use cases: • Data standardization, e.g. of names • Data type conversion, e.g. from String to a custom object • Data computation, such as tax calculations • Adding new attributes, such as a calculated grade • Data checking, cleansing etc.
  • 31. Transformations: Filters JavaRDD<String> mutateRDD1 = autoAllData.filter(function); • Selects the elements of an RDD that pass a given criterion. • Results in an RDD that is usually smaller than the original, since some records are eliminated. • filter() takes a function that returns a Boolean value. • In Java, it accepts lambda expressions (predicates) or anonymous classes. • Scala/Python: inline functions and function references are allowed.
  • 32. Actions • Act on the entire RDD to reduce it to a precise, consolidated result. • Examples: max/min, summarization • On encountering an action, Spark executes all the processing it has lazily deferred. • Simple actions: • The collect() operation: Converts the RDD into a collection. • The count() operation: Counts the number of elements in the RDD. • The first() operation: Returns the first record (a String for an RDD of strings). • The take(n) operation: Returns the first ‘n’ elements as a list.
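
As a rough plain-Python analogy for the map and filter transformations and the simple actions listed above, the sketch below uses Python's built-in lazy `map` and `filter`. It is not the Spark API: `auto_all_data` merely echoes the `autoAllData` name from the slides, the sample records are made up, and the list operations only stand in for `collect()`, `count()`, `first()` and `take(n)`.

```python
from itertools import islice

# Made-up sample records standing in for the CSV lines used in the demo.
auto_all_data = [
    "audi,sedan,30000",
    "bmw,suv,45000",
    "honda,hatchback,18000",
    "audi,suv,52000",
]


def upper_case(record: str) -> str:
    """Map function: element-level transformation, one output per input."""
    return record.upper()


def is_suv(record: str) -> bool:
    """Filter function: returns a Boolean deciding whether a record is kept."""
    return record.split(",")[1] == "suv"


def transformed():
    """Rebuild the lazy map/filter pipeline (Python iterators are single-use)."""
    return map(upper_case, filter(is_suv, auto_all_data))


collected = list(transformed())           # like collect(): whole result as a collection
count = sum(1 for _ in transformed())     # like count(): number of elements
first = next(transformed())               # like first(): first element only
take_1 = list(islice(transformed(), 1))   # like take(n) with n = 1
```

Nothing is computed when `transformed()` is called; the work happens only when one of the four action-like lines consumes the iterator.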
  • 33. Module 4 Apache Spark on Azure HDInsight
  • 34. Apache Spark on HDInsight [Diagram labels: Spark; Azure Storage; Azure Data Lake Store; Hive and HBase; Azure Data Factory; Event Hub; Apache Kafka; Apache Flume; Orchestration]
  • 35. Spark Support on HDInsight • SLA: 99.9% uptime. • Ease of creating a cluster: Possible using the Azure Portal, Azure PowerShell or the Azure HDInsight .NET SDK. • Ease of use: Jupyter and Zeppelin notebooks are provided for interactive data processing and visualization. • REST APIs: Spark clusters in HDInsight include Livy, a REST API based Spark job server for submitting and monitoring jobs remotely. • Azure Data Lake: The Azure Data Lake Store can be used as primary storage (HDInsight 3.5 onwards) or as additional storage. • Integration with Azure services: Provides connectors for Azure Event Hub and Kafka. • R Server: R Server can be set up to run R computations. • Concurrent queries: Supported. Multiple queries from one user, or from various users and applications, can share the same cluster resources. • SSD cache: Data can be cached either in memory or on SSD for better performance. • BI tools integration: Connectors are available for Power BI and Tableau for data analytics. • Machine learning libraries: 200 preloaded Anaconda libraries for machine learning, data analysis and visualization. • Scalability: The cloud's prominent feature.
  • 36. Azure Storage for HDInsight
  • 37. Azure Data Lake Store • Enterprise-wide hyper-scale repository for big data analytic workloads. • Can capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics. • Hadoop accesses Data Lake Store through the WebHDFS REST API. • Is tuned for performance in data analytics scenarios. • Supports all enterprise-grade capabilities: security, manageability, scalability, reliability, and availability. • Stores a variety of data in its native format without transformation. Can handle structured, semi-structured, and unstructured data.
  • 39. Azure Storage vs. Azure Data Lake • Purpose: Data Lake Store is a hyper-scale repository for big data analytics workloads; Blob Storage is a general-purpose scalable object store. • Use cases: Data Lake Store suits batch, interactive and streaming analytics and machine learning data such as log files, IoT data, click streams and large datasets; Blob Storage suits any type of text or binary data, such as application back ends, backup data, media storage for streaming and general-purpose data. • Structure: Data Lake Store has folders containing data in files (hierarchical file system); Blob Storage has containers containing data in blobs (object store with a flat namespace). • Authentication: Data Lake Store uses Azure AD identity; Blob Storage uses account access keys and shared access signature keys. • Performance: Data Lake Store is optimized for parallel analytical workloads; Blob Storage is not optimized for analytics workloads. • Size limit: Data Lake Store has no limit on file size or number of files; Blob Storage has specific limits as mentioned in its documentation. • Geo-redundancy: Data Lake Store is locally redundant; Blob Storage offers locally redundant, globally redundant and read-access globally redundant options.
  • 40. Azure Data Factory • A cloud data integration service that composes data storage, movement, and processing services into automated data pipelines. • Can handle ETL and complex hybrid ETL. • Allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
  • 42. Azure Event Hub • A highly scalable ingestion system. • Can ingest millions of events per second, enabling an application to process and analyze the massive amounts of data produced by connected devices and applications. • Works as an event ingestor, an intermediary between event publishers and event consumers. • Decouples the production of event streams from the mechanism that consumes them. • Enables behavior tracking in mobile apps, traffic information from web farms, in-game event capture in console games, or telemetry collected from industrial machines, connected vehicles, or other devices.
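The decoupling of publishers from consumers described on the Event Hub slide can be sketched in plain Python. This is only an illustration under stated assumptions: `EventBuffer`, `publish`, and `read_from` are hypothetical names, not the Event Hubs SDK, and the buffer is a simple in-memory list.

```python
class EventBuffer:
    """Hypothetical in-memory stand-in for an event ingestor (not the Event Hubs SDK)."""

    def __init__(self):
        self._events = []  # append-only log of published events

    def publish(self, event):
        """Publishers only append; they never wait for a consumer."""
        self._events.append(event)

    def read_from(self, offset, max_events):
        """Return up to max_events starting at offset, plus the next offset to read."""
        batch = self._events[offset:offset + max_events]
        return batch, offset + len(batch)


buffer = EventBuffer()

# A publisher pushes events at its own rate.
for i in range(5):
    buffer.publish({"device": "sensor-1", "reading": 20 + i})

# A consumer pulls later, in chunks, tracking its own position in the stream.
offset = 0
first, offset = buffer.read_from(offset, max_events=3)
second, offset = buffer.read_from(offset, max_events=3)
```

Because the consumer keeps its own offset, it can fall behind or catch up without the publisher changing anything, which is the point of placing an ingestor between the two.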
  • 44. Demo 3.1: Spark cluster on HDInsight
  • 45. Creating a HDInsight Spark Cluster
  • 47. HDInsight Spark Resource Manager • The Resource Manager enables you to control the number of cores and the amount of memory allocated to Spark cluster components and notebooks. • Increasing the resources allocated to the Thrift Server can potentially improve performance with BI tools.
  • 48. Resizing a HDInsight Spark Cluster
  • 49. Notebooks: Jupyter and Zeppelin
  • 50. HDInsight Spark: Jupyter Notebooks • A Jupyter notebook showing a Spark program in Python 2.
  • 51. HDInsight Spark: Zeppelin Notebooks • A Zeppelin notebook must be connected to a Spark cluster to run. • A notebook ‘paragraph’ can be executed by clicking its run icon. • Like Jupyter, Zeppelin enables interactive charts and graphs to be easily included in a notebook. • You can control the visualization using the “Settings” drop-down menu. • A number of charts and graphs are already built into the Zeppelin notebook.
  • 53. Module 5 Apache Spark SQL and Data Frames
  • 54. Overview: Spark SQL • A library built on Spark to support SQL-like operations. • Helps eliminate RDDs from the API for simplicity. • Traditional RDBMS developers can easily transition to Big Data. • Works with structured data that has a schema. • Seamlessly mixes SQL queries with Spark programs. • Supports JDBC. • Can combine data from RDBMS and NoSQL sources.
  • 55. Data Frames • Spark Session: • The entry point for Data Frames, like SparkContext for RDDs. • Gives Data Frames and temp tables. • Data Frames: • RDDs are for Spark Core while Data Frames are for Spark SQL. • Built upon RDDs. • A distributed collection of data organized as rows and columns. • Has a schema with column names and column types. • Interoperable with collections, CSV, databases, Hive/NoSQL tables, JSON, RDDs, etc.
  • 56. Operations on Data Frames • filter: like a ‘where’ clause. • join: like joins in SQL. • groupby: groups rows so they can be consolidated. • agg: computes aggregations such as sum and average. • Mapping and reducing are allowed. • Operations can be nested.
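What filter, join, groupby, and agg each do can be sketched on ordinary Python lists of dictionaries. This is a plain-Python stand-in for the semantics only, not Spark code; the `orders` and `customers` rows and their field names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; in Spark these would live in two Data Frames.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 250.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
    {"order_id": 3, "customer_id": 10, "amount": 120.0},
    {"order_id": 4, "customer_id": 12, "amount": 75.0},
]
customers = [
    {"customer_id": 10, "city": "Pune"},
    {"customer_id": 11, "city": "Mumbai"},
    {"customer_id": 12, "city": "Pune"},
]

# filter: keep rows matching a predicate, like a SQL WHERE clause.
large_orders = [o for o in orders if o["amount"] >= 50.0]

# join: inner join on customer_id, like a SQL JOIN.
city_by_customer = {c["customer_id"]: c["city"] for c in customers}
joined = [
    {**o, "city": city_by_customer[o["customer_id"]]}
    for o in large_orders
    if o["customer_id"] in city_by_customer
]

# groupby + agg: total and average amount per city.
amounts_by_city = defaultdict(list)
for row in joined:
    amounts_by_city[row["city"]].append(row["amount"])

summary = {
    city: {"total": sum(vals), "average": sum(vals) / len(vals)}
    for city, vals in amounts_by_city.items()
}
```

In PySpark the same steps would be chained on a Data Frame with its `filter`, `join`, `groupBy`, and `agg` methods, and the work would be distributed across the cluster instead of running in one process.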
  • 57. Spark SQL • Not intended for interactive/exploratory analysis. • Reuses the Hive frontend and metastore. • Gives full compatibility with existing Hive data, queries, and UDFs. • Includes a cost-based optimizer, columnar storage, and code generation to make queries fast. • Scales to thousands of nodes and multi-hour queries using the Spark engine. Performance is its biggest advantage. • Provides full mid-query fault tolerance.
  • 59. What is Streaming? [Diagram: sources (devices, sensors, web servers) feed input adapters; the streaming engine runs standing queries (query logic) in the application at runtime; output adapters deliver results to targets (pagers and monitoring devices, KPI dashboards).]
  • 61. Why Spark Streaming? • One of the real powers of Spark. • Typically, analytics is performed on data at rest: databases, flat files, historical data, survey data, etc. • Real-time analytics is performed on data the moment it is generated: complex event processing, fraud detection, click stream processing, etc. • What can Spark Streaming do? • Look at data the moment it arrives from the source. • Transform, summarize, and analyze it. • Perform machine learning. • Predict in real time.
  • 62. Spark Streaming and MicroBatch Processing
  • 63. Spark Streaming • Credit card fraud detection with high scalability and parallelism. • Spam filtering. • Network intrusion detection. • Real-time social media analytics. • Click stream analytics. • Stock market analysis. • Advertising analytics.
  • 64. Spark Streaming architecture [Diagram: a master node (driver program with Spark Context and Stream Context), a cluster manager, and worker nodes. One worker's executor runs a long task, the receiver, which is attached to the input source; the other workers' executors run tasks and hold a cache.]
  • 65. Spark Streaming architecture • The master node runs the driver program with its Spark Context. • A streaming context is created from the Spark Context. • One of the worker nodes is assigned a long-running task that listens to a source. • The receiver keeps receiving data from the input source. • The receiver propagates the data to the worker nodes. • The normal tasks on the worker nodes act upon the data.
  • 66. The DStream • Discretized stream. • Created from the streaming context. • A micro-batch window is set up for the DStream (normally in seconds). • The micro-batch window is a small time slice (around 3 sec) in which the real-time data generated is accumulated as a batch and wrapped in an RDD; the sequence of these RDDs is the DStream. • The DStream allows all RDD operations. • Common data can be shared through global variables.
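The micro-batch idea on the DStream slide, accumulating whatever arrives in each small time slice into one batch, can be sketched in plain Python. This is a simplified illustration, not Spark Streaming code: `micro_batches` is a hypothetical helper, timestamps are assumed to be seconds since the stream started, and the 3-second interval follows the slide's example.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    if not batches:
        return []
    # Emit every interval in order, including intervals with no events.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


# Hypothetical events: (seconds since stream start, payload).
events = [(0.5, "a"), (1.2, "b"), (7.9, "c"), (8.0, "d"), (9.4, "e")]
batches = micro_batches(events, batch_seconds=3)
```

Each inner list plays the role of one RDD in the DStream. An interval with no events still produces an (empty) batch, which mirrors how a stream keeps ticking even when the source is quiet.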
  • 67. The DStream windowing functions • They are for computing across multiple micro-batches of a DStream. • All RDD functions are applicable to the data accumulated from the last X batches. • Examples: accumulate the last 3 batches together; average some value over the last 5 batches.
  • 68. Windowing in Spark [Figure: a timeline of seconds 1 to 12, marked with a first 8-second window and a second 8-second window.]
  • 69. Windowing Sample Code [Code screenshot with callouts: window operation, callback operation, window length, sliding interval.]
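The roles of the window length and the sliding interval named on the windowing slides can be sketched in plain Python. This is not the code from the slide, which did not survive in the transcript, and it is not the Spark API: `sliding_windows` is a hypothetical helper, each batch is assumed to cover one second, and the 4-second slide is an assumed value chosen so that two 8-second windows fit into 12 seconds.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Return the flattened contents of each window of `window_length` batches,
    advancing by `slide_interval` batches each time."""
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        windows.append([value for batch in window for value in batch])
    return windows


# One hypothetical reading per 1-second batch, seconds 1 to 12.
batches = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]]

# An 8-batch window that slides forward 4 batches at a time.
windows = sliding_windows(batches, window_length=8, slide_interval=4)

# The "callback operation" applied to each window: here, an average.
averages = [sum(w) / len(w) for w in windows]
```

The first window covers seconds 1 to 8 and the second covers seconds 5 to 12, so consecutive windows overlap whenever the sliding interval is shorter than the window length. In PySpark the comparable call is the DStream `window` method, which takes a window duration and a slide duration.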
  • 70. “Exactly Once”, “At Least Once” Guarantees
  • 71. Streaming with Azure Event Hub [Diagram: Azure Event Hub → HDInsight Spark Streaming → Power BI.]
  • 72. Module 7 Apache Spark : Analytics and Machine Learning
  • 73. Types of Analytics • Descriptive analytics: defining the problem statement; what exactly happened. • Exploratory data analytics: why something is happening. • Inferential analytics: understanding a population from a sample; take a sample and extrapolate to the whole population. • Predictive analytics: forecasting what will happen. • Causal analytics: how variables are related; understanding the effect of a change in one variable on another. • Deep analytics: uses multi-source data, combining some or all of the above analytics.
  • 74. Data Analytics is needed everywhere • Recommendation engines • Smart meter monitoring • Equipment monitoring • Advertising analysis • Life sciences research • Fraud detection • Healthcare outcomes • Weather forecasting for business planning • Oil & gas exploration • Social network analysis • Churn analysis • Traffic flow optimization • IT infrastructure & web app optimization • Legal discovery and document archiving • Intelligence gathering • Location-based tracking & services • Pricing analysis • Personalized insurance
  • 75. Machine Learning in Spark • Makes ML easy. • Standard and common interface for different ML algorithms. • Contains algorithms and utilities. • It has two machine learning libraries: • spark.mllib: the original API built on RDDs. May be deprecated soon. • spark.ml: the new higher-level API built on Data Frames. • The machine learning algorithms of Spark use these data types: • Local Vector • Labeled Point • All data submitted to ML must be converted to these data types.
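The conversion step on the machine learning slide, turning each raw record into a label plus a numeric feature vector, can be sketched in plain Python. The `LabeledPoint` namedtuple below is a hypothetical stand-in that only mimics the shape of the data, and the record fields (`churned`, `age`, `monthly_spend`, `tickets`) are invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a labeled point: a numeric label plus a dense local vector.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])


def to_labeled_point(record, label_field, feature_fields):
    """Convert a raw record (dict) into a label and a tuple of float features."""
    features = tuple(float(record[name]) for name in feature_fields)
    return LabeledPoint(label=float(record[label_field]), features=features)


# A raw record with mixed types, as it might arrive from a CSV or JSON source.
raw = {"churned": 1, "age": 42, "monthly_spend": "59.90", "tickets": 3}
point = to_labeled_point(raw, "churned", ["age", "monthly_spend", "tickets"])
```

In the RDD-based API the corresponding types are `pyspark.mllib.linalg.Vectors.dense` for the local vector and `pyspark.mllib.regression.LabeledPoint` for the labeled point; the point of the sketch is only that everything, including the label, ends up numeric before training.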
  • 76. Other algorithms supported • Decision Tree • Dimensionality Reduction • Random Forest • Linear Regression • Naïve Bayes Classification • K-Means Clustering • Recommendation Engines • ... and many more
  • 77. Q & A Contact: chandrashekhardeshpande@synergetics-india.com, maheshshinde@synergetics-india.com
  • 78. References • Online reference: • http://spark.apache.org/docs/latest/index.html • http://spark.apache.org/docs/latest/programming-guide.html • http://spark.apache.org/docs/latest/api/java/index.html

Editor's Notes

  1. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing
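  2. Logistic regression in PySpark:
       from math import exp
       import numpy

       # `spark` is assumed to be the SparkContext; parsePoint, D and ITERATIONS are defined elsewhere.
       points = spark.textFile(...).map(parsePoint).cache()
       w = numpy.random.ranf(size=D)  # current separating plane
       for i in range(ITERATIONS):
           # Sum the per-point gradients of the logistic loss across the cluster.
           gradient = points.map(
               lambda p: (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y * p.x
           ).reduce(lambda a, b: a + b)
           w -= gradient
       print("Final separating plane: %s" % w)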
  3. The property graph is a directed multigraph, which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices. Typical uses: calculating the shortest path between two airports, or the cheapest travel between two stations.
  5. Master node: Driver Program: the program you write to initiate processing. Spark Context: the gateway to all Spark functionality. Worker nodes: they work as instructed by the master node, run executor programs, and are controlled by the cluster manager. The cluster manager may be YARN, Mesos, or the Spark scheduler. How are RDDs created? The Spark Context reads the records from the data source and hands them over to the cluster manager, which partitions them and distributes them to the different worker nodes. How is a transformation applied to RDDs? The Spark Context delegates the transformation (a job) through the cluster manager to the executors, where it runs as tasks. Executors may create new RDDs as the outcome of executing a transformation. All these RDDs are collected in the master's Spark Context.
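The flow in the note above (records are partitioned, each partition is transformed by a task, and the results are gathered back) can be sketched in plain Python. This is a single-process illustration, not Spark itself: `partition` and `run_task` are hypothetical helpers, round-robin partitioning is an assumed strategy, and the "executors" are ordinary sequential function calls.

```python
def partition(records, num_partitions):
    """Split records into num_partitions roughly equal chunks (round-robin)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def run_task(partition_records, transformation):
    """One task: apply the transformation to every record in one partition."""
    return [transformation(r) for r in partition_records]


records = list(range(1, 11))
partitions = partition(records, num_partitions=3)

# Each partition would be processed by an executor on some worker node;
# here the "executors" are just sequential function calls.
transformed = [run_task(p, lambda x: x * x) for p in partitions]

# Collecting gathers the per-partition results back at the driver.
collected = sorted(value for part in transformed for value in part)
```

The same transformation runs independently on every partition, which is what lets the work spread across worker nodes; only the final gathering step needs all results in one place.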
  10. Credit card fraud detection: Every time a credit card is swiped, the system must analyze the transaction for abnormalities within a fraction of a second, so that fraudulent use can be blocked. Spam filtering: Many mails may hit a mailbox per unit of time, so some parameters must be applied to decide whether each one is spam. Social media analytics: The data arising from social media needs to be analyzed in real time to arrive quickly at urgent conclusions. Network intrusion detection: Prevents hacking into a system by quickly analyzing web logs and system logs. Click stream analysis: As an internet user clicks or browses through web pages, the analytics system needs to give recommendations. Stock market analysis: Very frequent fluctuations in the stock market call for analysis to anticipate movements and draw conclusions. Advertising analysis: When a search string is given, the system needs to quickly show different advertisements related to that string.