© 2015 IBM Corporation
IBM Analytics for Apache Spark
Rich Tarro & Virender Thakur
IBM Big Data Technical Specialists
WeWork (South Station)
745 Atlantic Ave, Boston, MA
November 19, 2015
What is Spark
Analytics and development today: complex | disparate | limited
Analytics and development on Spark: flexible | unified | unlimited
What Spark isn’t
 A data store – Spark attaches to other data stores but does not
provide its own
 Only for Hadoop – Spark can work with Hadoop (especially
HDFS), but Spark is a separate, standalone system
 Only for machine learning – Spark includes machine learning
and does it very well, but it can handle much broader tasks
equally well
 A replacement for Real Time Streaming Applications – Spark
Streaming employs micro-batching, not true streaming
Rapid platform evolution and adoption of Spark
Spark is one of the most active
open source projects
Figures: Interest over time (Google Trends); Job Trends
Spark Capabilities
Log processing: TBD
Stream Processing (Spark Streaming): Near real-time data processing & analytics
• Micro-batch event processing for near real-time analytics
• Process live streams of data (IoT, Twitter, Kafka)
• No multi-threading or parallel processing required
Machine Learning (MLlib): Incredibly fast, easy to deploy algorithms
• Predictive and prescriptive analytics, and smart application design, from statistical and algorithmic models
• Algorithms are pre-built
Unified Data Access (Spark SQL): Fast, familiar query language for all data
• Query your structured data sets with SQL or other DataFrame APIs
• Data mining, BI, and insight discovery
• Get results faster due to performance
Graph Analytics (GraphX): Fast and integrated graph computation
• Represent data in a graph
• Represent and analyze systems made up of nodes and the interconnections between them
• Transportation, person-to-person relationships, etc.
Stack: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)
Spark adds value to any data source
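Stack: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)
A large variety of data sources and formats can be supported, both on-premises and in the cloud:
BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB, …many others
Locations: IBM Cloud, other cloud, cloud apps, on-premises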
Spark vs. Hadoop
How is Spark SIMILAR to Hadoop?
 Similar divide-and-conquer architecture of breaking large jobs into smaller pieces
 General data processing platform suitable for batch analysis
 Can coexist within existing Hadoop environments and use Hadoop components such as HDFS
 Open source with extensive community support
How is Spark DIFFERENT from Hadoop?
 In-memory architecture vs. file-based for Hadoop, generates up to 100x speed improvements
 Faster speed enables new use cases such as interactive or iterative analysis
 Simpler programming model, up to 5x less code
 Multiple programming languages supported, vs. primarily Java for Hadoop MapReduce
 Single modular platform enables extension via libraries, not separate applications
 Specialized machine learning algorithms available
Key reasons for interest in Spark
Performance
 In-memory architecture greatly reduces disk I/O
 Anywhere from 20-100x faster for common tasks
Productive
 Enables various analytic methods that can process data from a multitude of sources
 Simple but powerful syntax, especially compared to prior approaches (up to 5x less code)
 Universal programming model across a range of use cases and steps in the data lifecycle
 Integrated with common programming languages – Java, Python, Scala
 New tools continually reduce the skill barrier for access (e.g. SQL for analysts)
Leverages existing investments
 Works well within the existing Hadoop ecosystem
 Allows customers to extract value out of existing cloud and on-premises systems
Improves with age
 Large and growing community of contributors continuously improves the full analytics stack and extends capabilities
Common Spark use cases
 Interactive querying of very large data sets (e.g. BI)
 Running large data processing batch jobs (e.g. nightly ETL from production
systems, primary Hadoop use case)
 Complex analytics and data mining across various types of data
 Building and deploying rich analytics models (e.g. risk metrics)
 Implementing near-real time stream event processing (e.g. fraud / security
Resilient Distributed Dataset (RDD): Spark’s basic unit of data
 An RDD is a distributed collection of Scala/Python/Java objects of the same type:
 RDD of strings, integers, (key, value) pairs, or instances of Java/Python/Scala classes
 An RDD is physically distributed across the cluster, but manipulated as one logical entity.
 Suppose we want to know the number of names in the RDD “Names”
 User simply requests: Names.count()
 Spark will “distribute” count processing to all partitions so as to obtain:
• Partition 1: Mokhtar (1), Jacques (1), Dirk (1) → 3
• Partition 2: Cindy (1), Dan (1), Susan (1) → 3
• Partition 3: Dirk (1), Frank (1), Jacques (1) → 3
 Local counts are subsequently aggregated: 3+3+3=9
 To look up the first element in the RDD: Names.first()
 To display all elements of the RDD: Names.collect() (careful with this: it brings every element back to the driver program)
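The count walk-through for the Names RDD can be mimicked in a few lines of plain Python. This is only a sketch of the idea and does not use the Spark API: the `partitions` list is a hypothetical stand-in for the three partitions, and everything runs in a single process rather than on a cluster.

```python
# Plain-Python stand-in for the three partitions of the "Names" RDD
# (hypothetical layout; a real RDD is spread across cluster nodes).
partitions = [
    ["Mokhtar", "Jacques", "Dirk"],   # Partition 1
    ["Cindy", "Dan", "Susan"],        # Partition 2
    ["Dirk", "Frank", "Jacques"],     # Partition 3
]

# Step 1: each partition computes its own local count.
local_counts = [len(p) for p in partitions]

# Step 2: the local counts are aggregated into the final result.
total = sum(local_counts)

print(local_counts)  # [3, 3, 3]
print(total)         # 9
```

In Spark itself the user only writes Names.count(); the split into local counts and a final aggregation happens behind that single call.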
Resilient Distributed Datasets: Creation and Manipulation
 Three methods for creation
 Distributing a collection of objects from the driver program (using the parallelize method of the Spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
 Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
 Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x => x+1)
 Dataset from any storage supported by Hadoop
 HDFS, Cassandra, HBase, Amazon S3, dashDB, Cloudant
 Others
 File types supported
 Text files
 SequenceFiles
 Hadoop InputFormat
Resilient Distributed Datasets: Properties
 Two types of operations
 Transformations ~ DDL (Create View V2 as…)
• val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map(x => x+1): Numbers from 2 to 11
• The LINEAGE on how to obtain rddNumbers2 from rddNumbers is recorded in a DAG (Directed Acyclic Graph)
• No actual data processing takes place → lazy evaluation
 Actions ~ DML (Select * From V2…)
• rddNumbers2.collect(): Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
• Performs the recorded transformations and the action
• Returns a value (or writes to a file)
 Immutable
 Fault tolerance
 If data in memory is lost it will be recreated from lineage
 Caching, persistence (memory, spilling, disk) and check-pointing
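The interplay of lazy transformations, recorded lineage and actions can be illustrated with a toy class in plain Python. This is a hedged sketch, not how Spark is implemented: `LazyDataset` and `add_one` are hypothetical names invented for the illustration, and the "lineage" here is just a tuple of functions replayed on demand.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = list(source)      # base data (never modified)
        self._lineage = tuple(lineage)   # recorded transformations, in order

    def map(self, func):
        # Transformation: return a NEW dataset with func appended to the
        # lineage. No element is processed here (lazy evaluation).
        return LazyDataset(self._source, self._lineage + (func,))

    def collect(self):
        # Action: replay the recorded lineage over the base data.
        result = self._source
        for func in self._lineage:
            result = [func(x) for x in result]
        return result


calls = []

def add_one(x):
    calls.append(x)   # track when the function actually runs
    return x + 1

numbers = LazyDataset(range(1, 11))   # numbers from 1 to 10
numbers2 = numbers.map(add_one)       # nothing is computed yet

print(len(calls))          # 0
print(numbers2.collect())  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(numbers.collect())   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because the original dataset is never modified and the lineage is kept, a lost result can always be rebuilt by replaying the recorded functions, which is the idea behind recreating lost data from lineage.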
Spark Application Architecture
A Spark application is initiated from a driver program
Spark execution modes:
 Standalone with the built-in cluster manager
 Use Mesos as the cluster manager
 Use YARN as the cluster manager
 Standalone cluster on Amazon EC2
DataFrames in Spark
• Makes Spark programs simpler and easier to develop and understand
• Distributed collection of data organized into named columns
• APIs in Python, Java, Scala and R (via SparkR)
• Automatically optimized
Spark with Data Frames
Spark SQL
 Spark module for structured data
 DataFrames (called SchemaRDDs before Spark 1.3) provide a single interface for efficiently working with structured data, including Apache Hive, Parquet and JSON files
 Leverages Hive frontend and metastore
 Compatibility with Hive data, queries
and UDFs
 HiveQL limitations may apply
 Not ANSI SQL compliant
 Little to no query rewrite optimization,
automatic memory management or
sophisticated workload management
 Standard connectivity through JDBC/ODBC
FYI: Some RDD Transformations
Transformations are lazily evaluated
Each returns a pointer to the transformed RDD
Transformation: Meaning
map(func): Return a new dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
join(otherDataset): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
sortByKey([ascending]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
Full documentation at
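To make the key/value rows of the table more concrete, the sketch below reproduces in plain Python what the reduceByKey, join and sortByKey transformations return. It is not Spark code: `reduce_by_key`, `join` and `sort_by_key` are hypothetical helper names, the sample pairs are made up, and everything runs eagerly on local lists instead of lazily on a cluster.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """(K, V) pairs -> (K, V) pairs, values per key combined with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

def join(left, right):
    """(K, V) and (K, W) pairs -> (K, (V, W)) for every matching pair."""
    return [(k1, (v, w)) for k1, v in left for k2, w in right if k1 == k2]

def sort_by_key(pairs, ascending=True):
    """(K, V) pairs sorted by key."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

sales = [("b", 2), ("a", 1), ("b", 3)]
names = [("a", "apple"), ("b", "banana")]

totals = reduce_by_key(sales, lambda x, y: x + y)
print(sort_by_key(totals))   # [('a', 1), ('b', 5)]
print(join(sales, names))    # [('b', (2, 'banana')), ('a', (1, 'apple')), ('b', (3, 'banana'))]
```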
FYI: Some RDD Actions
Actions return values
Action: Meaning
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset.
take(n): Return an array with the first n elements of the dataset.
foreach(func): Run a function func on each element of the dataset.
Full documentation at
IBM Announces Major Commitment
to Advance Apache® Spark™
Our commitment to Spark
 Open Source SystemML
 Educate One Million Data Professionals
 Establish Spark Technology Center
 Founding Member of AMPLab
 Contributing to the Core
Our investment to grow skills
Educate 1 Million Data Scientists and Data Engineers
Big Data University MOOC
Spark Fundamentals I and II
Advanced Spark Development series
Foundational Methodology for Data Science
Partnerships with Databricks, AMPLab, DataCamp and MetiStream
Spark Technology Center
Our goal is to be the #1 Spark contributor and adopter
 Inspire the use of Spark to solve business problems
 Encourage adoption through open and free educational assets
 Demonstrate real world solutions to identify opportunities
 Use the learning to improve Spark and its application
The IBM Analytics for Apache Spark offering
What it is:
 Fully-managed Spark environment
accessible on-demand
What you get:
 Access to Spark’s next-generation
performance and capabilities,
including built-in machine learning and
other libraries
 Pay only for what you use in either a
pay-as-you-go model or through
dedicated, enterprise instances
 No lock-in – 100% standard Spark
runs on any standard distribution
 Elastic scaling – start with
experimentation, extend to
development and scale to production,
all within the same environment
 Quick start – service is immediately
ready for analysis, skipping setup
hurdles, hassles and time
 Peace of mind – fully managed and
secured, no DBAs or other admins
Delivered as a service in an IBM hosted, managed, secure environment
Jupyter notebook
 Browser-based document that supports code, text,
interactive visualization, math, and media.
 Interactive, iterative, and collaborative work environments
for programming and analytics
 Living documents that are very easy to use by both
technical and LOB users
 Can take you from a concept to deploying an application in
a single environment.
Spark RDDs, Data Frames and Spark SQL Demo
NFL 2014
Regular Season
Game Statistics
Demo Flow
Load data
 Configure access to object storage
Parse data
 Split CSV file lines by commas
Explore data using only RDDs
 Select only WR data
 Compute average WR receiving yards per game per team
 List top 10 teams in descending order and plot results
Explore data using Data Frames
 List top 10 teams in descending order
Explore data using Spark SQL
 List top 10 teams in descending order
Compute average WR receiving yards per game per team (RDDs)
Data contains ‘Team’, ‘Position’ and ‘ReceivingYards’
Select only rows with WRs
 Position = ‘WR’
Create (key/value) pair RDD (or tuple) for each row
 (Team, Receiving Yards)
Transform this tuple’s value into a sequence of values consisting of Receiving Yards and the value “1”
 (Team, (Receiving Yards, 1))
Reduce by key
 (Team, (Total Receiving Yards for Team, Sum of “1”s))
Compute average for each Team
 Total Receiving Yards for Team / Sum of “1”s
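The steps above can be traced in plain Python before moving to Spark. This is a sketch under stated assumptions, not the demo notebook's code: the `rows` sample data and the team abbreviations are invented for illustration, and the reduce-by-key step is emulated with a dictionary on one machine.

```python
from collections import defaultdict

# Hypothetical sample rows: (Team, Position, ReceivingYards).
rows = [
    ("NE", "WR", 80), ("NE", "WR", 40), ("NE", "QB", 0),
    ("DEN", "WR", 100), ("DEN", "WR", 120), ("DEN", "TE", 30),
]

# Select only rows with WRs.
wr_rows = [r for r in rows if r[1] == "WR"]

# Create (Team, (ReceivingYards, 1)) pairs.
pairs = [(team, (yards, 1)) for team, _, yards in wr_rows]

# Reduce by key: add yards and counts element-wise per team.
totals = defaultdict(lambda: (0, 0))
for team, (yards, one) in pairs:
    total_yards, count = totals[team]
    totals[team] = (total_yards + yards, count + one)

# Compute the average for each team and list the top 10 in descending order.
averages = {team: total / count for team, (total, count) in totals.items()}
top = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)  # [('DEN', 110.0), ('NE', 60.0)]
```

Carrying the “1” alongside the yards is what lets a single reduce produce both the sum and the row count needed for the average.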
Spark Streaming Twitter Demo
Spark RDD and Spark SQL Demos
NFL 2014
Regular Season
Game Statistics
Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live data
 Write Spark streaming applications like Spark applications
 Recovers lost work and operator state (sliding windows) out-of-the-box
 Uses HDFS and Zookeeper for high availability
 Data sources also include TCP sockets, ZeroMQ or other customized
data sources
Spark Streaming - Internals
 The input stream goes into Spark Streaming
 Spark Streaming breaks it up into batches of input data
 The batches are fed into the Spark engine for processing
 The engine generates the final results as a stream of batches
 DStream - Discretized Stream
 Represents a continuous stream of data created
from the input streams
 Internally, represented as a sequence of RDDs
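The micro-batching idea behind a DStream can be sketched in plain Python. This is an illustration only and does not use the Spark Streaming API: `discretize` is a hypothetical helper, the `stream` events are made up, timestamps are assumed to be non-negative seconds, and each resulting list plays the role of one RDD in the sequence.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Batch i holds the values whose timestamp t satisfies
    i * batch_interval <= t < (i + 1) * batch_interval.
    """
    if not events:
        return []
    last_batch = int(max(t for t, _ in events) // batch_interval)
    batches = [[] for _ in range(last_batch + 1)]
    for t, value in events:
        batches[int(t // batch_interval)].append(value)
    return batches

# Hypothetical input stream: (arrival time in seconds, line of text).
stream = [(0.2, "hello world"), (0.7, "hello spark"), (1.4, "spark streaming")]

# Each batch is processed with ordinary batch logic (here: a word count).
for i, batch in enumerate(discretize(stream, batch_interval=1)):
    counts = {}
    for line in batch:
        for word in line.split(" "):
            counts[word] = counts.get(word, 0) + 1
    print(i, counts)
# 0 {'hello': 2, 'world': 1, 'spark': 1}
# 1 {'spark': 1, 'streaming': 1}
```

Results arrive once per batch interval rather than once per event, which is why the deck describes Spark Streaming as near real-time rather than true streaming.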
Spark Streaming – Getting Started
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with two worker threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
Spark Streaming – Getting Started (continued)
# Start the computation
ssc.start()
# Wait for the computation to terminate
ssc.awaitTermination()
$ ./bin/spark-submit examples/src/main/python/streaming/
localhost 9999
Time: 2014-10-14 15:25:21
SparkR
 SparkR is an R package that provides a light-weight front end for using Apache Spark from R
 SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster
 Goal
 Make SparkR production ready
 Integration with MLlib
 Consolidations to the data frame and RDD
SparkR Demo
Spark MLlib
 Spark MLlib for machine learning
 Marked as under active development
 Provides common algorithms and utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
 Leverages iteration and yields better results than one-pass approximations used with MapReduce
Spark GraphX
 Flexible Graphing
GraphX unifies ETL, exploratory analysis, and
iterative graph computation
You can view the same data as both graphs and
collections, transform and join graphs with RDDs
efficiently, and write custom iterative graph
algorithms with the API
 Speed
Comparable performance to the fastest
specialized graph processing systems.
 Algorithms
Choose from a growing library of graph algorithms
In addition to a highly flexible API, GraphX comes
with a variety of graph algorithms
The “Learning Spark” O’Reilly book
Apache Spark at Bluemix
Workbench – Data Scientist Workbench
The following course on big data university
Boston Spark Meetup event Slides Update

  • 1. © 2015 IBM Corporation IBM Analytics for Apache Spark Rich Tarro & Virender Thakur IBM Big Data Technical Specialists WeWork (South Station) 745 Atlantic Ave, Boston, MA November 19, 2015
  • 2. © 2015 IBM Corporation2 Analytics and Development Today …. on Spark complex | disparate | limited flexible | unified | unlimited What is Spark
  • 3. © 2015 IBM Corporation3 What Spark isn’t  A data store – Spark attaches to other data stores but does not provide its own  Only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a separate, standalone system  Only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well  A replacement for Real Time Streaming Applications – Spark Streaming employs micro-batching, not true streaming
  • 4. © 2015 IBM Corporation4 Rapid platform evolution and adoption of Spark Spark is one of the most active open source projects Interest over time (Google Trends) Source: Job Trends (
  • 5. © 2015 IBM Corporation5 Spark Capabilities Log processing TBD Graph Analytics Fast and integrated graph computation Stream Processing Near real-time data processing & analytics Machine Learning Incredibly fast, easy to deploy algorithms Unified Data Access Fast, familiar query language for all data • Micro-batch event processing for near real-time analytics • Process live streams of data (IoT, Twitter, Kafka) • No multi-threading or parallel processing required • Predictive and prescriptive analytics, and smart application design, from statistical and algorithmic models • Algorithms are pre-built • Query your structured data sets with SQL or other dataframe APIs • Data mining, BI, and insight discovery • Get results faster due to performance • Represent data in a graph • Represent/analyze systems represented by nodes and interconnections between them • Transportation, person to person relationships, etc. SparkCore Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph)
  • 6. Spark adds value to any data source
      • Components: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)
      • A large variety of data sources and formats can be supported, both on-premise and in the cloud
      • Example sources: BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB, …many others
      • Environments: IBM Cloud, other cloud, cloud apps, on-premise
  • 7. Spark vs. Hadoop
      • How is Spark SIMILAR to Hadoop?
          • Similar divide-and-conquer architecture of breaking large jobs into smaller pieces
          • General data processing platform suitable for batch analysis
          • Can coexist within existing Hadoop environments and use Hadoop components such as HDFS
          • Open source with extensive community support
      • How is Spark DIFFERENT from Hadoop?
          • In-memory architecture vs. file-based for Hadoop, which generates up to 100x speed improvements
          • Faster speed enables new use cases such as interactive or iterative analysis
          • Simpler programming model, up to 5x less code
          • Multiple programming languages supported, vs. only Java for Hadoop
          • Single modular platform enables extension via libraries, not separate applications
          • Specialized machine learning algorithms available
  • 8. Key reasons for interest in Spark
      • Performance
          • In-memory architecture greatly reduces disk I/O
          • Anywhere from 20-100x faster for common tasks
      • Productive
          • Enables various analytic methods that can process data from a multitude of sources
          • Simple but powerful syntax, especially compared to prior approaches (up to 5x less code)
          • Universal programming model across a range of use cases and steps in the data lifecycle
          • Integrated with common programming languages – Java, Python, Scala
          • New tools continually reduce the skill barrier for access (e.g. SQL for analysts)
      • Leverages existing investments
          • Works well within the existing Hadoop ecosystem
          • Allows customers to extract value out of existing cloud and on-premise systems
      • Improves with age
          • A large and growing community of contributors continuously improves the full analytics stack and extends its capabilities
  • 9. Common Spark use cases
      1. Interactive querying of very large data sets (e.g. BI)
      2. Running large data processing batch jobs (e.g. nightly ETL from production systems, the primary Hadoop use case)
      3. Complex analytics and data mining across various types of data
      4. Building and deploying rich analytics models (e.g. risk metrics)
      5. Implementing near-real-time stream event processing (e.g. fraud / security detection)
  • 10. Resilient Distributed Dataset (Spark’s basic unit of data)
      • An RDD is a distributed collection of Scala/Python/Java objects of the same type: strings, integers, (key, value) pairs, or instances of Java/Python/Scala classes
      • An RDD is physically distributed across the cluster, but manipulated as one logical entity
      • Suppose we want to know the number of names in the RDD “Names”, which is stored in three partitions
          • The user simply requests: Names.count()
          • Spark distributes the count processing to all partitions, so each partition produces a local count:
              • Partition 1: Mokhtar (1), Jacques (1), Dirk (1) → 3
              • Partition 2: Cindy (1), Dan (1), Susan (1) → 3
              • Partition 3: Dirk (1), Frank (1), Jacques (1) → 3
          • Local counts are subsequently aggregated: 3 + 3 + 3 = 9
      • To look up the first element in the RDD: Names.first()
      • To display all elements of the RDD: Names.collect() (careful: this returns every element to the driver program)
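The count example on this slide can be sketched in plain Python. This is only an illustration of the per-partition count followed by aggregation, not the Spark API: the `local_count` and `distributed_count` helpers are hypothetical names, and the partitions are ordinary lists holding the names from the slide.

```python
# Plain-Python sketch of a distributed count; this is NOT the Spark API.
# Each inner list stands in for one partition of the "Names" RDD.
partitions = [
    ["Mokhtar", "Jacques", "Dirk"],   # Partition 1
    ["Cindy", "Dan", "Susan"],        # Partition 2
    ["Dirk", "Frank", "Jacques"],     # Partition 3
]


def local_count(partition):
    """Count the elements of one partition (the work done where the data lives)."""
    return len(partition)


def distributed_count(parts):
    """Aggregate the per-partition counts (the work done at the driver)."""
    return sum(local_count(p) for p in parts)


local_counts = [local_count(p) for p in partitions]
total = distributed_count(partitions)
print(local_counts, total)
```

Only the small per-partition counts travel to the driver here, which is why `count()` is cheap while `collect()` (which moves every element) needs care.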
  • 11. Resilient Distributed Datasets: Creation and Manipulation
      • Three methods for creation
          • Distributing a collection of objects from the driver program (using the parallelize method of the Spark context)
              val rddNumbers = sc.parallelize(1 to 10)
              val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
          • Loading an external dataset (file)
              val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
          • Transformation from another existing RDD
              val rddNumbers2 = rddNumbers.map(x => x + 1)
      • Dataset from any storage supported by Hadoop
          • HDFS, Cassandra, HBase, Amazon S3, dashDB, Cloudant
          • Others
      • File types supported
          • Text files
          • SequenceFiles
          • Hadoop InputFormat
  • 12. Resilient Distributed Datasets: Properties
      • Two types of operations
          • Transformations ~ DDL (Create View V2 as…)
              • val rddNumbers = sc.parallelize(1 to 10): numbers from 1 to 10
              • val rddNumbers2 = rddNumbers.map(x => x + 1): numbers from 2 to 11
              • The lineage describing how to obtain rddNumbers2 from rddNumbers is recorded in a DAG (Directed Acyclic Graph)
              • No actual data processing takes place → lazy evaluation
          • Actions ~ DML (Select * From V2…)
              • rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
              • Performs the transformations and the action
              • Returns a value (or writes to a file)
      • Immutable
      • Fault tolerance
          • If data in memory is lost, it is recreated from lineage
      • Caching, persistence (memory, spilling, disk) and check-pointing
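The split between lazy transformations and actions can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this illustration, not Spark's RDD implementation; it only shows the idea that a transformation records lineage and that work happens when an action runs.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records lineage, computes on action."""

    def __init__(self, source, lineage=()):
        self._source = list(source)      # base data
        self._lineage = tuple(lineage)   # recorded transformations, in order

    def map(self, func):
        # Transformation: returns a NEW dataset and computes nothing yet.
        return LazyDataset(self._source, self._lineage + (func,))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        result = self._source
        for func in self._lineage:
            result = [func(x) for x in result]
        return result


calls = []  # records when the mapped function is really invoked


def add_one(x):
    calls.append(x)
    return x + 1


numbers = LazyDataset(range(1, 11))     # numbers from 1 to 10
numbers2 = numbers.map(add_one)         # only lineage is recorded here
calls_before_action = len(calls)        # still zero: nothing has run
collected = numbers2.collect()          # the action triggers the computation
print(calls_before_action, collected)
```

Because `numbers` is never modified and `numbers2` keeps its lineage, the result can be recomputed from the source at any time, which is the same idea the slide uses to explain fault tolerance.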
  • 13. Spark Application Architecture
      • A Spark application is initiated from a driver program
      • Spark execution modes:
          • Standalone with the built-in cluster manager
          • Use Mesos as the cluster manager
          • Use YARN as the cluster manager
          • Standalone cluster on Amazon EC2
  • 14. DataFrames in Spark
      • Make Spark programs simpler and easier to develop and understand
      • Distributed collection of data organized into named columns
      • APIs in Python, Java, Scala and R (via Spark R)
      • Automatically optimized
  • 15. Spark with Data Frames
  • 16. Spark SQL
      • Spark module for structured data
      • SchemaRDDs provide a single interface for efficiently working with structured data, including Apache Hive, Parquet and JSON files
      • Leverages the Hive frontend and metastore
          • Compatibility with Hive data, queries and UDFs
          • HiveQL limitations may apply
          • Not ANSI SQL compliant
          • Little to no query rewrite optimization, automatic memory management or sophisticated workload management
      • Standard connectivity through JDBC/ODBC
  • 17. FYI: Some RDD Transformations
      • Transformations are lazily evaluated
      • Each returns a pointer to the transformed RDD
      • Transformation: Meaning
          • map(func): Return a new dataset formed by passing each element of the source through a function func.
          • filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
          • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
          • join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
          • reduceByKey(func): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
          • sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
      • Full documentation at
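The semantics of flatMap and join in the table above can be illustrated on ordinary Python lists. The `flat_map` and `join` helpers below are hypothetical plain-Python stand-ins written for this sketch; they mimic what the transformations return, not how Spark executes them across a cluster.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the results (0 or more outputs per input)."""
    return [out for item in items for out in func(item)]


def join(left, right):
    """Join (K, V) pairs with (K, W) pairs into (K, (V, W)) pairs for matching keys."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


# One input line can produce several output words.
words = flat_map(lambda line: line.split(" "), ["a b", "c"])

# Key "x" appears twice on the right, so it yields two joined pairs;
# key "y" has no match on the right and is dropped.
joined = join([("x", 1), ("y", 2)], [("x", "p"), ("x", "q")])

print(words)
print(joined)
```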
  • 18. FYI: Some RDD Actions
      • Actions return values
      • Action: Meaning
          • collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data.
          • count(): Return the number of elements in a dataset.
          • first(): Return the first element of the dataset.
          • take(n): Return an array with the first n elements of the dataset.
          • foreach(func): Run a function func on each element of the dataset.
      • Full documentation at
  • 19. IBM Announces Major Commitment to Advance Apache® Spark™
  • 20. Our commitment to Spark
      • Announcing
          • Open Source SystemML
          • Educate One Million Data Professionals
          • Establish Spark Technology Center
          • Founding Member of AMPLab
          • Contributing to the Core
  • 21. Our investment to grow skills
      • Educate 1 Million Data Scientists and Data Engineers
          • Big Data University MOOC
          • Spark Fundamentals I and II
          • Advanced Spark Development series
          • Foundational Methodology for Data Science
          • Partnerships with Databricks, AMPLab, DataCamp and MetiStream
  • 22. Spark Technology Center
      • Our goal is to be the #1 Spark contributor and adopter
          • Inspire the use of Spark to solve business problems
          • Encourage adoption through open and free educational assets
          • Demonstrate real-world solutions to identify opportunities
          • Use the learning to improve Spark and its application
  • 23. The IBM Analytics for Apache Spark offering
      • What it is
          • Fully managed Spark environment, accessible on demand
          • Delivered as a service in an IBM-hosted, managed, secure environment
      • What you get
          • Access to Spark’s next-generation performance and capabilities, including built-in machine learning and other libraries
          • Pay only for what you use, in either a pay-as-you-go model or through dedicated, enterprise instances
          • No lock-in – 100% standard Spark runs on any standard distribution
          • Elastic scaling – start with experimentation, extend to development and scale to production, all within the same environment
          • Quick start – the service is immediately ready for analysis, skipping setup hurdles, hassles and time
          • Peace of mind – fully managed and secured, no DBAs or other admins necessary
  • 24. Jupyter notebook
      • Browser-based document that supports code, text, interactive visualization, math, and media
      • Interactive, iterative, and collaborative work environment for programming and analytics
      • Living documents that are very easy to use by both technical and LOB users
      • Can take you from a concept to deploying an application in a single environment
  • 25. Spark RDDs, Data Frames and Spark SQL Demo
      • NFL 2014 Regular Season Player Game Statistics Dataset
  • 26. Demo Flow
      • Load data
          • game/
          • Configure access to object storage
      • Parse data
          • Split CSV file lines by commas
      • Explore data using only RDDs
          • Select only WR data
          • Compute average WR receiving yards per game per team
          • List top 10 teams in descending order and plot results
      • Explore data using Data Frames
          • List top 10 teams in descending order
      • Explore data using Spark SQL
          • List top 10 teams in descending order
  • 27. Compute average WR receiving yards per game per team (RDDs)
      • Data contains ‘Team’, ‘Position’ and ‘ReceivingYards’ columns
      • Select only rows with WRs
          • Position = ‘WR’
      • Create a (key, value) pair RDD (a tuple) for each row
          • (Team, Receiving Yards)
      • Transform each tuple’s value into a pair consisting of Receiving Yards and the value “1”
          • (Team, (Receiving Yards, 1))
      • Reduce by key
          • (Team, (Total Receiving Yards for Team, Sum of “1”s))
      • Compute the average for each Team
          • Total Receiving Yards for Team / Sum of “1”s
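The steps on this slide can be walked through in plain Python. This is a sketch of the sum-and-count pattern only: the sample rows are invented for illustration (they are not the NFL dataset used in the demo), and `reduce_by_key` is a hypothetical helper that stands in for the RDD `reduceByKey` transformation.

```python
# (Team, Position, ReceivingYards) rows. These values are made up for the
# example and do NOT come from the NFL dataset used in the demo.
rows = [
    ("AAA", "WR", 100),
    ("AAA", "WR", 50),
    ("AAA", "QB", 0),
    ("BBB", "WR", 80),
    ("BBB", "RB", 20),
    ("BBB", "WR", 40),
    ("CCC", "WR", 30),
]


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge the values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# 1. Select only rows with WRs.
wr_rows = [row for row in rows if row[1] == "WR"]

# 2-3. Build (Team, (Receiving Yards, 1)) pairs.
pairs = [(team, (yards, 1)) for team, _position, yards in wr_rows]

# 4. Reduce by key: (Team, (total yards, number of rows)).
totals = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))

# 5. Average per team, then the top teams in descending order.
averages = {team: total / count for team, (total, count) in totals.items()}
top_teams = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_teams)
```

Carrying the count alongside the sum is what lets a single reduce step produce an average, because sums and counts can be merged pairwise in any order while averages cannot.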
  • 28. Spark Streaming Twitter Demo
  • 30. The IBM Analytics for Apache Spark offering (repeats the “What it is” and “What you get” points of slide 23)
      • Diagram labels: client environment (apps, data); Spark environment as a service (IBM hosted, managed, secure); data, request and result flows between them
  • 31. Spark RDD and Spark SQL Demos
      • NFL 2014 Regular Season Player Game Statistics Dataset
  • 32. Spark Streaming
      • Scalable, high-throughput, fault-tolerant stream processing of live data streams
      • Write Spark Streaming applications like Spark applications
      • Recovers lost work and operator state (sliding windows) out of the box
      • Uses HDFS and ZooKeeper for high availability
      • Data sources also include TCP sockets, ZeroMQ or other customized data sources
  • 33. Spark Streaming - Internals
      • The input stream goes into Spark Streaming
      • Spark Streaming breaks it up into batches of input data
      • It feeds the batches into the Spark engine for processing
      • The engine generates the final results as streams of batches
      • DStream - Discretized Stream
          • Represents a continuous stream of data created from the input streams
          • Internally, represented as a sequence of RDDs
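The micro-batching idea can be sketched in plain Python. This is an illustration under simplifying assumptions, not the Spark Streaming API: `discretize` and `word_count` are hypothetical helpers, events carry explicit timestamps in seconds, each batch is an ordinary list standing in for one RDD of a DStream, and intervals with no events are simply skipped.

```python
from collections import Counter


def discretize(events, batch_interval):
    """Group (timestamp, line) events into consecutive batches.

    Batch k holds the lines whose timestamps fall in
    [k * batch_interval, (k + 1) * batch_interval).
    """
    batches = {}
    for timestamp, line in events:
        batches.setdefault(int(timestamp // batch_interval), []).append(line)
    return [batches[k] for k in sorted(batches)]


def word_count(batch):
    """Process one batch the way an ordinary batch job would."""
    return Counter(word for line in batch for word in line.split(" "))


# Timestamps are in seconds; the values are invented for the example.
events = [(0.2, "hello world"), (0.7, "hello spark"), (1.4, "world")]

# With a 1-second interval the stream becomes two batches, each processed independently.
results = [word_count(batch) for batch in discretize(events, batch_interval=1)]
print(results)
```

Each batch is processed with the same logic a batch job would use, which is why the deck describes Spark Streaming as near real-time rather than true streaming.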
  • 34. Spark Streaming – Getting Started
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      # Create a local StreamingContext with two worker threads and a batch interval of 1 second
      sc = SparkContext("local[2]", "NetworkWordCount")
      ssc = StreamingContext(sc, 1)

      # Create a DStream that will connect to hostname:port, like localhost:9999
      lines = ssc.socketTextStream("localhost", 9999)

      # Split each line into words
      words = lines.flatMap(lambda line: line.split(" "))

      # Count each word in each batch
      pairs = words.map(lambda word: (word, 1))
      wordCounts = pairs.reduceByKey(lambda x, y: x + y)

      # Print the first ten elements of each RDD generated in this DStream to the console
      wordCounts.pprint()
  • 35. Spark Streaming – Getting Started (continued)
      # Start the computation
      ssc.start()
      # Wait for the computation to terminate
      ssc.awaitTermination()

      # RUNNING
      $ ./bin/spark-submit examples/src/main/python/streaming/ localhost 9999
      ...
      -------------------------------------------
      Time: 2014-10-14 15:25:21
      -------------------------------------------
      (hello,1)
      (world,1)
      ...
  • 36. Spark Streaming Twitter Demo
  • 37. Spark R
      • Spark R is an R package that provides a lightweight front end for using Apache Spark from R
      • Spark R exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster
      • Goals
          • Make Spark R production ready
          • Integration with MLlib
          • Consolidation of the data frame and RDD concepts
  • 38. Spark R Demo
  • 39. Spark MLlib
      • MLlib is Spark’s machine learning library
      • Marked as under active development
      • Provides common algorithms and utilities
          • Classification
          • Regression
          • Clustering
          • Collaborative filtering
          • Dimensionality reduction
      • Leverages iteration, which can yield better results than the one-pass approximations sometimes used with MapReduce
  • 40. Spark GraphX
      • Flexible graphing
          • GraphX unifies ETL, exploratory analysis, and iterative graph computation
          • You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API
      • Speed
          • Comparable performance to the fastest specialized graph processing systems
      • Algorithms
          • Choose from a growing library of graph algorithms
          • In addition to a highly flexible API, GraphX comes with a variety of graph algorithms
  • 41. Resources
      • The “Learning Spark” O’Reilly book
      • Apache Spark at Bluemix
      • Workbench – Data Scientist Workbench
      • The following course on Big Data University