Introduction to apache spark

Introduction to Spark
Eric Eijkelenboom - UserReport - userreport.com

• What is Spark and why should I care?
• Architecture and programming model
• Examples
• Mini demo
• Related projects

RTFM
• A general-purpose computation framework that leverages distributed
memory
• More ﬂexible than MapReduce (it supports general execution graphs)
• Linear scalability and fault tolerance
• It supports a rich set of higher-level tools including
• Shark (Hive on Spark) and Spark SQL
• MLlib for machine learning
• GraphX for graph processing
• Spark Streaming

!
!
!
!
• Slow due to serialisation & replication
• Inefﬁcient for iterative computing & interactive
querying
Limitations of MapReduce
Input iter. 1 iter. 2 . . .
HDFS 
read
HDFS 
write
HDFS 
read
HDFS 
write
Map
Map
Map
Reduce
Reduce
Input Output

Leveraging memory
iter. 1 iter. 2 . . .
Input
HDFS 
read
HDFS 
write
HDFS 
read
HDFS 
write

Leveraging memory
iter. 1 iter. 2 . . .
Input
iter. 1 iter. 2 . . .
Input
HDFS 
read
HDFS 
write
HDFS 
read
HDFS 
write

Leveraging memory
iter. 1 iter. 2 . . .
Input
iter. 1 iter. 2 . . .
Input
HDFS 
read
HDFS 
write
HDFS 
read
HDFS 
write
Not tied to 2-stage
MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly

So, Spark is…
• In-memory analytics, many times faster than
Hadoop/Hive
• Designed for running iterative algorithms &
interactive querying
• Highly compatible with Hadoop’s Storage APIs
• Can run on your existing Hadoop Cluster Setup
• Programming in Scala, Python or Java

Architecture
HDFS
Datanode Datanode Datanode....
Spark Worker Spark Worker Spark Worker
....
Cache Cache Cache
Block Block Block
Cluster Manager
Spark Driver (Master)

Architecture
HDFS
Datanode Datanode Datanode....
Spark Worker Spark Worker Spark Worker
....
Cache Cache Cache
Block Block Block
Cluster Manager
Spark Driver (Master)
• YARN
• Mesos
• Standalone

Programming model
• Resilient Distributed Datasets (RDDs) are basic building blocks
• Distributed collection of objects, cached in-memory across
cluster nodes
• Automatically rebuilt on failure
• RDD operations
• Transformations: create new RDDs from existing ones
• Actions: return a value to the master node after running a
computation on the dataset

As you know…
• … Hadoop is a distributed system for counting
words
• Here is how it’s done is Spark

As you know…
• … Hadoop is a distributed system for counting
words
• Here is how it’s done is Spark
Blue code: Spark operations
Red code: functions
(closures) that get passed
to the cluster automatically

Text search
In memory text search:
!
!
caches the RDD in memory for faster reuse

Logistic regression
!
• 100 GB of data on a 100 node cluster

Hive on Spark = Shark
• A large scale data warehouse system just like Hive
• Highly compatible with Hive (HQL, metastore,
serialization formats, and UDFs)
• Built on top of Spark (thus a faster execution engine)
• Provision of creating in-memory materialized tables
(Cached Tables)
• And cached tables utilise columnar storage instead of
raw storage

Shark
Shark uses the existing Hive client and metastore

MLlib
• Machine learning library based on Spark
!
!
• Supports a range of machine learning algorithms,
including classiﬁcation, regression, clustering,
collaborative ﬁltering, dimensionality reduction, and
more

Spark Streaming
• Write streaming applications in the same way as
batch applications
• Reuse code between batch processing and
streaming
• Write more than analytics applications:
• Join streams against historical data
• Run ad-hoc queries on stream state

Spark Streaming
• Count tweets on a sliding window
!
!
• Find words with higher frequency than historic data

Introduction to apache spark

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (14)

Similar to Introduction to apache spark

Similar to Introduction to apache spark (20)

Recently uploaded

Recently uploaded (20)

Introduction to apache spark