Apache Spark: The Next Gen toolset for Big Data Processing

Prajod Vettiyattil
Architect, Open source
Wipro
in.linkedin.com/in/prajod
@prajods
Namitha M S
Architect, Advanced Technologies
Wipro
in.linkedin.com/in/namithams
Open Source India
Nov 2014
Bangalore

Agenda
•Big Data
•Hadoop stack and its limitations
•Spark: An overview
•Streaming, GraphX and MLlib
•Performance characteristics of Spark

Big Data
•Data sets too large for conventional systems to store and process
•3 Vs: Volume, Variety, Velocity
•Storage challenge
•Analysis challenge
•Query results take hours, days or months
[Figure: data disks]

The Big Data Analysis Triad
Batch
Interactive
Streaming

The Hadoop stack
•Distributed data processing
•Fault tolerant
•Processes petabyte-scale data sets
•Ecosystem tools
•Hive, HBase
•Pig
•Storm
•Hadoop
•Map
•Reduce
•Shuffle, partition, sort
•HDFS

Hadoop: Data flow
Map side
•Input data files
•Map
•Buffer in memory
•Partition for target reducers
•Sort each partition by key
•Potential spill to disk
•Merge all partitions and write to disk
Reduce side
•HTTP fetch from map node
•Merge round 1, merge round 2, … merge round N
•Merge sort
•Reduce
•Output
High disk I/O
•On Map nodes
•On Reduce nodes
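The map, partition, sort, merge and reduce steps named on this slide can be sketched as a toy word count. This is a minimal single-process illustration, not Hadoop code: the function names are hypothetical, CRC32 stands in for the partitioner, and the in-memory lists stand in for the files Hadoop writes to and reads from disk at each step.

```python
import zlib
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def partition_and_sort(pairs, num_reducers):
    """Partition map output by target reducer, then sort each partition by key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        target = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[target].append((key, value))
    return {p: sorted(kv, key=itemgetter(0)) for p, kv in partitions.items()}


def reduce_phase(map_outputs, num_reducers):
    """Reduce: merge the sorted runs from every map task, then sum per key."""
    result = {}
    for p in range(num_reducers):
        runs = [output.get(p, []) for output in map_outputs]  # "fetch" from each map
        merged = merge(*runs, key=itemgetter(0))               # merge sort of sorted runs
        for key, group in groupby(merged, key=itemgetter(0)):
            result[key] = sum(value for _, value in group)
    return result


def word_count(splits, num_reducers=2):
    map_outputs = [partition_and_sort(map_phase(s), num_reducers) for s in splits]
    return reduce_phase(map_outputs, num_reducers)


splits = [["spark and hadoop", "hadoop stack"], ["spark streaming and spark sql"]]
counts = word_count(splits)
```

In a real cluster every hand-off between these functions goes through local disk or the network, which is where the high disk I/O noted on the slide comes from.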

Limitations of Hadoop
•Batch mode only: covers just the batch layer of the Lambda pattern
•No real-time processing
•Poor fit for repetitive queries, such as iterative algorithms and interactive data querying
•Poor support for distributed memory

Spark: An overview
•“Over time, fewer projects will use MapReduce, and more will use Spark”
•Doug Cutting, creator of Hadoop
•New architecture: scales better and simplifies processing
•In-memory processing for Big Data
•Cached intermediate data sets
•Multi-step DAG-based execution
•Resilient Distributed Datasets (RDDs)
•The core innovation in Spark

Spark Ecosystem tools
Apache Spark
Spark SQL
Streaming
MLlib
GraphX
SparkR
BlinkDB
Shark
Bagel

DAG Execution Engine
DAG = Directed Acyclic Graph
[Diagram nodes: Map, Collect, Filter, Map, Reduce, Sort, Collect]
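As a rough illustration of DAG-based execution, the sketch below runs a small graph of steps in dependency order and keeps every intermediate result in memory. The step names, the shape of the graph and the `run_dag` helper are assumptions made for the example; they are not taken from the slide's diagram or from Spark's scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(steps):
    """Run steps given as name -> (function, [dependency names]).

    Each step executes only after the steps it depends on, and intermediate
    outputs stay in a dict in memory instead of being written out between steps.
    """
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = steps[name]
        outputs[name] = fn(*(outputs[d] for d in deps))
    return outputs


# Hypothetical job: two branches share one input, and one step joins them.
steps = {
    "input": (lambda: [5, 3, 8, 1, 6], []),
    "map": (lambda xs: [x * 10 for x in xs], ["input"]),
    "filter": (lambda xs: [x for x in xs if x > 2], ["input"]),
    "sort": (lambda xs: sorted(xs), ["map"]),
    "reduce": (lambda mapped, kept: sum(mapped) + sum(kept), ["map", "filter"]),
}
outputs = run_dag(steps)
```

The point of the sketch is only the ordering: a multi-step job is planned as one graph, rather than as a chain of separate map and reduce jobs.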

RDD
•Resilient Distributed Datasets
•Features
•Read only
•Fault tolerance without replication
•Uses data lineage for recovery
•Low network I/O
•Partitions/Slices
•Parallel tasks
[Diagram: Disk, Transform 1, RDD 1, Transform 2, RDD 2, each RDD split into data partitions]
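Lineage-based recovery can be illustrated with a toy data set class. This is a sketch under stated assumptions, not Spark's RDD implementation: `ToyRDD`, `compute` and `lose_partition` are hypothetical names, and a dropped cache entry stands in for a failed node.

```python
class ToyRDD:
    """A read-only, partitioned data set that remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # list of partitions (root data set only)
        self._parent = parent  # lineage: the data set this one was derived from
        self._fn = fn          # lineage: function applied to each parent partition
        self._cache = {}       # partition index -> materialized partition

    @property
    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self._fn(self._parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
squares = base.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_partition(1)   # only partition 1 is lost
after = even_squares.collect()   # and only partition 1 is recomputed
```

Because each data set is read only and records its parent and transformation, a lost partition can be rebuilt from its parent partition alone, without keeping replicas.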

Lambda architecture pattern
•Spark can be used to implement the Lambda architecture
•Batch layer
•Speed layer
•Serving layer
[Diagram: Input, Batch Layer, Speed Layer, Serving Layer, Query, Data consumers]
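One common reading of the three layers is that a query combines a precomputed batch view with a view over data that arrived since the last batch run. The sketch below shows that idea with simple counts; the function names and sample data are hypothetical, and a real deployment would build the two views with separate jobs.

```python
from collections import Counter


def build_batch_view(master_data):
    """Batch layer: recompute the view from all data up to the last batch run."""
    return Counter(master_data)


def build_speed_view(recent_events):
    """Speed layer: cover only the events that arrived after the last batch run."""
    return Counter(recent_events)


def serve(key, batch_view, speed_view):
    """Serving side: answer a query by combining both views."""
    return batch_view[key] + speed_view[key]


master_data = ["spark", "hadoop", "spark"]
recent_events = ["spark", "storm"]

batch_view = build_batch_view(master_data)
speed_view = build_speed_view(recent_events)
total_spark = serve("spark", batch_view, speed_view)
```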

Spark Streaming
•For stream processing in Spark
•Real-time data
•For example, queries on Twitter data
•Discretized streams (DStreams)
•Micro batches
•Sequence of RDDs

Discretized Streams
[Diagram: Input, Spark Streaming, batches of x seconds, Spark, Output]
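The micro-batch idea can be sketched without Spark: cut a stream of timestamped events into consecutive fixed-width batches, then run the same batch function on each one. `discretize` and `process_batch` are hypothetical names, and the one-second batch width is an arbitrary choice for the example.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Intervals with no events produce an empty batch, so the result is an
    unbroken sequence from the first to the last occupied interval.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // batch_seconds)].append(value)
    if not buckets:
        return []
    return [buckets.get(i, []) for i in range(min(buckets), max(buckets) + 1)]


def process_batch(batch):
    """The per-batch job; here it simply counts the events in the batch."""
    return len(batch)


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_seconds=1.0)
counts = [process_batch(b) for b in batches]
```

The latency floor follows directly from this design: a result for an event cannot be produced before its batch closes.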

Why Spark Streaming
•Near-real-time processing (0.5 to 2 sec latency)
•Parallel recovery from lost nodes and stragglers
•Implementation of Lambda architecture
•Single engine for batch and stream
•Not suited to very low latency requirements (e.g., around 100 ms)

Apache Storm vs Spark Streaming
Feature | Spark Streaming | Storm
Processing model | Micro-batching | Event stream processing
Message delivery options | Inherently fault tolerant, exactly-once delivery | At least once, at most once, exactly once
Flexibility | Coarse-grained transformation | Fine-grained transformation
Implemented in | Scala | Clojure
Development cost | Common platform for both batch and stream | Stream only; separate setup for batch
Applicability | Machine learning, interactive analytics, near-real-time analytics | Near-real-time analytics, natural language processing

GraphX & MLlib
•Data parallel vs graph parallel processing
•Wikipedia search vs Facebook connection search, PageRank
•Spark MLlib implements high quality machine learning algorithms
•Iterative algorithm paradigm
•Leverages Spark’s in-memory data sets
$x^{(t+1)} = f\left(x^{(t)}\right)$
[Diagram: x(t) → f(x(t)) → x(t+1) → f(x(t+1))]
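The iterative paradigm, where each step computes the next state from the previous one, can be sketched with a small PageRank run. This is plain Python rather than MLlib or GraphX; the three-page link graph, the damping factor of 0.85 and the stopping tolerance are assumptions chosen for the example.

```python
def iterate(f, x0, tol=1e-9, max_iter=1000):
    """Apply x_next = f(x) repeatedly until the largest change falls below tol."""
    x = x0
    for step in range(1, max_iter + 1):
        x_next = f(x)
        if max(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next, step
        x = x_next
    return x, max_iter


def pagerank_step(ranks, links, damping=0.85):
    """One PageRank update: each page splits its rank among its outgoing links."""
    n = len(ranks)
    new_ranks = [(1 - damping) / n] * n
    for page, outgoing in links.items():
        share = damping * ranks[page] / len(outgoing)
        for target in outgoing:
            new_ranks[target] += share
    return new_ranks


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
start = [1 / 3, 1 / 3, 1 / 3]

ranks, steps_taken = iterate(lambda r: pagerank_step(r, links), start)
```

Every pass reads the same link data again, which is why keeping that data in memory between iterations matters for this class of algorithm.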

Performance characteristics
Performance of Spark
•Up to 100x faster than Hadoop MapReduce in memory
•Up to 10x faster on disk
Graph courtesy: spark.apache.org

Hadoop vs Spark
Metric | Hadoop World Record | Spark 100 TB * | Spark 1 PB
Data size | 102.5 TB | 100 TB | 1000 TB
Elapsed time | 72 mins | 23 mins | 234 mins
# Nodes | 2100 | 206 | 190
# Cores | 50400 | 6592 | 6080
# Reducers | 10,000 | 29,000 | 250,000
Rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min
Rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min
Data courtesy: databricks.com
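The Rate and Rate/node figures above follow from data size, elapsed time and node count. The short check below recomputes them, assuming 1 TB = 1000 GB. With the rounded 23 minutes, the Spark 100 TB run works out to about 4.35 TB/min rather than the quoted 4.27, which suggests the elapsed time on the slide is rounded down; that reading is an inference, not something the source states.

```python
# Figures copied from the comparison above; the dictionary keys are labels only.
runs = {
    "Hadoop world record": {"tb": 102.5, "minutes": 72, "nodes": 2100},
    "Spark 100 TB": {"tb": 100, "minutes": 23, "nodes": 206},
    "Spark 1 PB": {"tb": 1000, "minutes": 234, "nodes": 190},
}


def rates(run):
    """Return (TB per minute, GB per minute per node), taking 1 TB = 1000 GB."""
    tb_per_min = run["tb"] / run["minutes"]
    gb_per_min_per_node = tb_per_min * 1000 / run["nodes"]
    return tb_per_min, gb_per_min_per_node


summary = {name: rates(run) for name, run in runs.items()}

for name, (tb_per_min, gb_per_node) in summary.items():
    print(f"{name}: {tb_per_min:.2f} TB/min, {gb_per_node:.1f} GB/min per node")
```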

1 TB performance test: data per sec

1 TB performance test: data rate vs RAM size

Summary
Apache Spark
•New architecture
•RDD, DAG
•In-memory processing
•Map reduce and more
•GraphX
•MLlib
•Spark Streaming
Ecosystem tools
•SparkR
•BlinkDB
•Storm
Spark performance
•GBs per second
•RAM to data size
•Inflexion point

Questions
Prajod Vettiyattil
Architect, Open source
Wipro
@prajods
in.linkedin.com/in/prajod
Namitha M S
Architect, Advanced Technologies
Wipro
in.linkedin.com/in/namithams
