© 2015 IBM Corporation
Apache Hadoop Day 2015
Intro to Apache Spark
LIGHTNING FAST CLUSTER COMPUTING
© 2015 IBM Corporation
Apache Hadoop Day 2015
MapReduce Limitations
• Lots of boilerplate makes it complex to program in MR.
• The disk-based approach is not good for iterative use cases.
• Batch processing is not a fit for real time.
In short, there is no single solution; people build specialized
systems as workarounds.
© 2015 IBM Corporation
Spark Goal
Batch, Interactive, Streaming:
a Single Framework!
Support batch, streaming, and interactive computations…
… in a unified framework
Easy to develop sophisticated algorithms (e.g., graph, ML algos)
© 2015 IBM Corporation
Spark Stack
• Spark Core – supports Scala, Java, Python, R
• Spark SQL – interactive
• Spark Streaming – real time
• MLlib/spark.ml – machine learning
• GraphX – graph processing
Unified engine across diverse workloads and
environments
© 2015 IBM Corporation
Data processing landscape
Graph processing: Pregel (Google), Giraph (Apache), GraphLab (Dato), …
© 2015 IBM Corporation
Data processing landscape
Graph: Pregel (Google), Giraph (Apache), GraphLab (Dato)
SQL: Dremel (Google), Drill (Apache), Impala (Cloudera), …
© 2015 IBM Corporation
Data processing landscape
Graph: Pregel (Google), Giraph (Apache), GraphLab (Dato)
SQL: Dremel (Google), Drill (Apache), Impala (Cloudera)
DAG: Tez (Apache)
Stream: Storm (Apache), …
© 2015 IBM Corporation
Data processing landscape
(Recap diagram: the same landscape of specialized engines – Pregel,
Giraph, GraphLab for graph; Dremel, Drill, Impala for SQL; Tez for
DAG; Storm for streaming.)
© 2015 IBM Corporation
Spark
• Unifies batch, streaming, and interactive computation
• Easy to build sophisticated applications
• Supports iterative, graph-parallel algorithms
• Powerful APIs in Scala, Python, Java
Stack: Spark core with Spark Streaming (streaming), Shark SQL and
BlinkDB (batch, interactive), GraphX and MLlib (data-parallel,
iterative, sophisticated algos)
© 2015 IBM Corporation
MapReduce Vs Spark
• MapReduce runs each task in its own process; when the task
completes, the process dies (cf. MultithreadedMapper).
• In Spark, by default many tasks run concurrently as threads
on a single executor.
• An MR executor is short lived and runs one large task.
• A Spark executor is long lived and runs many small tasks.
• Process creation vs. thread creation cost.
© 2015 IBM Corporation
Problems in Spark
• Applications cannot share data (mostly RDDs in a
SparkContext) without writing to external storage.
• Resource allocation inefficiency
[spark.dynamicAllocation.enabled].
• Not exactly designed for interactive applications.
© 2015 IBM Corporation
Spark Internals – RDD
• RDD (Resilient Distributed Dataset)
• Lazy & Immutable
• Iterative operations before RDD
• Fault Tolerant
• Traditional way for achieving Fault Tolerance
• How does RDD achieve Fault Tolerance
• Partition
© 2015 IBM Corporation
Apache Hadoop Day 2015
Spark Internals – RDDs
sc.textFile("hdfs://<input>")
  .filter(_.startsWith("ERROR"))
  .map(_.split(" ")(1))
  .saveAsTextFile("hdfs://<output>")
Stage-1: HDFS -> HadoopRDD -> FilteredRDD -> MappedRDD -> HDFS
© 2015 IBM Corporation
Apache Hadoop Day 2015
Spark Internals – RDDS
Narrow vs Wide Dependency
• Narrow dependency – each partition of the parent is used by at
most one partition of the child.
• Wide dependency – multiple child partitions may depend on one
parent partition.
© 2015 IBM Corporation
Apache Hadoop Day 2015
Narrow/Shuffle Dependency – class diagram
© 2015 IBM Corporation
Apache Hadoop Day 2015
Spark Internals – Job Scheduling
rdd1.join(rdd2)
  .groupBy(…)
  .filter(…)
RDD Objects -> DAG Scheduler -> Task Scheduler -> Executor
• RDD Objects: build the operator DAG.
• DAG Scheduler: splits the DAG into stages and tasks; submits each
stage as it becomes ready.
• Task Scheduler: launches individual tasks.
• Executor (task threads, block manager): executes tasks; stores and
serves blocks.
© 2015 IBM Corporation
Apache Hadoop Day 2015
Resource Allocation
• Dynamic Resource Allocation.
• Resource Allocation Policy:
− Request Policy
− Remove Policy
© 2015 IBM Corporation
Apache Hadoop Day 2015
Request/Remove Policy
Request
• There are pending tasks to be scheduled.
• Spark requests executors in rounds, governed by
spark.dynamicAllocation.schedulerBacklogTimeout &
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout.
Remove
• An executor is removed when it has been idle for more than
spark.dynamicAllocation.executorIdleTimeout seconds.
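To make the request/remove policy concrete, here is a minimal configuration sketch, assuming the settings are applied through SparkConf; the timeout values are illustrative, not recommendations, and the external shuffle service (next slide) must be enabled alongside dynamic allocation.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")                          // preserves shuffle files when executors are removed
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")          // first request after 1s of backlog
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s") // subsequent request rounds
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")             // remove executors idle for 60s
val sc = new SparkContext(conf)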
© 2015 IBM Corporation
Apache Hadoop Day 2015
Graceful Decommission of Executors
• State before Dynamic Allocation
• With Dynamic Allocation
• Complexity increases with shuffle
• External Shuffle Service
• State of cached data (on disk or in memory)
© 2015 IBM Corporation
Apache Hadoop Day 2015
Fair Scheduler
• What is Fair Scheduling?
• How to enable the Fair Scheduler:
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
• Fair Scheduler Pools
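As a follow-on, a minimal sketch of routing jobs into a named scheduler pool; the pool name "production" is a made-up example, and pool weights/minShare would normally be defined in the XML file referenced by spark.scheduler.allocation.file.
// Jobs submitted from this thread go into the "production" pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
val total = sc.textFile("hdfs://<input>").count()
// Clear the property so later jobs fall back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)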
© 2015 IBM Corporation
RDD Deep Dive
• RDD Basics
• How to create
• RDD Operations
• Lineage
• Partitions
• Shuffle
• Type of RDDs
• Extending RDD
• Caching in RDD
© 2015 IBM Corporation
RDD Basics
• RDD (Resilient Distributed Dataset)
• A distributed collection of objects
• Resilient – ability to re-compute missing partitions
(node failure)
• Distributed – split across multiple partitions
• Dataset – can contain any type: Python/Java/Scala
objects or user-defined objects
• The fundamental unit of data in Spark
© 2015 IBM Corporation
RDD Basics – How to create
Two ways:
• Loading external datasets
− Spark supports a wide range of sources
− Access HDFS data through Hadoop's InputFormat &
OutputFormat
− Supports custom input/output formats
• Parallelizing a collection in the driver program
val lineRDD = sc.textFile("hdfs:///path/to/Readme.md")
textFile("/my/directory/*") or textFile("/my/directory/*.gz")
SparkContext.wholeTextFiles returns (filename, content) pairs
val listRDD = sc.parallelize(List("spark", "meetup", "deepdive"))
© 2015 IBM Corporation
RDD Operations
• Two types of operations:
− Transformations
− Actions
• Transformations are lazy; nothing actually happens until an action is
called.
• An action triggers the computation.
• An action returns values to the driver or writes data to external storage.
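A minimal sketch of that split, assuming an existing SparkContext sc and a placeholder HDFS path; no cluster work happens until count() is called.
val lines = sc.textFile("hdfs://<input>")          // transformation: only records the source
val errors = lines.filter(_.startsWith("ERROR"))   // transformation: still nothing executed
val numErrors = errors.count()                     // action: triggers the actual job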
© 2015 IBM Corporation
Lazy Evaluation
− Transformations on an RDD don't get performed immediately.
− Spark internally records metadata to track the operation.
− Loading data into an RDD is also lazily evaluated.
− Lazy evaluation reduces the number of passes over the data by
grouping operations.
− In MapReduce, the burden is on the developer to merge operations,
leading to complex map functions.
− If the RDD is not persisted, a failure will re-compute the complete
lineage every time.
© 2015 IBM Corporation
RDD In Action
sc.textFile("hdfs://file.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
Input:
"I scream you scream lets all scream for icecream!"
"I wish I were what I was when I wished I were what I am."
After flatMap (words): I, scream, you, scream, lets, all, scream, for, icecream, …
After map (pairs): (I,1), (scream,1), (you,1), (scream,1), (lets,1), (all,1), (scream,1), (icecream,1), …
After reduceByKey (counts): (scream,3), (you,1), (lets,1), (I,1), (all,1), …
© 2015 IBM Corporation
Lineage Demo
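One way to see the lineage for yourself is toDebugString on the word-count RDD from the previous slide; the exact output format varies by Spark version, so treat the comment below as a sketch.
val counts = sc.textFile("hdfs://file.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
// Prints something like: ShuffledRDD <- MapPartitionsRDD <- ... <- HadoopRDD
println(counts.toDebugString)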
© 2015 IBM Corporation
RDD Partition
• Partition definition
− Fragments of an RDD
− Fragmentation allows Spark to execute in parallel.
− Partitions are distributed across the cluster (Spark workers).
• Partitioning
− Impacts parallelism
− Impacts performance
© 2015 IBM Corporation
Importance of partition Tuning
• Too few partitions
− Less concurrency, unused cores.
− More susceptible to data skew.
− Increased memory pressure for groupBy, reduceByKey,
sortByKey, etc.
• Too many partitions
− Framework overhead (more scheduling latency than the time
needed for the actual task).
− Many CPU context switches.
• Need a "reasonable number" of partitions
− Commonly between 100 and 10,000 partitions
− Lower bound: at least ~2x the number of cores in the cluster
− Upper bound: ensure tasks take at least 100 ms
© 2015 IBM Corporation
How Spark Partitions data
• Input data partitioning
• Shuffle transformations
• Custom Partitioner
© 2015 IBM Corporation
Partition - Input Data
• Spark uses the same classes as Hadoop to perform input/output.
• sc.textFile("hdfs://…") invokes Hadoop's TextInputFormat.
• Knobs that define the number of partitions:
− dfs.block.size – default 128 MB (Hadoop 2.0)
− numPartitions – can be used to increase the number of partitions;
default is 0, which means 1 partition
− mapreduce.input.fileinputformat.split.minsize – default 1 KB
• Partition size = Max(minSize, Min(goalSize, blockSize))
− goalSize = totalInputSize / numPartitions
• 32 MB, 0, 1 KB, 640 MB total size – defaults:
Max(1 KB, Min(640 MB, 32 MB)) = 20 partitions
• 32 MB, 30, 1 KB, 640 MB total size – want more partitions:
Max(1 KB, Min(32 MB, 32 MB)) = 32 partitions
• 32 MB, 5, 1 KB: Max(1 KB, Min(120 MB, 32 MB)) = 20 – bigger partitions
• 32 MB, 0, 64 MB: Max(64 MB, Min(640 MB, 32 MB)) = 10 – bigger partitions
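The split-size rule above is easy to check by hand; this small sketch is just the arithmetic as stated on the slide, not Hadoop or Spark source.
// splitSize = max(minSize, min(goalSize, blockSize))
def splitSize(minSize: Long, goalSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

val kb = 1024L; val mb = 1024L * 1024L
val totalInput = 640 * mb
// Defaults from the first example: minSize = 1 KB, goalSize = total input, blockSize = 32 MB.
val size = splitSize(1 * kb, totalInput, 32 * mb)   // = 32 MB
val numPartitions = totalInput / size               // = 20 partitions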
© 2015 IBM Corporation
Partition - Shuffle transformations
• All shuffle transformations provide a parameter for the desired
number of partitions.
• Default behavior – Spark uses HashPartitioner:
− If spark.default.parallelism is set, it is taken as the number of
partitions.
− If spark.default.parallelism is not set, the largest upstream
RDD's number of partitions is used, which reduces the chance
of out-of-memory errors.
Shuffle transformations:
1. groupByKey
2. reduceByKey
3. aggregateByKey
4. sortByKey
5. join
6. cogroup
7. cartesian
8. coalesce
9. repartition
10. repartitionAndSortWithinPartitions
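A minimal sketch of overriding the default: every transformation in the list accepts an explicit partition count. The pair RDD named pairs and the count 512 are assumptions for illustration.
val counts = pairs.reduceByKey(_ + _, 512)   // request 512 partitions for this shuffle
println(counts.partitions.length)            // 512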
© 2015 IBM Corporation
Partition - Repartitioning
• RDD provides two operators:
• repartition(numPartitions)
− Can increase or decrease the number of partitions.
− Internally does a shuffle, so it is expensive.
− For decreasing partitions, use coalesce instead.
• coalesce(numPartitions, shuffle = true/false)
− Decreases the number of partitions.
− Goes for narrow dependencies and avoids a shuffle.
− A drastic reduction may still trigger a shuffle.
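A short sketch contrasting the two operators on an arbitrary rdd; the partition counts are illustrative.
val wide = rdd.repartition(200)               // full shuffle; can grow or shrink the partition count
val narrow = rdd.coalesce(4)                  // narrow dependency; merges partitions, avoids a shuffle
val forced = rdd.coalesce(4, shuffle = true)  // equivalent to repartition(4)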
© 2015 IBM Corporation
Custom Partitioner
• Partition the data according to the use case & data structure.
• Provides control over the number of partitions and the
distribution of data.
• Extend the Partitioner class; need to implement getPartition &
numPartitions.
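A minimal custom Partitioner sketch; the keying scheme (first character of the key) and the pair RDD named pairs are made-up examples to show where getPartition and numPartitions fit.
import org.apache.spark.Partitioner

class FirstLetterPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    math.abs(key.toString.headOption.getOrElse(' ').toInt) % parts
}

// partitionBy is available on pair RDDs only.
val repartitioned = pairs.partitionBy(new FirstLetterPartitioner(8))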
© 2015 IBM Corporation
Partitioning Demo
© 2015 IBM Corporation
Shuffle - GroupByKey Vs ReduceByKey
val wordCountsWithGroup = rdd
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
© 2015 IBM Corporation
Shuffle - GroupByKey Vs ReduceByKey
val wordPairsRDD = rdd.map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
.reduceByKey(_ + _)
.collect()
© 2015 IBM Corporation
The Shuffle
• Redistribution of data among partitions between stages.
• Most of the performance, reliability, and scalability issues in Spark occur
within the shuffle.
• Like MapReduce, the Spark shuffle uses a pull model.
• It has consistently evolved and is still an area of research in Spark.
© 2015 IBM Corporation
Shuffle Overview
• Spark runs a job stage by stage.
• Stages are built up by the DAGScheduler according to an RDD's
ShuffleDependency.
• e.g. ShuffleRDD / CoGroupedRDD will have a
ShuffleDependency.
• Many operators create a ShuffleRDD / CoGroupedRDD under
the hood:
• repartition / combineByKey / groupBy / reduceByKey / cogroup.
• Many other operators further call into the above operators,
e.g. the various join operators call cogroup.
© 2015 IBM Corporation
You have seen this
(Stage DAG diagram: RDDs A–G combined with map, union, join, and
groupBy, split into Stage 1, Stage 2, and Stage 3 at the shuffle
boundaries.)
© 2015 IBM Corporation
Shuffle is Expensive
• When doing a shuffle, data no longer stays only in memory; it gets
written to disk.
• For Spark, the shuffle process may involve:
• Data partitioning, which can involve very expensive data
sorting.
• Data ser/deser, to let data be transferred through the
network or across processes.
• Data compression, to reduce IO bandwidth.
• Disk IO, probably multiple times on one single data block,
e.g. shuffle spill and merge/combine.
© 2015 IBM Corporation
Shuffle History
• The shuffle module in Spark has evolved over time.
• Spark 0.6–0.7 – same code path as RDD's persist method;
MEMORY_ONLY and DISK_ONLY options available.
• Spark 0.8–0.9
− Separate code path for shuffle: ShuffleBlockManager &
BlockObjectWriter for shuffle only.
− Shuffle optimization – consolidated shuffle write.
• Spark 1.0 – introduced a pluggable shuffle framework.
• Spark 1.1 – sort-based shuffle implementation.
• Spark 1.2 – Netty transfer implementation; sort-based shuffle is
the default now.
• Spark 1.2+ – external shuffle service etc.
© 2015 IBM Corporation
Understanding Shuffle
• Input aggregation
• Types of shuffle
• Hash based
− Basic hash shuffle
− Consolidated hash shuffle
• Sort-based shuffle
© 2015 IBM Corporation
Input Aggregation
• Like MapReduce, Spark involves an aggregation (combiner) on the map side.
• Aggregation is done in ShuffleMapTask using:
• AppendOnlyMap (in-memory hash table combiner)
− Keys are never removed; values get updated.
• ExternalAppendOnlyMap (in-memory and on-disk hash table combiner)
− A hash map which can spill to disk.
− An append-only map that spills data to disk if memory is insufficient.
• Shuffle file in-memory buffer – shuffle writes go to an in-memory buffer
before being written to a shuffle file.
© 2015 IBM Corporation
Shuffle Types – Basic Hash Shuffle

• Hash-based shuffle (spark.shuffle.manager) hash-partitions the data
for the reducers.
• Each map task writes each bucket to its own file.
• #Map tasks = M
• #Reduce tasks = R
• #Shuffle files = M*R, #In-memory buffers = M*R
© 2015 IBM Corporation
Shuffle Types – Basic Hash Shuffle
• Problem
• Let's use 100 KB as the buffer size.
• We have 10,000 reducers.
• 10 mapper tasks per executor.
• In-memory buffer size = 100 KB * 10,000 * 10.
• The buffer needed will be 10 GB per executor.
• This huge amount of buffer is not acceptable, and this
implementation can't support 10,000 reducers.
© 2015 IBM Corporation
Shuffle Types – Consolidate Hash Shuffle
• Solution to decrease the in-memory buffer size and the number of files.
• Within an executor, map tasks write each bucket to a segment of the file.
• #Shuffle files per executor = #Reducers
• #In-memory buffers per executor = #Reducers
© 2015 IBM Corporation
Shuffle Types – Sort Based Shuffle
• Consolidated hash shuffle still needs one file for each reducer:
− a total of C*R intermediate files, where C = # of executors running
map tasks.
• Still too many files (e.g. ~10k reducers).
• Needs significant memory for compression & serialization
buffers.
• Too many open files issue.
• Sort-based shuffle is similar to the map-side shuffle in
MapReduce.
• Introduced in Spark 1.1; now it is the default shuffle.
© 2015 IBM Corporation
Shuffle Types – Sort Based Shuffle
• Map output records from each task are kept in memory as long as they fit.
• Once full, data gets sorted by partition and spilled to a single file.
• Each map task generates one data file and one index file.
• Utilizes an external sorter to do the sort work.
• If a map-side combiner is required, data is sorted by key and partition;
otherwise only by partition.
• If #reducers <= 200, no sorting: it uses the hash approach, generates a file
per reducer and merges them into a single file.
© 2015 IBM Corporation
Shuffle Reader
• On the read side, both sort and hash shuffle use the hash shuffle reader.
• On the reducer side, a set of threads fetches remote map output blocks.
• Once a block arrives, its records are deserialized and passed into a
result queue.
• Records are passed to ExternalAppendOnlyMap; for ordering
operations like sortByKey, records are passed to ExternalSorter.
(Diagram: buckets written by the map tasks feed one aggregator per
reduce task.)
© 2015 IBM Corporation
Type of RDDS - RDD Interface
The base for all RDDs (RDD.scala) consists of:
• A set of partitions ("splits" in Hadoop)
• A list of dependencies on parent RDDs
• A function to compute a partition from its
parents
• Optional preferred locations for each partition
• A Partitioner defining the partitioning strategy
(hash/range)
• Basic operations like map, filter, persist, etc.
(Diagram: partitions, dependencies, and compute make up the lineage;
preferred locations and the partitioner drive optimized execution;
map/filter/persist are the operations.)
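To make the interface concrete, here is a simplified sketch of the contract just listed; it is not the actual Spark source, only an abbreviation of the methods named on this slide.
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

trait RddContract[T] {
  def getPartitions: Array[Partition]                               // the set of splits
  def getDependencies: Seq[Dependency[_]]                           // parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]  // how to build one partition
  def getPreferredLocations(split: Partition): Seq[String] = Nil    // data-locality hints
  def partitioner: Option[Partitioner] = None                       // hash/range strategy
}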
© 2015 IBM Corporation
Example: HadoopRDD
• partitions = one per HDFS block
• dependencies = none
• compute(partition) = read the corresponding block
• preferredLocations(part) = HDFS block location
• partitioner = none
© 2015 IBM Corporation
Example: MapPartitionRDD
• partitions = same as the parent's partitions
• dependencies = "one-to-one" on the parent RDD
• compute(partition) = apply map on the parent
• preferredLocations(part) = none (ask parent)
• partitioner = none
© 2015 IBM Corporation
Example: CoGroupRDD
• partitions = one per reduce task
• dependencies = could be narrow or wide dependencies
• compute(partition) = read and join the shuffled data
• preferredLocations(part) = none
• partitioner = HashPartitioner(numTasks)
© 2015 IBM Corporation
Extending RDDs
Extend RDDs:
• To add transformations/actions
• Allows developers to express domain-specific calculations in a
cleaner way
• Improves code readability
• Easier to maintain
• Custom RDDs for an input source or domain
• A way to add a new input data source
• A better way to express domain-specific data
• Better control over partitioning and distribution
© 2015 IBM Corporation
How to Extend
• Add custom operators to an RDD
• Uses Scala implicits
• Feels and works like a built-in operator
• You can add an operator to a specific RDD type or to all RDDs
• Custom RDD
• Extend the RDD API to create your own RDD
• Implement the compute & getPartitions abstract methods
© 2015 IBM Corporation
Implicit Class
• Creates an extension method on an existing type
• Introduced in Scala 2.10
• Implicits are compile-time checked; an implicit class gets resolved
into a class definition with an implicit conversion
• We will use an implicit to add a new method to RDD
© 2015 IBM Corporation
Adding new Operator to RDD
• We will use Scala's implicit feature to add a new operator to an
existing RDD
• This operator will show up only on our RDD
• Implicit conversions are handled by Scala
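A minimal sketch of the pattern, assuming an RDD[String] and an existing SparkContext sc; the operator name errorsOnly is made up for illustration.
import org.apache.spark.rdd.RDD

object CustomOps {
  // Implicit class wrapping RDD[String] so errorsOnly() feels like a built-in operator.
  implicit class RichStringRDD(rdd: RDD[String]) {
    def errorsOnly(): RDD[String] = rdd.filter(_.startsWith("ERROR"))
  }
}

import CustomOps._
val errors = sc.textFile("hdfs://<input>").errorsOnly()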
© 2015 IBM Corporation
Custom RDD Implementation
• Extending RDD allows you to create your own custom RDD
structure
• A custom RDD gives control over computation and lets you change
partitioning & locality information
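A small custom RDD sketch along those lines: it overrides getPartitions and compute to generate a range of numbers. The class and its splitting scheme are hypothetical, purely to show where the two abstract methods go (sc is an existing SparkContext).
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangePartition(val index: Int, val start: Long, val end: Long) extends Partition

// Generates the numbers 0 until n, split into numSlices partitions; no parent RDDs (Nil).
class SimpleRangeRDD(sc: SparkContext, n: Long, numSlices: Int) extends RDD[Long](sc, Nil) {
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

val tenNumbers = new SimpleRangeRDD(sc, 10L, 2).collect()   // Array(0, 1, ..., 9)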
© 2015 IBM Corporation
Caching in RDD
• Spark allows caching/persisting an entire dataset in memory.
• Persisting an RDD in the cache:
• The first time it is computed, it will be kept in memory.
• The cached partitions are reused in the next set of operations.
• Fault tolerant: recomputed in case of failure.
• Caching is a key tool for interactive and iterative algorithms.
• persist supports different storage levels:
• Storage level – in memory, disk, or both; Tachyon.
• Serialized vs deserialized.
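A minimal persist sketch, reusing the ERROR-filter example from the earlier slides; the storage level is an illustrative choice.
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://<input>")
  .filter(_.startsWith("ERROR"))
  .persist(StorageLevel.MEMORY_AND_DISK)        // cache() would be MEMORY_ONLY

errors.count()                                   // first action materializes and caches the partitions
errors.filter(_.contains("timeout")).count()     // reuses the cached partitions instead of re-reading HDFS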
© 2015 IBM Corporation
Caching In RDD
• The SparkContext tracks persistent RDDs.
• The Block Manager puts a partition in memory when it is first evaluated.
• Caching is lazy: nothing is cached until an action runs.
• The shuffle also keeps its data around after shuffle operations,
but we still need to cache shuffled RDDs explicitly.
© 2015 IBM Corporation
Caching Demo
Editor's Notes

  1. MapReduce has MultithreadedMapper
  2. MapReduce has MultithreadedMapper
  3. Writes are coarse-grained, not fine-grained. Intermediate results are written to memory, whereas between two MapReduce tasks the data is written to disk. Traditional fault tolerance replicates data or logs updates across machines; an RDD provides fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.
  4. RDDs can hold primitives, sequences, Scala objects, and mixed types. Special RDDs exist for special purposes – PairRDD, DoubleRDD, SequenceFileRDD.
  5. Map leads to a narrow dependency, while join leads to a wide dependency. A wide dependency needs shuffling; the parent gets materialized.
  6. Map leads to a narrow dependency, while join leads to a wide dependency. A wide dependency needs shuffling; the parent gets materialized.
  7. In the driver there is the DAG Scheduler; it looks at the DAG, and all it understands is whether a dependency is wide or narrow. The DAG Scheduler then submits the first stage to the Task Scheduler, which is also in the driver. A stage is split into tasks; a task is data + computation. The Task Scheduler determines the number of tasks needed for the stage and allocates them to the executors. The executor heap gives 60% to cached RDDs, 20% to shuffle and 20% to the user program by default.
  8. (file systems & file formats – NFS, HDFS, S3, CSV, JSON, Sequence, Protocol Buffer)
  9. Transformation doesn’t mutate original RDD, always returns a new RDD
  10. fragmentation is what enables Spark to execute in parallel, and the level of fragmentation is a function of the number of partitions of your RDD The number of partitions is important because a stage in Spark will operate on one partition at a time (and load the data in that partition into memory)
  11. since with fewer partitions there’s more data in each partition, you increase the memory pressure on your program. More Network and disk IO
  12. dfs.block.size - The default value in Hadoop 2.0 is 128MB. In the local mode the corresponding parameter is fs.local.block.size (Default value 32MB). It defines the default partition size.
  13. HashPartitioner extends the Partitioner class.
  14. A shuffle involves two sets of tasks: tasks from the stage producing the shuffle data and tasks from the stage consuming it. For historical reasons, the tasks writing out shuffle data are known as "map tasks" and the tasks reading the shuffle data are known as "reduce tasks". Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data.
  15. Just as in Hadoop MapReduce, the Spark shuffle involves an aggregate step (combiner) before writing map outputs (intermediate values) to buckets. Spark also writes to a small buffer (the size is configurable via spark.shuffle.file.buffer.kb) before writing to physical files, to increase disk I/O speed.
  16. Reduces per map shuffle file to # of Reducer