1. Introduction to Spark
Sriram and Amritendu
DOS Lab, IIT Madras
“Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons
Attribution 4.0 International License.
2. Motivation
• In Hadoop, the programmer writes a job using the MapReduce abstraction
• The runtime distributes the work and handles fault tolerance
• This makes analysis of large data sets easy and reliable
Emerging class of applications:
• Machine learning, e.g. k-means clustering
• Graph algorithms, e.g. PageRank
3. Nature of the emerging class of applications
• Iterative computation
• Intermediate results are reused across multiple computations
4. Problem with Hadoop MapReduce
[Diagram: in each iteration, tasks read (R) their input from HDFS and write (W) their results back to HDFS before the next iteration's job starts.]
• Results are written to HDFS after every iteration
• A new job is launched for each iteration
• This incurs substantial storage and job-launch overheads
5. Can we do away with these overheads?
[Diagram: iterations read (R) and write (W) intermediate results in memory instead of HDFS; a failed node (X) loses its in-memory partitions.]
• Persist intermediate results in memory
• Memory is 10-100X faster than disk/network
• What if a node fails?
• Challenge: how to handle faults efficiently
6. Approaches to handle faults
• Replication
[Diagram: a master (M) forwards each write (W) to Replica 1 and Replica 2; reads (R) can still be served when one copy fails (X).]
– With r replicas, can tolerate 'r-1' failures
– Issues: requires more storage, generates more network traffic, and each operation must be logged
• Using lineage
[Diagram: input partitions D1, D2, D3 are transformed into C1, C2 through wide dependencies; when a partition is lost (X), it is re-computed from D2 and D3.]
– Re-compute lost partitions using lineage information
– Issue: recovery time can be high if re-computation is costly, e.g. with a high iteration time or wide dependencies
7. Spark
• RDD – Resilient Distributed Datasets
– Read-only, partitioned collection of records
– Supports only coarse-grained operations
• e.g. map and group-by transformations, reduce action
– Uses lineage graph to recover from faults
[Diagram: an RDD stored as 3 partitions, D11, D12, D13.]
8. Spark contd.
• Control placement of partitions of RDD
– can specify number of partitions
– can partition based on a key in each record
• useful in joins
• In-memory storage
– Up to 100X speedup over Hadoop for iterative
applications
• Spark can run on Hadoop YARN and read files
from HDFS
• Spark is written in Scala
9. Scala overview
• Functional programming meets object
orientation
• “No side effects” aids concurrent
programming
• Every variable is an object
• Every function is a value
10. Variables and Functions
var obj: java.lang.String = "Hello"
var x = new A()
def square(x: Int): Int = {  // the ': Int' after the parameter list is the return type
  x * x
}
11. Execution of a function
def square(x: Int): Int = {
  x * x
}
scala> square(2)
res0: Int = 4
scala> square(square(6))
res1: Int = 1296
12. Nested Functions
def factorial(i: Int): Int = {
  def fact(i: Int, acc: Int): Int = {
    if (i <= 1)
      acc
    else
      fact(i - 1, i * acc)
  }
  fact(i, 1)
}
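The tail-recursive factorial above can be exercised directly; a minimal sketch, with a `FactorialDemo` wrapper object added here purely for illustration:

```scala
object FactorialDemo {
  // Tail-recursive factorial, as on the slide: the accumulator carries the running product
  def factorial(i: Int): Int = {
    def fact(i: Int, acc: Int): Int =
      if (i <= 1) acc
      else fact(i - 1, i * acc)
    fact(i, 1)
  }

  def main(args: Array[String]): Unit = {
    // fact(5,1) -> fact(4,5) -> fact(3,20) -> fact(2,60) -> fact(1,120)
    println(factorial(5)) // 120
  }
}
```

Because the recursive call is in tail position, the compiler turns it into a loop, so large inputs do not grow the call stack.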
14. Higher-order map functions
val addOne = (x: Int) => x + 1
val lst = List(1, 2, 3)
lst.map(addOne)      // List(2, 3, 4)
lst.map(x => x + 1)  // List(2, 3, 4)
lst.map(_ + 1)       // List(2, 3, 4)
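The same style extends to other higher-order functions on `List`; a small runnable sketch (the values here are illustrative):

```scala
object HigherOrderDemo {
  def main(args: Array[String]): Unit = {
    val lst = List(1, 2, 3, 4)
    // Each call below is higher-order: a function is passed as the argument
    println(lst.map(_ + 1))          // increment every element: List(2, 3, 4, 5)
    println(lst.filter(_ % 2 == 0))  // keep only the even elements: List(2, 4)
    println(lst.reduce(_ + _))       // combine all elements with +: 10
  }
}
```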
15. Defining Objects
object Example {
  def main(args: Array[String]) {
    val logData = sc.textFile(logFile, 2).cache()
    -------
    -------
  }
}
Example.main(
  Array("master", "noOfMap", "noOfReducer"))
Since main takes an Array[String], the call passes an Array rather than a tuple.
16. Spark: filter transformation on an RDD
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
"Give me those lines which contain 'a'."
This is an example of a filter transformation: the predicate is applied to each line, and a new RDD containing only the matching lines is returned.
17. Count
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(
  line => line.contains("a"))
numAs.count()  // 5 for the sample file on the slide
count() is an action: it forces evaluation of the RDD and returns the number of elements (here, the number of lines containing 'a').
18. flatMap
val logData = sc.textFile(logFile, 2).cache()
val words = logData.flatMap(line => line.split(" "))
Take each line, split it on spaces, and flatten the results into a single RDD of words:
"Here is a example of filter map" → (Here, is, a, example, of, filter, map)
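flatMap behaves the same way on plain Scala collections, which makes its semantics easy to try locally; a sketch, not Spark code (the sample lines are illustrative):

```scala
object FlatMapDemo {
  def main(args: Array[String]): Unit = {
    val lines = List("Here is a example of filter map", "test")
    // map alone would give a List of Arrays; flatMap flattens them into one List of words
    val words = lines.flatMap(line => line.split(" "))
    println(words) // List(Here, is, a, example, of, filter, map, test)
  }
}
```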
19. Wordcount Example in Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://[input_path_to_textfile]")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://[output_path]")
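To see what this pipeline computes without a cluster, the reduceByKey step can be mimicked on plain Scala collections; a sketch, where `groupBy` plus a per-group sum stands in for Spark's `reduceByKey` (the sample lines are illustrative):

```scala
object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val lines = List("to be or not", "to be")
    val counts = lines
      .flatMap(_.split(" "))   // all words: to, be, or, not, to, be
      .map(word => (word, 1))  // pair each word with a count of 1
      .groupBy(_._1)           // group the (word, 1) pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s
    println(counts("to")) // 2
    println(counts("or")) // 1
  }
}
```

In Spark, `reduceByKey` does the grouping and summing in one distributed step, combining values locally on each partition before shuffling.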
20. Limitations
• RDDs are not suitable for applications that require fine-grained updates to shared state
– e.g. the storage system of a web application
21. References
• http://www.slideshare.net/tpunder/a-brief-intro-to-scala
• Joshua D. Suereth, "Scala in Depth".
• Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), USENIX Association, Berkeley, CA, USA, 2012.
• Pictures:
– http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
– http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg