This document provides an introduction to Apache Spark, covering its architecture and programming model. Spark is a cluster computing framework for fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by supporting iterative algorithms and interactive querying through resilient distributed datasets (RDDs), which can be cached in memory between operations. An RDD is an immutable, partitioned collection of records manipulated through transformations, which lazily describe a new RDD without computing it, and actions, which trigger the distributed computation and return a result.
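To make the transformation/action distinction concrete, the following is a minimal sketch in Scala, Spark's native API. The application name, object name, and the specific dataset are illustrative choices, and local[*] mode is used only so the example runs on a single machine; a real deployment would point at a cluster master instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; a cluster deployment would set a real master URL.
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from an in-memory collection, partitioned across cores.
    val numbers = sc.parallelize(1 to 1000000)

    // Transformations (filter, map) are lazy: they only record the lineage.
    val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)

    // cache() marks the RDD for in-memory reuse across multiple actions.
    squaresOfEvens.cache()

    // Actions (count, take) trigger the actual distributed computation.
    println(s"count = ${squaresOfEvens.count()}")                  // first action computes and caches
    println(s"sample = ${squaresOfEvens.take(5).mkString(", ")}")  // subsequent action reuses the cache

    sc.stop()
  }
}
```

Because the RDD is cached after the first action, the second action reads the partitions from memory rather than recomputing the lineage, which is exactly the property that makes iterative and interactive workloads faster than a MapReduce pipeline that rereads its input from disk on every pass.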