Introduction to
Scala and Spark
Dr. Fabio Fumarola
Contents
• Hadoop quick introduction
• An introduction to Spark
• Spark – Architecture & Programming Model
2
Hadoop
• Open-source software for distributed storage of large
datasets on commodity hardware
• Provides a programming model/framework for processing
large datasets in parallel
3
(Diagram: MapReduce data flow: Input → Map → Reduce → Output)
Limitations of Map Reduce
• Slow due to replication, serialization, and disk I/O
• Inefficient for:
– Iterative algorithms (Machine Learning, Graphs & Network Analysis)
– Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching)
4
(Diagram: each iteration reads its input from HDFS and writes its output back to HDFS; the Input → Map → Reduce → Output flow is repeated for every iteration)
Solutions?
• Leverage memory:
– load data into memory
– replace disks with SSDs
5
Apache Spark
• A big data analytics cluster-computing framework
written in Scala.
• Originally open-sourced by the AMPLab at UC Berkeley
• Provides in-memory analytics based on RDDs
• Highly compatible with the Hadoop Storage API
– Can run on top of a Hadoop cluster
• Developers can write programs in multiple
programming languages
6
Spark architecture
7
(Diagram: the Spark Driver (Master) coordinates work through a Cluster Manager; Spark Workers with in-memory caches run alongside HDFS DataNodes holding the data blocks)
Spark
8
(Diagram: iterative processing with an HDFS read and an HDFS write around each iteration)
Spark
9
Not tied to the 2-stage MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
(Diagram: the input is read from HDFS once and then iterated on in memory; chart: logistic regression in Hadoop vs. Spark)
Spark Programming Model
10
User (Developer) writes the driver program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

(Diagram: the Driver Program holds the SparkContext, which talks to the Cluster Manager; each Worker Node runs an Executor with a Cache that executes Tasks, reading data from the HDFS DataNodes)
Spark Programming Model
11
User (Developer) writes the driver program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

RDD (Resilient Distributed Dataset):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
RDD
• Programming interface: the programmer can perform 3 types of operations
12
Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples:
• map(func)
• filter(func)
• distinct()
Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples:
• count()
• reduce(func)
• collect()
• take()
Persistence
• For caching datasets in memory for future operations.
• Option to store on disk or RAM or mixed (Storage Level).
• Examples:
• persist()
• cache()
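A minimal sketch tying the three kinds of operations together (assuming a SparkContext named sc, as in spark-shell; the data is illustrative):

val nums  = sc.parallelize(1 to 100)    // RDD[Int] built from a local collection
val evens = nums.filter(_ % 2 == 0)     // transformation: lazy, nothing is computed yet
evens.cache()                           // persistence: keep the result in memory once computed
val sum = evens.reduce(_ + _)           // action: triggers the computation and returns 2550 to the driver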
How Spark works
• RDD: parallel collection with partitions
• User applications create RDDs, transform them, and
run actions.
• This results in a DAG (Directed Acyclic Graph) of
operators.
• The DAG is compiled into stages
• Each stage is executed as a series of Tasks (one Task
per Partition).
13
Example
14
sc.textFile("/wiki/pagecounts")    // textFile → RDD[String]
Example
15
sc.textFile("/wiki/pagecounts")    // textFile → RDD[String]
  .map(line => line.split("\t"))   // map → RDD[Array[String]]
Example
16
sc.textFile("/wiki/pagecounts")    // textFile → RDD[String]
  .map(line => line.split("\t"))   // map → RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))    // map → RDD[(String, Int)]
Example
17
sc.textFile("/wiki/pagecounts")    // textFile → RDD[String]
  .map(line => line.split("\t"))   // map → RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))    // map → RDD[(String, Int)]
  .reduceByKey(_ + _)              // reduceByKey → RDD[(String, Int)]
Example
18
sc.textFile("/wiki/pagecounts")    // textFile → RDD[String]
  .map(line => line.split("\t"))   // map → RDD[Array[String]]
  .map(r => (r(0), r(1).toInt))    // map → RDD[(String, Int)]
  .reduceByKey(_ + _, 3)           // reduceByKey → RDD[(String, Int)], shuffled into 3 partitions
  .collect()                       // collect → Array[(String, Int)]
Execution Plan
Stages are sequences of RDDs that don't have a shuffle in
between
19
Stage 1: textFile → map → map
Stage 2: reduceByKey → collect
Execution Plan
20
Stage 1: textFile → map → map
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2: reduceByKey → collect
1. Read shuffle data
2. Final reduce
3. Send result to the driver program
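As a hedged aside (not on the original slides): the lineage and the shuffle boundary that splits it into stages can be inspected from the shell with toDebugString; the path is the one used in the example above.

val counts = sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)
println(counts.toDebugString)   // prints the RDD lineage; the indentation change marks the shuffle (stage) boundary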
Stage Execution
• Create a task for each Partition in the new RDD
• Serialize the Task
• Schedule and ship Tasks to Slaves
And all of this happens internally (you don't need to do anything)
21
(Diagram: Task 1 … Task n, one per partition)
Spark Executor (Slaves)
22
(Diagram: each executor core runs a loop of Fetch Input → Execute Task → Write Output)
Summary of Components
• Task: the fundamental unit of execution in Spark
• Stage: a set of Tasks that run in parallel
• DAG: logical graph of RDD operations
• RDD: parallel dataset with partitions
23
Start the docker container
From
•https://github.com/sequenceiq/docker-spark
docker run -i -t -h sandbox sequenceiq/spark:1.1.1-ubuntu /etc/bootstrap.sh -bash
•Run the spark shell using yarn or local
spark-shell --master yarn-client --driver-memory 1g --executor-memory
1g --executor-cores 2
24
Running the example and Shell
• To Run the examples
– $ run-example SparkPi 10
• We can start a Spark shell via
– spark-shell --master local[n]
• The --master option specifies the master URL for a
distributed cluster
• Example applications are also provided in Python
– spark-submit examples/src/main/python/pi.py 10
25
Collections and External Datasets
• A Collection can be parallelized using the SparkContext
– val data = Array(1, 2, 3, 4, 5)
– val distData = sc.parallelize(data)
• Spark can create distributed datasets from HDFS, Cassandra,
HBase, Amazon S3, etc.
• Spark supports text files, SequenceFiles, and any other
Hadoop InputFormat
• Files can be read from a local or remote URI (hdfs://, s3n://)
– scala> val distFile = sc.textFile("data.txt")
– distFile: RDD[String] = MappedRDD@1d4cee08
– distFile.map(s => s.length).reduce((a,b) => a + b)
26
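A short hedged sketch of controlling partitioning when creating RDDs (the HDFS path is illustrative):

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data, 3)              // explicitly cut the collection into 3 partitions
val distFile = sc.textFile("hdfs:///data.txt", 4)   // ask for at least 4 partitions for the file
distFile.partitions.size                            // check how many partitions were actually created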
RDD operations
• Sum the lengths of the lines in a file
– val lines = sc.textFile("data.txt")
– val lineLengths = lines.map(s => s.length)
– val totalLength = lineLengths.reduce((a, b) => a + b)
• If we want to use lineLengths later we can run
– lineLengths.persist()
• This keeps lineLengths in memory after the first time
it is computed, so later actions can reuse it
27
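A hedged sketch of the effect (the second action and max() are additions for illustration):

val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
lineLengths.persist()                     // same as lineLengths.cache()
val total   = lineLengths.reduce(_ + _)   // first action: reads the file, computes, and fills the cache
val longest = lineLengths.max()           // second action: served from the cache, data.txt is not re-read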
Passing a function to Spark
• Spark is based on anonymous function syntax
– (x: Int) => x * x
• Which is a shorthand for
new Function1[Int,Int] {
def apply(x: Int) = x * x
}
• We can define functions with more parameters, or with none
– (x: Int, y: Int) => "(" + x + ", " + y + ")"
– () => { System.getProperty("user.dir") }
• This syntax is shorthand for
– Function1[-T, +R] … Function22[…]
28
Passing a function to Spark
object MyFunctions {
  def func1(s: String): String = s + s
}

file.map(MyFunctions.func1)

class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
29
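One caveat worth adding (a sketch of the general closure-capture advice; MyOtherClass, suffix and local are illustrative names): referencing a field inside rdd.map captures the whole enclosing object, which then has to be serialized and shipped with the task; copying the field to a local variable avoids that.

import org.apache.spark.rdd.RDD

class MyOtherClass {
  val suffix = "!"
  def doStuff(rdd: RDD[String]): RDD[String] = {
    val local = suffix          // copy the field into a local variable...
    rdd.map(x => x + local)     // ...so the closure captures only `local`, not the whole object
  }
}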
Working with Key-Value Pairs
• We can set up RDDs of key-value pairs, which are
cast to the Tuple2 type
– val lines = sc.textFile("data.txt")
– val pairs = lines.map(s => (s, 1))
– val counts = pairs.reduceByKey((a, b) => a + b)
• We can use counts.sortByKey() to sort
• And finally counts.collect() to bring them back
• NOTE: when using custom objects as keys we
should be sure that they implement equals()
together with a matching hashCode()
http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()
30
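A hedged sketch of a safe custom key (Page is an illustrative case class; case classes get consistent equals() and hashCode() for free):

case class Page(lang: String, title: String)
val hits = sc.parallelize(Seq(
  (Page("en", "Spark"), 1),
  (Page("en", "Spark"), 1),
  (Page("it", "Scala"), 1)))
hits.reduceByKey(_ + _).collect()
// Array((Page(en,Spark),2), (Page(it,Scala),1)), order may vary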
Transformations
• There are several transformations supported by
Spark
– map
– filter
– flatMap
– mapPartitions
– ….
– http://spark.apache.org/docs/latest/programming-guide.html
• When are they executed?
31
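(They run lazily: only when an action forces them.) A small hedged sketch contrasting map and flatMap, with illustrative data:

val lines = sc.parallelize(Seq("to be", "or not to be"))
lines.map(_.split(" ")).collect()       // Array(Array(to, be), Array(or, not, to, be))
lines.flatMap(_.split(" ")).collect()   // Array(to, be, or, not, to, be)
lines.flatMap(_.split(" ")).distinct().collect()   // Array(to, be, or, not), order may vary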
Actions
• The following table lists some of the common actions
supported:
– reduce
– collect
– count
– first
– take
– takeSample
32
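A quick hedged sketch of what these actions return (the values shown assume the illustrative RDD below):

val nums = sc.parallelize(1 to 100)
nums.count()                   // 100
nums.first()                   // 1
nums.take(3)                   // Array(1, 2, 3)
nums.takeSample(false, 5, 42)  // 5 random elements, sampled without replacement, fixed seed
nums.reduce(_ + _)             // 5050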
RDD Persistence
• One of the most important capabilities in Spark is persisting
(or caching) a dataset in memory across operations
• Caching is a key tool for iterative algorithms and fast
interactive use
• You can mark an RDD to be persisted using the persist() or
cache() methods on it
• The first time it is computed in an action, it will be kept in
memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created
it.
33
RDD persistence
• In addition, each persisted RDD can be stored using a
different storage level,
• for example we can persist
– the dataset on disk,
– in memory but as serialized Java objects (to save space), replicate it
across nodes,
– off-heap in Tachyon
• Note: In Python, stored objects will always be serialized with
the Pickle library, so it does not matter whether you choose a
serialized level.
• Spark also automatically persists some intermediate data in
shuffle operations (e.g. reduceByKey), even without users
calling persist
34
Which Storage Level to Choose?
• Use MEMORY_ONLY if the dataset fits in main memory
• If not, try using MEMORY_ONLY_SER and selecting a fast
serialization library to make the objects much more space-
efficient, but still reasonably fast to access.
• Don’t spill to disk unless the functions that computed your
datasets are expensive, or they filter a large amount of the
data. Otherwise, recomputing a partition may be as fast as
reading it from disk.
• Use the replicated storage levels if you want fast fault
recovery
• Use OFF_HEAP in environments with high amounts of memory
or multiple applications
35
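A hedged sketch of choosing a level explicitly (the commented lines show alternatives; an RDD can only be given one storage level, so pick one):

import org.apache.spark.storage.StorageLevel

val pages = sc.textFile("data.txt")
pages.persist(StorageLevel.MEMORY_ONLY_SER)    // serialized in memory: more compact, a bit more CPU to read
// pages.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that don't fit in memory to disk
// pages.persist(StorageLevel.MEMORY_ONLY_2)   // replicate each partition on two nodes for fast recovery
pages.count()                                  // the first action materializes the cache
pages.unpersist()                              // drop it when no longer needed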
Shared Variables
• Normally, when a function is executed on a remote
node it works on its own copies of the variables
• However, Spark does provide two types of shared
variables for two usages:
– Broadcast variables
– Accumulators
36
Broadcast Variables
• Broadcast variables allow the programmer to keep a
read-only variable cached on each machine rather
than shipping a copy of it with tasks.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] =
Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
37
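A hedged usage sketch (the lookup table and codes are illustrative): the broadcast value is shipped to each machine once and read inside tasks via .value.

val lookup = sc.broadcast(Map("it" -> "Italian", "en" -> "English"))
val codes  = sc.parallelize(Seq("it", "en", "it"))
codes.map(c => lookup.value.getOrElse(c, "unknown")).collect()
// Array(Italian, English, Italian)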
Accumulators
• Accumulators are variables that are only “added” to through
an associative operation and can therefore be efficiently
supported in parallel
• Spark natively supports accumulators of numeric types, and
programmers can add support for new types
• Note: not yet supported on Python
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
res7: Int = 10
38
Accumulators
object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  def zero(initialValue: Vector): Vector = {
    Vector.zeros(initialValue.size)
  }
  def addInPlace(v1: Vector, v2: Vector): Vector = {
    v1 += v2
  }
}

// Then, create an Accumulator of this type:
val vecAccum = sc.accumulator(new Vector(...))(VectorAccumulatorParam)
39
Spark Examples
• Let’s walk through
http://spark.apache.org/examples.html
• Other examples are on:
• Basic samples =>
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples
• Streaming samples =>
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming
40
Create a Self Contained App in
Scala
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
41
Create a Self Contained App in
Scala
Create an sbt build file (simple.sbt in the project listing on the next slide)
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
42
Project folder
• That how the project directory should look
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
• With sbt package we can create the jar
• To submit the job
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
43
Gradle Project
• https://github.com/fabiofumarola/spark-demo
44
Spark Streaming
45
A simple example
• We create a local StreamingContext with two execution
threads and a batch interval of 1 second.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
46
A simple example
• Using this context, we can create a DStream that represents
streaming data from a TCP source
val lines = ssc.socketTextStream("localhost", 9999)
• Split each line into words
val words = lines.flatMap(_.split(" "))
• Count each word in the batch
import org.apache.spark.streaming.StreamingContext._
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
47
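As a hedged extension of this example (not on the original slide): the same pairs DStream can also be counted over a sliding window with reduceByKeyAndWindow.

// Count words over the last 30 seconds of data, every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()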
A simple example
• Note that when these lines are executed, Spark Streaming
only sets up the computation it will perform when it is
started, and no real processing has started yet
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
• Start netcat as a data server by using
– nc -lk 9999
48
A simple example
• If you have already downloaded and built Spark, you
can run this example as follows. You will first need to
run Netcat (a small utility found in most Unix-like
systems) as a data server by using
– nc -lk 9999
• Run the example by
– run-example streaming.NetworkWordCount localhost
9999
• http://spark.apache.org/docs/latest/streaming-programming-guide.html
49


Editor's Notes

  1. Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters, and that too in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves. Currently there are 2 types of RDDs:
     - Parallelized collections: created by calling the parallelize method on an existing Scala collection. The developer can specify the number of slices to cut the dataset into; ideally 2-3 slices per CPU.
     - Hadoop datasets: distributed datasets created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase, etc.). These are created using SparkContext's textFile method. The default number of slices in this case is 1 slice per file block.
  2. Transformations: like map, they take an RDD as input, pass and process each element through a function, and return a new transformed RDD as output. By default, each transformed RDD is recomputed each time you run an action on it, unless you specify the RDD to be cached in memory; Spark will then try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the storage level for persisting an RDD; cache is just shorthand for the default storage level, which is MEMORY_ONLY.
     - MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
     - MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
     - MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
     - MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
     - DISK_ONLY: store the RDD partitions only on disk.
     - MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
     Which storage level is best? A few things to consider: try to keep as much as possible in memory; try not to spill to disk unless your computed datasets are memory-expensive; use replication only if you want fault tolerance.