Apache Spark
Fernando Rodriguez Olivera
@frodriguez
Buenos Aires, Argentina, Nov 2014
JAVACONF 2014
Fernando Rodriguez Olivera
Twitter: @frodriguez
Professor at Universidad Austral (Distributed Systems, Compiler
Design, Operating Systems, …)
Creator of mvnrepository.com
Organizer at Buenos Aires High Scalability Group, Professor at
nosqlessentials.com
Apache Spark
Apache Spark is a Fast and General Engine for Large-Scale Data Processing
Supports Batch, Interactive and Stream Processing with a Unified API
In-Memory Computing Primitives
Hadoop MR Limits
[diagram: independent MapReduce jobs chained through HDFS]
- Communication between jobs goes through the filesystem
- Fault tolerance (between jobs) relies on persistence to the filesystem
- Memory is not managed (relies on OS caches)
MapReduce was designed for batch processing; other workloads are
compensated with specialized systems: Storm, Samza, Giraph, Impala, Presto, etc.
Daytona Gray Sort 100TB Benchmark
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

                      Data Size   Time     Nodes   Cores
Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 (physical)
Apache Spark (2014)   100 TB      23 min   206     6,592 (virtualized)
3X faster using 10X fewer machines
Hadoop vs Spark for Iterative Processing
source: https://spark.apache.org/
Logistic regression in Hadoop and Spark
Apache Spark
[stack diagram: Spark SQL, Spark Streaming, MLlib and GraphX built on top of Apache Spark (Core)]
Powered by Scala and Akka
Resilient Distributed Datasets (RDD)
[diagram: an RDD of Strings with elements such as "Hello World", "A New Line", "hello", "The End", …]
- Immutable Collection of Objects
- Partitioned and Distributed
- Stored in Memory
- Partitions Recomputed on Failure
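A minimal sketch of these properties in code, assuming a running SparkContext named sc (Spark 1.x API):

val lines = sc.parallelize(Seq("Hello World", "A New Line", "hello", "The End"), numSlices = 2)
println(lines.partitions.length)      // 2 -> the collection is partitioned and distributed
val upper = lines.map(_.toUpperCase)  // RDDs are immutable: map returns a new RDD, lines is unchanged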
RDD Transformations and Actions
[diagram: an RDD of Strings ("Hello World", "A New Line", "hello", "The End", …) is turned into an
RDD of Ints (11, 10, 5, 7, …) by a compute function, e.g. one that counts the chars of each line
(a transformation); the new RDD depends on its parent. An action then reduces the RDD to a single
Int value N.]

RDD Implementation:
- Partitions
- Compute Function
- Dependencies
- Preferred Compute Location (for each partition)
- Partitioner
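For reference, the pieces listed above map roughly onto the following simplified Scala sketch of the
Spark 1.x RDD contract (SketchRDD is an illustrative name, not the real class definition):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class SketchRDD[T] {
  def compute(split: Partition, context: TaskContext): Iterator[T]          // compute function per partition
  protected def getPartitions: Array[Partition]                             // the set of partitions
  protected def getDependencies: Seq[Dependency[_]]                         // parent RDDs it depends on
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil  // preferred compute location
  val partitioner: Option[Partitioner] = None                               // optional partitioner for keyed data
}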
Spark API
Scala:
val spark = new SparkContext()
val lines = spark.textFile("hdfs://docs/")               // RDD[String]
val nonEmpty = lines.filter(l => l.nonEmpty)             // RDD[String]
val count = nonEmpty.count

Java 8:
JavaSparkContext spark = new JavaSparkContext();
JavaRDD<String> lines = spark.textFile("hdfs://docs/");
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0);
long count = nonEmpty.count();

Python:
spark = SparkContext()
lines = spark.textFile("hdfs://docs/")
nonEmpty = lines.filter(lambda line: len(line) > 0)
count = nonEmpty.count()
RDD Operations

Transformations:        Actions:
map(func)               reduce(func)
flatMap(func)           count()
filter(func)            collect()
mapValues(func)         take(N)
groupByKey()            takeOrdered(N)
reduceByKey(func)       top(N)
…                       …
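A short sketch of how these behave, assuming a SparkContext named sc: transformations are lazy and
only record lineage, while actions trigger a job on the cluster.

val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)       // transformation: nothing executes yet
val pairs = evens.map(n => (n % 4, n))    // transformation
val sums  = pairs.reduceByKey(_ + _)      // transformation (requires a shuffle)
println(sums.collect().toList)            // action: the job actually runs here
println(nums.count())                     // action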
Text Processing Example
Top Words by Frequency
(Step by step)
Create RDD from External Data

// Step 1 - Create RDD from Hadoop Text File
val docs = spark.textFile("/docs/")

[diagram: Apache Spark sits on top of the Hadoop FileSystem APIs, I/O Formats and Codecs,
reading from and writing to HDFS, S3, HBase, Cassandra, MongoDB, ElasticSearch, …]

Spark can read/write from any data source supported by Hadoop
I/O via Hadoop is optional (e.g. the Cassandra connector bypasses Hadoop)
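A hedged illustration of the same point, reusing the SparkContext named spark from Step 1 (the paths
are placeholders):

val docs = spark.textFile("hdfs://namenode/docs/")       // also s3n://..., file://..., etc.
val nonEmpty = docs.filter(_.nonEmpty)
nonEmpty.saveAsTextFile("hdfs://namenode/docs-clean/")   // written back through Hadoop output formats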
Function map
[diagram: RDD[String] ("Hello World", "A New Line", "hello", "The end", …)
 --.map(line => line.toLowerCase)--> RDD[String] ("hello world", "a new line", "hello", "the end", …)]

.map(line => line.toLowerCase)  =  .map(_.toLowerCase)

// Step 2 - Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap
[diagram: RDD[String] ("hello world", "a new line", "hello", "the end", …)
 --.map(_.split("\\s+"))--> RDD[Array[String]] (["hello","world"], ["a","new","line"], ["hello"], ["the","end"], …)
 --.flatten--> RDD[String] ("hello", "world", "a", "new", "line", "hello", …)]

.flatMap(line => line.split("\\s+")) combines both steps
Note: flatten() is not available in Spark, only flatMap

// Step 3 - Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
Key-Value Pairs
[diagram: RDD[String] ("hello", "a", "world", "new", "line", "hello", …)
 --.map(word => (word, 1))--> RDD[(String, Int)] (("hello",1), ("a",1), ("world",1), ("new",1), ("line",1), ("hello",1), …)]

.map(word => (word, 1))  =  .map(word => Tuple2(word, 1))
RDD[(String, Int)] = RDD[Tuple2[String, Int]], a Pair RDD

// Step 4 - Map each word to a (word, 1) pair
val counts = words.map(word => (word, 1))
Shuffling
[diagram: starting from RDD[(String, Int)] (("hello",1), ("a",1), ("world",1), ("new",1), ("line",1), ("hello",1), …):
 - .groupByKey --> RDD[(String, Iterator[Int])] (("world",[1]), ("a",[1]), ("new",[1]), ("line",[1]), ("hello",[1,1]), …)
   followed by .mapValues(_.reduce((a, b) => a + b)) --> RDD[(String, Int)] (("hello",2), ("world",1), …)
 - .reduceByKey((a, b) => a + b) --> the same result in one step]

// Step 5 - Count all words
val freq = counts.reduceByKey(_ + _)
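Both paths can be written as below; reduceByKey is preferred because values are combined automatically
per partition before the shuffle (a sketch, assuming counts: RDD[(String, Int)] from Step 4):

val viaGroup  = counts.groupByKey().mapValues(_.reduce(_ + _))  // ships every (word, 1) pair across the network
val viaReduce = counts.reduceByKey(_ + _)                       // combines locally, then shuffles the partial sums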
Top N (Prepare data)
[diagram: RDD[(String, Int)] (("world",1), ("a",1), ("new",1), ("line",1), ("hello",2))
 --.map(_.swap)--> RDD[(Int, String)] ((1,"world"), (1,"a"), (1,"new"), (1,"line"), (2,"hello"))]

// Step 6 - Swap tuples (partial code)
freq.map(_.swap)
Top N (First Attempt)
[diagram: RDD[(Int, String)] ((1,"world"), (1,"a"), (1,"new"), (1,"line"), (2,"hello"))
 --.sortByKey--> RDD[(Int, String)] ((2,"hello"), (1,"world"), (1,"a"), (1,"new"), (1,"line"))
 --.take(N)--> Array[(Int, String)] ((2,"hello"), (1,"world"), …)]

(sortByKey(false) for descending)
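Written out, the first attempt looks like this sketch (N is an example value). It works, but sortByKey
sorts the entire RDD with a full shuffle just to keep the first N elements:

val N = 20                                        // example value
val topFirst = freq.map(_.swap)                   // RDD[(Int, String)]
                   .sortByKey(ascending = false)  // full sort, descending by count
                   .take(N)                       // Array[(Int, String)]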
Top N
[diagram: RDD[(Int, String)] ((1,"world"), (1,"a"), (1,"new"), (1,"line"), (2,"hello"))
 --.top(N)--> each partition computes a local top N*, the partial results are then reduced
 into a single Array[(Int, String)] ((2,"hello"), (1,"line"), …)]

// Step 6 - Swap tuples (complete code)
val top = freq.map(_.swap).top(N)

* local top N implemented by bounded priority queues
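For completeness, top(N) returns the N largest elements under the implicit ordering, while
takeOrdered(N) from the operations table is its ascending counterpart (a usage sketch):

val mostFrequent  = freq.map(_.swap).top(20)          // 20 highest counts, descending
val leastFrequent = freq.map(_.swap).takeOrdered(20)  // 20 lowest counts, ascending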
Top Words by Frequency (Full Code)

val spark = new SparkContext()

// RDD creation from external data source
val docs = spark.textFile("hdfs://docs/")

// Split lines into words
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))

// Count all words (automatic combination)
val freq = counts.reduceByKey(_ + _)

// Swap tuples and get top results
val top = freq.map(_.swap).top(N)

top.foreach(println)
RDD Persistence (in-memory)
[diagram: an RDD whose partitions are kept in memory]

.cache()                (memory only)
.persist()              (memory only)
.persist(storageLevel)

StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, …

(lazy persistence & caching)
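A minimal sketch, assuming the words RDD from the example above; persistence is lazy, so the
partitions are materialized by the first action and reused afterwards:

import org.apache.spark.storage.StorageLevel

words.persist(StorageLevel.MEMORY_AND_DISK_SER)  // .cache() is shorthand for persist(MEMORY_ONLY)
words.count()                                    // first action computes and stores the partitions
words.count()                                    // served from the persisted copy
words.unpersist()                                // free the storage when done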
SchemaRDD
[diagram: a SchemaRDD is an RDD of Row objects]

- RDD of Row + Column Metadata
- Queries with SQL
- Support for Reflection, JSON, Parquet, …
SchemaRDD
[diagram: topWords as a SchemaRDD of Row objects]

case class Word(text: String, n: Int)

val wordsFreq = freq.map {
  case (text, count) => Word(text, count)
}                                    // RDD[Word]

wordsFreq.registerTempTable("wordsFreq")

val topWords = sql("""select text, n
                      from wordsFreq
                      order by n desc
                      limit 20""")   // RDD[Row]

topWords.collect().foreach(println)
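The snippet above assumes a Spark SQL context is already in scope; in the Spark 1.1-era API that
setup looked roughly like this sketch (sc is the existing SparkContext):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._   // brings sql(...) and the case-class-to-SchemaRDD implicit conversion into scope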
RDD Lineage

words = sc.textFile("hdfs://large/file/")   // HadoopRDD
        .map(_.toLowerCase)                 // MappedRDD
        .flatMap(_.split(" "))              // FlatMappedRDD

nums  = words.filter(_.matches("[0-9]+"))   // FilteredRDD
alpha = words.filter(_.matches("[a-z]+"))   // FilteredRDD

alpha.count()                               // Action (runs the job on the cluster)

The RDD transformations build the lineage on the driver; the action runs the job on the cluster.
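A quick way to inspect the lineage that has been built is toDebugString (the exact output format
varies by Spark version); a sketch using the RDDs above:

println(alpha.toDebugString)   // e.g. FilteredRDD <- FlatMappedRDD <- MappedRDD <- HadoopRDD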
Deployment with Hadoop

[diagram: the file /large/file is split into blocks A, B, C, D and replicated across HDFS
Data Nodes 1-4, with a Name Node holding the metadata. A Spark Worker runs co-located with
each Data Node (DN + Spark), under a Spark Master. The client submits the application
(mode=cluster); the Driver and the Executors run on the workers, and the Master allocates
resources (cores and memory) to the application.]
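A hedged sketch of the driver-side configuration that matches this picture; the application name,
master URL and sizes are placeholders and are normally supplied via spark-submit instead:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("TopWords")                  // placeholder application name
  .setMaster("spark://spark-master:7077")  // placeholder Spark Master URL
  .set("spark.executor.memory", "2g")      // memory allocated to each executor
  .set("spark.cores.max", "8")             // total cores this application may claim
val sc = new SparkContext(conf)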
Fernando Rodriguez Olivera
twitter: @frodriguez
