Apache Spark
uweseiler
24.11.2014
About me
Big Data Nerd
Hadoop Trainer, NoSQL Fan Boy
Photography Enthusiast, Travelpirate
About us
specializes in...
Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus,
Enterprise Java Specialists, Performance Geeks
Join us!
Agenda
• Why?
• How?
• What else?
• Who?
• Future?
Spark: In a tweet
"Spark … is what you might call a Swiss Army knife of Big Data analytics tools"
– Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
  • in-memory storage
  • data locality
  • (micro) batch & streaming support
• Improves usability via
  • Rich APIs in Scala, Java, Python
  • Interactive shell
• Runs standalone, on YARN, on Mesos, and on Amazon EC2
Spark is also…
• Came out of AMPLab at UC Berkeley in 2009
• A top-level Apache project as of 2014
  – http://spark.apache.org
• Backed by a commercial entity: Databricks
• A toolset for Data Scientists / Analysts
• An implementation of Resilient Distributed Datasets (RDDs) in Scala
• Hadoop compatible
Spark: Trends
[Google Trends comparison of Apache Drill, Apache Storm, Apache Spark, Apache YARN and Apache Tez]
Generated using http://www.google.com/trends/
Spark: Community
https://github.com/apache/spark/pulse
Spark: Performance
Sort record: 3x faster using 10x fewer machines than the previous Hadoop-based record
http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html
http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
Spark: Ecosystem
[Stack diagram]
• HDFS: Redundant, reliable storage
• MapReduce: Cluster resource mgmt. + data processing
• Spark Core
• On top of Spark Core: Spark SQL (SQL), Spark Streaming (Streaming), MLlib (Machine Learning), SparkR (R on Spark), GraphX (Graph Computation), BlinkDB
Agenda
• Why?
• How?
• What else?
• Who?
• Future?
Spark: Core Concept
• Resilient Distributed Dataset (RDD)
  Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
  [Diagram: one RDD with partitions A11, A12, A13]
• Read-only collection of objects spread across a cluster
• Built through parallel transformations & actions
• Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence
Spark: RDD Example
// Base RDD from HDFS
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("Error"))
val messages = errors.map(_.split("\t")(2))
messages.cache()   // RDD in memory

// Iterative processing
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
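For reference, the same log-mining pattern as a self-contained sketch. The local master, app name and the assumption that log lines have at least three tab-separated fields are mine, not from the slide; the HDFS path is left elided as on the slide.

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    // Assumed configuration: local mode with two threads
    val spark = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[2]"))

    val lines    = spark.textFile("hdfs://...")          // base RDD (path elided as on the slide)
    val errors   = lines.filter(_.startsWith("Error"))   // transformation, evaluated lazily
    val messages = errors.map(_.split("\t")(2))          // pick the third tab-separated field
    messages.cache()                                     // keep the RDD in memory after the first action

    // Each count() reuses the cached messages RDD instead of re-reading HDFS
    for (str <- Array("foo", "bar"))
      println(str + ": " + messages.filter(_.contains(str)).count())

    spark.stop()
  }
}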
Spark: Transformations
Transformations - Create new datasets from existing ones
map
Spark: Transformations
Transformations - Create new datasets from existing ones
map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks])
groupByKey([numTasks])
sortByKey([ascending], [numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)
sample(withReplacement, fraction, seed)
repartition(numPartitions)
Spark: Actions
Actions - Return a value to the client after running a computation on the dataset
reduce
Spark: Actions
Actions - Return a value to the client after running a computation on the dataset
reduce(func)
collect()
count()
first()
countByKey()
foreach(func)
take(n)
takeSample(withReplacement, num, [seed])
takeOrdered(n, [ordering])
saveAsTextFile(path)
saveAsSequenceFile(path) (Only Java and Scala)
saveAsObjectFile(path) (Only Java and Scala)
Spark: Dataflow
All transformations in Spark are lazy and are only computed when an action requires it.
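A minimal sketch of that behaviour, assuming an existing SparkContext named sc:

// Nothing runs here: filter only records the lineage of the new RDD
val evens = sc.parallelize(1 to 1000000).filter(_ % 2 == 0)

// The action triggers the actual computation
println(evens.count())   // 500000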
Spark: Persistence
One of the most important capabilities in Spark is caching a dataset in memory across operations.
• cache() - always uses MEMORY_ONLY
• persist() - defaults to MEMORY_ONLY
Spark: Storage Levels
• persist(StorageLevel)

MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, …: Same as the levels above, but replicate each partition on two cluster nodes.
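As an illustrative sketch (assuming an existing SparkContext sc and a placeholder input path), an explicit storage level is chosen like this:

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs://docs/").flatMap(_.split("\\s+"))

// Spill partitions that don't fit in memory to disk instead of recomputing them.
// Note: an RDD's storage level can only be assigned once.
words.persist(StorageLevel.MEMORY_AND_DISK)

println(words.count())   // the first action materializes and persists the partitions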
Spark: Parallelism
Can be specified in a number of different ways
• RDD partition number
  • sc.textFile(input, minSplits = 10)
  • sc.parallelize(1 to 10000, numSlices = 10)
• Mapper-side parallelism
  • Usually inherited from parent RDD(s)
• Reducer-side parallelism
  • rdd.reduceByKey(_ + _, numPartitions = 10)
  • rdd.reduceByKey(partitioner = p, _ + _)
• "Zoom in/out"
  • rdd.repartition(numPartitions: Int)
  • rdd.coalesce(numPartitions: Int, shuffle: Boolean)
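A small sketch (assuming an existing SparkContext sc) of inspecting and changing the partition count:

val rdd = sc.parallelize(1 to 10000, numSlices = 10)
println(rdd.partitions.size)                   // 10

val wider = rdd.repartition(20)                // shuffles data to increase parallelism
val fewer = rdd.coalesce(2, shuffle = false)   // merges partitions without a shuffle
println(wider.partitions.size + " / " + fewer.partitions.size)   // 20 / 2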
Spark: Example
Text Processing Example
Top words by frequency
Spark: Frequency Example
Create RDD from external data
Data sources supported by Hadoop: HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, … (I/O via Hadoop optional)

// Step 1. Create RDD from Hadoop text files
val docs = spark.textFile("hdfs://docs/")
Spark: Frequency Example
Function map
RDD[String]:  "Hello World", "This is Spark", "Spark", "The end"
  .map(line => line.toLowerCase)
RDD[String]:  "hello world", "this is spark", "spark", "the end"
Spark: Frequency Example
Function map
RDD[String]:  "Hello World", "This is Spark", "Spark", "The end"
  .map(line => line.toLowerCase)  =  .map(_.toLowerCase)
RDD[String]:  "hello world", "this is spark", "spark", "the end"
Spark: Frequency Example
Function map
RDD[String]:  "Hello World", "This is Spark", "Spark", "The end"
  .map(line => line.toLowerCase)  =  .map(_.toLowerCase)
RDD[String]:  "hello world", "this is spark", "spark", "the end"

// Step 2. Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Spark: Frequency Example
map vs. flatMap
RDD[String]:         "hello world", "this is spark", "spark", "the end"
  .map(_.split("\\s+"))
RDD[Array[String]]:  (hello, world), (this, is, spark), (spark), (the, end)
Spark: Frequency Example
map vs. flatMap
RDD[String]:         "hello world", "this is spark", "spark", "the end"
  .map(_.split("\\s+"))
RDD[Array[String]]:  (hello, world), (this, is, spark), (spark), (the, end)
  .flatten*
RDD[String]:         hello, world, this, is, spark, spark, the, end
Spark: Frequency Example
map vs. flatMap
RDD[String]:         "hello world", "this is spark", "spark", "the end"
  .map(_.split("\\s+"))
RDD[Array[String]]:  (hello, world), (this, is, spark), (spark), (the, end)
  .flatten*
RDD[String]:         hello, world, this, is, spark, spark, the, end

.map(…) followed by .flatten*  =  .flatMap(line => line.split("\\s+"))
Spark: Frequency Example
map vs. flatMap
RDD[String]:         "hello world", "this is spark", "spark", "the end"
  .map(_.split("\\s+"))
RDD[Array[String]]:  (hello, world), (this, is, spark), (spark), (the, end)
  .flatten*
RDD[String]:         hello, world, this, is, spark, spark, the, end

.map(…) followed by .flatten*  =  .flatMap(line => line.split("\\s+"))

// Step 3. Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
Spark: Frequency Example
Key-Value Pairs
RDD[String]:        hello, world, spark, spark, end, …
  .map(word => Tuple2(word, 1))
RDD[(String, Int)]: (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1), …
Spark: Frequency Example
Key-Value Pairs
RDD[String]:        hello, world, spark, spark, end, …
  .map(word => Tuple2(word, 1))  =  .map(word => (word, 1))
RDD[(String, Int)]: (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1), …
Spark: Frequency Example
Key-Value Pairs
RDD[String]:        hello, world, spark, spark, end, …
  .map(word => Tuple2(word, 1))  =  .map(word => (word, 1))
RDD[(String, Int)]: (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1), …

// Step 4. Convert into tuples
val counts = words.map(word => (word, 1))
Spark: Frequency Example
Shuffling
RDD[(String, Int)]:           (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1), …
  .groupByKey
RDD[(String, Iterable[Int])]: (end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])
Spark: Frequency Example
Shuffling
RDD[(String, Int)]:           (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1), …
  .groupByKey
RDD[(String, Iterable[Int])]: (end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])
  .mapValues(_.reduce((a, b) => a + b))
RDD[(String, Int)]:           (end, 1), (hello, 1), (spark, 2), (world, 1)
Spark: Frequency Example
Shuffling
RDD[(String, Int)]:           (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1), …
  .groupByKey
RDD[(String, Iterable[Int])]: (end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])
  .mapValues(_.reduce((a, b) => a + b))
RDD[(String, Int)]:           (end, 1), (hello, 1), (spark, 2), (world, 1)

.groupByKey followed by .mapValues(_.reduce((a, b) => a + b))  =  .reduceByKey((a, b) => a + b)
Spark: Frequency Example
Shuffling
RDD[(String, Int)]:           (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1), …
  .groupByKey
RDD[(String, Iterable[Int])]: (end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])
  .mapValues(_.reduce((a, b) => a + b))
RDD[(String, Int)]:           (end, 1), (hello, 1), (spark, 2), (world, 1)

// Step 5. Count all words
val freq = counts.reduceByKey(_ + _)
Spark: Frequency Example
Top N (Prepare data)
RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1)
  .map(_.swap)
RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)

// Step 6. Swap tuples (partial code)
freq.map(_.swap)
Spark: Frequency Example
Top N (First Attempt)
RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
  .sortByKey
RDD[(Int, String)]: (2, spark), (1, end), (1, hello), (1, world)
Spark: Frequency Example
Top N
RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
  .top(N)
Each partition computes its local top N
Spark: Frequency Example
Top N
RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
  .top(N)
Each partition computes its local top N, which are then reduced to the final result
Array[(Int, String)]: (2, spark), (1, end)
Spark: Frequency Example
Top N
RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
  .top(N)
Each partition computes its local top N, which are then reduced to the final result
Array[(Int, String)]: (2, spark), (1, end)

// Step 6. Swap tuples (complete code)
val top = freq.map(_.swap).top(N)
Spark: Frequency Example
val spark = new SparkContext()

// Create RDD from Hadoop text file
val docs = spark.textFile("hdfs://docs/")

// Split lines into words and process
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))

// Count all words
val freq = counts.reduceByKey(_ + _)

// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
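Wrapped into a self-contained application (a sketch; the app name, the input path and N = 10 are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object WordFrequency {
  def main(args: Array[String]): Unit = {
    val spark = new SparkContext(new SparkConf().setAppName("WordFrequency"))
    val N = 10                                           // assumed result size

    val docs  = spark.textFile("hdfs://docs/")           // placeholder input path
    val words = docs.map(_.toLowerCase).flatMap(_.split("\\s+"))
    val freq  = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Swap to (count, word) so that top(N) orders by frequency
    freq.map(_.swap).top(N).foreach(println)

    spark.stop()
  }
}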
Agenda
• Why?
• How?
• What else?
• Who?
• Future?
Spark: Streaming
• Real-time computation
• Similar to Apache Storm…
• Streaming input is split into sliding windows of RDDs
• Input is replicated in memory for fault tolerance
• Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
Spark: Streaming
[Diagrams: Discretized Stream (DStream) and Windowed Computations]
Spark: Streaming
TwitterUtils.createStream()
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
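For a runnable starting point, here is a minimal sketch that swaps the Twitter source for a plain socket source; the StreamingContext setup, host and port are assumptions, not from the slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))     // 5-second micro-batches

    // Lines arriving on a TCP socket form a DStream, i.e. a sequence of RDDs
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}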
Spark: SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL or Scala
• Uses SchemaRDDs composed of Row objects (comparable to a table in a traditional RDBMS)
• A SchemaRDD can be created from
  • an existing RDD
  • a Parquet file
  • a JSON dataset
  • or by running HiveQL against data stored in Apache Hive
• Supports a domain-specific language for writing queries
Spark: SQL
registerFunction("LEN", (_: String).length)

val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")

queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
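As a hedged sketch of where the counts table could come from (Spark 1.1-era SchemaRDD API; the names and the reuse of freq from the frequency example are assumptions):

import org.apache.spark.sql.SQLContext

case class WordCount(word: String, total: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD       // implicit conversion RDD -> SchemaRDD

// freq is the (word, count) RDD from the frequency example
val countsRdd = freq.map { case (w, c) => WordCount(w, c) }
countsRdd.registerTempTable("counts")

sqlContext.sql("SELECT word, total FROM counts ORDER BY total DESC LIMIT 10")
  .collect()
  .foreach(println)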
Spark: GraphX
• GraphX is the Spark API for graphs and graph-parallel computation
• APIs to join and traverse graphs
• Optimally partitions and indexes vertices & edges (represented as RDDs)
• Supports PageRank, connected components, triangle counting, …
Spark: GraphX
val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices

val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}

val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
Spark: MLlib
• Machine learning library similar to Apache Mahout
• Supports statistics, regression, decision trees, clustering, PCA, gradient descent, …
• Iterative algorithms much faster due to in-memory processing
Spark: MLlib
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

val model = LinearRegressionWithSGD.train(parsedData, 100)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val MSE = valuesAndPreds
  .map { case (v, p) => math.pow((v - p), 2) }.mean()
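A short hedged usage note: once trained, the model scores new feature vectors directly; the vector values below are made up and must match the training dimensionality.

import org.apache.spark.mllib.linalg.Vectors

// Made-up feature vector; the number of features must match the training data
val prediction = model.predict(Vectors.dense(0.5, 1.2, -0.3))
println(s"prediction = $prediction, training MSE = $MSE")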
Agenda
• Why?
• How?
• What else?
• Who?
• Future?
Use Case: Yahoo Native Ads
Logistic regression algorithm
• 120 LOC in Spark/Scala
• 30 min. for model creation on 100M samples and 13K features
Initial version launched within 2 hours after the Spark-on-YARN announcement
• Compared to several days for hardware acquisition, system setup and data movement
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Use Case: Yahoo Mobile Ads
Learn from mobile search ad click data
• 600M labeled examples on HDFS
• 100M sparse features
Spark programs for Gradient Boosted Decision Trees
• 6 hours for model training with 100 workers
• Model accuracy very close to heavily manually tuned Logistic Regression models
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Agenda
• Why?
• How?
• What else?
• Who?
• Future?
Spark-on-YARN (Current)
Hadoop 2: Spark as a YARN application
[Stack diagram]
• HDFS: Redundant, reliable storage
• YARN: Cluster resource management
• On YARN: Tez Execution Engine (Pig, Hive, …), MapReduce Execution Engine, Spark (In-Memory), Storm (Streaming), …
Spark-on-YARN (Future)
Hadoop 2: Spark as an execution engine
[Stack diagram]
• HDFS: Redundant, reliable storage
• YARN: Cluster resource management, long-running services via Slider
• Execution engines on YARN: MapReduce, Tez, Spark, shared by Pig, Hive, Mahout, …, plus Storm (Streaming), …
Spark: Future work
• Spark Core
  • Focus on maturity, optimization & pluggability
  • Enable long-running services (Slider)
  • Give resources back to the cluster when idle
  • Integrate with Hadoop enhancements
    • Timeline Server
    • ORC file format
• Spark ecosystem
  • Focus on adding capabilities
One more thing…
Let's get started with Spark!
Hortonworks Sandbox 2.2
http://hortonworks.com/hdp/downloads/
Hortonworks Sandbox 2.2
# 1. Download
wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz

# 2. Untar
tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz

# 3. Start Spark Shell
./bin/spark-shell
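Once the shell is up, the word-frequency example can be pasted in directly (a sketch; the input path is a placeholder and sc is provided by the shell):

// Inside ./bin/spark-shell: sc is already available
val docs = sc.textFile("hdfs://docs/")        // placeholder input path
val freq = docs.map(_.toLowerCase)
               .flatMap(_.split("\\s+"))
               .map(word => (word, 1))
               .reduceByKey(_ + _)
freq.map(_.swap).top(10).foreach(println)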
Thanks for listening
Twitter: @uweseiler
Mail: uwe.seiler@codecentric.de
XING: https://www.xing.com/profile/Uwe_Seiler
