SlideShare a Scribd company logo
1 of 74
Download to read offline
Why Spark Is 
the Next Top 
(Compute) 
Model 
Big 
Data 
Everywhere 
Chicago, 
Oct. 
1, 
2014 
@deanwampler 
polyglotprogramming.com/talks 
Tuesday, September 30, 14 
Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted. 
Image: Detail of the London Eye
Dean Wampler 
dean.wampler@typesafe.com 
polyglotprogramming.com/talks 
@deanwampler 
Tuesday, September 30, 14 
About me. You can find this presentation and others on Big Data and Scala at polyglotprogramming.com. 
Programming Scala, 2nd Edition is forthcoming. 
photo: Dusk at 30,000 ft above the Central Plains of the U.S. on a Winter’s Day.
Or 
this? 
Tuesday, September 30, 14 
photo: https://twitter.com/john_overholt/status/447431985750106112/photo/1
Hadoop 
circa 2013 
Tuesday, September 30, 14 
The state of Hadoop as of last year. 
Image: Detail of the London Eye
Hadoop v2.X Cluster 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
master 
Resource Mgr 
Name Node 
master 
Name Node 
Tuesday, September 30, 14 
Schematic view of a Hadoop 2 cluster. For a more precise definition of the services and what they do, see e.g., http://hadoop.apache.org/ 
docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html We aren’t interested in great details at this point, but we’ll call out a few useful things to 
know.
Resource and Node Managers 
Hadoop v2.X Cluster 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
master 
Resource Mgr 
Name Node 
master 
Name Node 
Tuesday, September 30, 14 
Hadoop 2 uses YARN to manage resources via the master Resource Manager, which includes a pluggable job scheduler and an Applications 
Manager. It coordinates with the Node Manager on each node to schedule jobs and provide resources. Other services involved, including 
application-specific Containers and Application Masters are not shown.
Name Node and Data Nodes 
Hadoop v2.X Cluster 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
node 
Node Mgr 
Data Node 
DiskDiskDiskDiskDisk 
master 
Resource Mgr 
Name Node 
master 
Name Node 
Tuesday, September 30, 14 
Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system 
operations and resiliency when service instances fail. The data node services manage individual blocks for files.
MapReduce 
The classic compute model 
for Hadoop 
Tuesday, September 30, 14 
Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system 
operations and resiliency when service instances fail. The data node services manage individual blocks for files.
MapReduce 
1 map step + 1 reduce step 
(wash, rinse, repeat) 
Tuesday, September 30, 14 
You get 1 map step (although there is limited support for chaining mappers) and 1 reduce step. If you can’t implement an algorithm in these 
two steps, you can chain jobs together, but you’ll pay a tax of flushing the entire data set to disk between these jobs.
MapReduce 
Example: 
Inverted Index 
Tuesday, September 30, 14 
The inverted index is a classic algorithm needed for building search engines.
Tuesday, September 30, 14 
Before running MapReduce, crawl teh interwebs, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS 
or one of the more “sophisticated” formats.
Tuesday, September 30, 14 
Now we’re running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text and 
outputs key-value pairs (“tuples”)...
Map Task 
(hadoop,(wikipedia.org/hadoop,1)) 
(provides,(wikipedia.org/hadoop,1)) 
(mapreduce,(wikipedia.org/hadoop, 1)) 
(and,(wikipedia.org/hadoop,1)) 
(hdfs,(wikipedia.org/hadoop, 1)) 
Tuesday, September 30, 14 
... the keys are each word found and the values are themselves tuples, each URL and the count of the word. In our simplified example, there 
are typically only single occurrences of each work in each document. The real occurrences are interesting because if a word is mentioned a 
lot in a document, the chances are higher that you would want to find that document in a search for that word.
Tuesday, September 30, 14
Tuesday, September 30, 14 
The output tuples are sorted by key locally in each map task, then “shuffled” over the cluster network to reduce tasks (each a JVM process, 
too), where we want all occurrences of a given key to land on the same reduce task.
Tuesday, September 30, 14 
Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of 
(URL-count) tuples (pairs).
Altogether... 
Tuesday, September 30, 14 
Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of 
(URL-count) tuples (pairs).
What’s 
not to like? Tuesday, September 30, 14 
This seems okay, right? What’s wrong with it?
Awkward 
Most algorithms are 
much harder to implement 
in this restrictive 
map-then-reduce model. 
Tuesday, September 30, 14 
Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/ 
MapReduceAlgorithms/.
Awkward 
Lack of flexibility inhibits 
optimizations, too. 
Tuesday, September 30, 14 
The inflexible compute model leads to complex code to improve performance where hacks are used to work around the limitations. Hence, 
optimizations are hard to implement. The Spark team has commented on this, see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating- 
structured-data-using-Spark.html
Performance 
Full dump to disk 
between jobs. 
Tuesday, September 30, 14 
Sequencing jobs wouldn’t be so bad if the “system” was smart enough to cache data in memory. Instead, each job dumps everything to disk, 
then the next job reads it back in again. This makes iterative algorithms particularly painful.
Enter 
Spark 
spark.apache.org 
Tuesday, September 30, 14
Cluster Computing 
Can be run in: 
•YARN (Hadoop 2) 
•Mesos (Cluster management) 
•EC2 
•Standalone mode 
Tuesday, September 30, 14 
If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters, 
but there you have manually allocate resources to the embedded Spark cluster vs Hadoop.) Mesos is a general-purpose cluster resource 
manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means 
you run Spark’s built-in support for clustering (or run locally on a single box - e.g., for development). EC2 deployments are usually 
standalone...
Compute Model 
Fine-grained “combinators” 
for composing algorithms. 
Tuesday, September 30, 14 
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little 
code.
Compute Model 
RDDs: 
Resilient, 
Distributed 
Datasets 
Tuesday, September 30, 14 
RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which 
means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce 
jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
Compute Model 
Tuesday, September 30, 14 
RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which 
means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce 
jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
Compute Model 
Written in Scala, 
with Java and Python APIs. 
Tuesday, September 30, 14 
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little 
code.
Inverted Index 
in MapReduce 
(Java). 
Tuesday, September 30, 14 
Let’s 
see 
an 
an 
actual 
implementaEon 
of 
the 
inverted 
index. 
First, 
a 
Hadoop 
MapReduce 
(Java) 
version, 
adapted 
from 
hOps:// 
developer.yahoo.com/hadoop/tutorial/module4.html#soluEon 
It’s 
about 
90 
lines 
of 
code, 
but 
I 
reformaOed 
to 
fit 
beOer. 
This 
is 
also 
a 
slightly 
simpler 
version 
that 
the 
one 
I 
diagrammed. 
It 
doesn’t 
record 
a 
count 
of 
each 
word 
in 
a 
document; 
it 
just 
writes 
(word,doc-­‐Etle) 
pairs 
out 
of 
the 
mappers 
and 
the 
final 
(word,list) 
output 
by 
the 
reducers 
just 
has 
a 
list 
of 
documentaEons, 
hence 
repeats. 
A 
second 
job 
would 
be 
necessary 
to 
count 
the 
repeats.
import java.io.IOException; 
import java.util.*; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapred.*; 
public class LineIndexer { 
public static void main(String[] args) { 
JobClient client = new JobClient(); 
JobConf conf = 
new JobConf(LineIndexer.class); 
conf.setJobName("LineIndexer"); 
conf.setOutputKeyClass(Text.class); 
conf.setOutputValueClass(Text.class); 
FileInputFormat.addInputPath(conf, 
new Path("input")); 
FileOutputFormat.setOutputPath(conf, 
Tuesday, September 30, 14 
I’ve 
shortened 
the 
original 
code 
a 
bit, 
e.g., 
using 
* 
import 
statements 
instead 
of 
separate 
imports 
for 
each 
class. 
I’m 
not 
going 
to 
explain 
every 
line 
... 
nor 
most 
lines. 
Everything 
is 
in 
one 
outer 
class. 
We 
start 
with 
a 
main 
rouEne 
that 
sets 
up 
the 
job. 
LoOa 
boilerplate... 
I 
used 
yellow 
for 
method 
calls, 
because 
methods 
do 
the 
real 
work!! 
But 
noEce 
that 
the 
funcEons 
in 
this 
code 
don’t 
really 
do 
a 
whole 
lot...
JobClient client = new JobClient(); 
JobConf conf = 
new JobConf(LineIndexer.class); 
conf.setJobName("LineIndexer"); 
conf.setOutputKeyClass(Text.class); 
conf.setOutputValueClass(Text.class); 
FileInputFormat.addInputPath(conf, 
new Path("input")); 
FileOutputFormat.setOutputPath(conf, 
new Path("output")); 
conf.setMapperClass( 
LineIndexMapper.class); 
conf.setReducerClass( 
LineIndexReducer.class); 
client.setConf(conf); 
try { 
JobClient.runJob(conf); 
} catch (Exception e) { 
Tuesday, September 30, 14 
boilerplate...
LineIndexMapper.class); 
conf.setReducerClass( 
LineIndexReducer.class); 
client.setConf(conf); 
try { 
JobClient.runJob(conf); 
} catch (Exception e) { 
e.printStackTrace(); 
} 
} 
public static class LineIndexMapper 
extends MapReduceBase 
implements Mapper<LongWritable, Text, 
Text, Text> { 
private final static Text word = 
new Text(); 
private final static Text location = 
new Text(); 
Tuesday, September 30, 14 
main ends with a try-catch clause to run the 
job.
public static class LineIndexMapper 
extends MapReduceBase 
implements Mapper<LongWritable, Text, 
Text, Text> { 
private final static Text word = 
new Text(); 
private final static Text location = 
new Text(); 
public void map( 
LongWritable key, Text val, 
OutputCollector<Text, Text> output, 
Reporter reporter) throws IOException { 
FileSplit fileSplit = 
(FileSplit)reporter.getInputSplit(); 
String fileName = 
fileSplit.getPath().getName(); 
location.set(fileName); 
Tuesday, September 30, 14 
This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) 
tuples.
Reporter reporter) throws IOException { 
FileSplit fileSplit = 
(FileSplit)reporter.getInputSplit(); 
String fileName = 
fileSplit.getPath().getName(); 
location.set(fileName); 
String line = val.toString(); 
StringTokenizer itr = new 
StringTokenizer(line.toLowerCase()); 
while (itr.hasMoreTokens()) { 
word.set(itr.nextToken()); 
output.collect(word, location); 
} 
} 
} 
public static class LineIndexReducer 
extends MapReduceBase 
implements Reducer<Text, Text, 
Text, Text> { 
Tuesday, September 30, 14 
The rest of the LineIndexMapper class and map 
method.
public static class LineIndexReducer 
extends MapReduceBase 
implements Reducer<Text, Text, 
Text, Text> { 
public void reduce(Text key, 
Iterator<Text> values, 
OutputCollector<Text, Text> output, 
Reporter reporter) throws IOException { 
boolean first = true; 
StringBuilder toReturn = 
new StringBuilder(); 
while (values.hasNext()) { 
if (!first) 
toReturn.append(", "); 
first=false; 
toReturn.append( 
values.next().toString()); 
} 
output.collect(key, 
new Text(toReturn.toString())); 
} 
Tuesday, September 30, 14 
The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is 
stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
Reporter reporter) throws IOException { 
boolean first = true; 
StringBuilder toReturn = 
new StringBuilder(); 
while (values.hasNext()) { 
if (!first) 
toReturn.append(", "); 
first=false; 
toReturn.append( 
values.next().toString()); 
} 
output.collect(key, 
new Text(toReturn.toString())); 
} 
} 
} 
Tuesday, September 30, 14 
EOF
Altogether 
import java.io.IOException; 
import java.util.*; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapred.*; 
public class LineIndexer { 
public static void main(String[] args) { 
JobClient client = new JobClient(); 
JobConf conf = 
new JobConf(LineIndexer.class); 
conf.setJobName("LineIndexer"); 
conf.setOutputKeyClass(Text.class); 
conf.setOutputValueClass(Text.class); 
FileInputFormat.addInputPath(conf, 
new Path("input")); 
FileOutputFormat.setOutputPath(conf, 
new Path("output")); 
conf.setMapperClass( 
LineIndexMapper.class); 
conf.setReducerClass( 
LineIndexReducer.class); 
client.setConf(conf); 
try { 
JobClient.runJob(conf); 
} catch (Exception e) { 
e.printStackTrace(); 
} 
} 
public static class LineIndexMapper 
extends MapReduceBase 
implements Mapper<LongWritable, Text, 
Text, Text> { 
private final static Text word = 
new Text(); 
private final static Text location = 
new Text(); 
public void map( 
LongWritable key, Text val, 
OutputCollector<Text, Text> output, 
Reporter reporter) throws IOException { 
FileSplit fileSplit = 
(FileSplit)reporter.getInputSplit(); 
String fileName = 
fileSplit.getPath().getName(); 
location.set(fileName); 
String line = val.toString(); 
StringTokenizer itr = new 
StringTokenizer(line.toLowerCase()); 
while (itr.hasMoreTokens()) { 
word.set(itr.nextToken()); 
output.collect(word, location); 
} 
} 
} 
public static class LineIndexReducer 
extends MapReduceBase 
implements Reducer<Text, Text, 
Text, Text> { 
public void reduce(Text key, 
Iterator<Text> values, 
OutputCollector<Text, Text> output, 
Reporter reporter) throws IOException { 
boolean first = true; 
StringBuilder toReturn = 
new StringBuilder(); 
while (values.hasNext()) { 
if (!first) 
toReturn.append(", "); 
first=false; 
toReturn.append( 
values.next().toString()); 
} 
output.collect(key, 
new Text(toReturn.toString())); 
} 
} 
} 
Tuesday, September 30, 14 
The 
whole 
shebang 
(6pt. 
font)
Inverted Index 
in Spark 
(Scala). 
Tuesday, September 30, 14 
This 
code 
is 
approximately 
45 
lines, 
but 
it 
does 
more 
than 
the 
previous 
Java 
example, 
it 
implements 
the 
original 
inverted 
index 
algorithm 
I 
diagrammed 
where 
word 
counts 
are 
computed 
and 
included 
in 
the 
data.
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
object InvertedIndex { 
def main(args: Array[String]) = { 
val sc = new SparkContext( 
"local", "Inverted Index") 
sc.textFile("data/crawl") 
.map { line => 
val array = line.split("t", 2) 
(array(0), array(1)) 
} 
.flatMap { 
case (path, text) => 
text.split("""W+""") map { 
word => (word, path) 
} 
Tuesday, September 30, 14 
It 
starts 
with 
imports, 
then 
declares 
a 
singleton 
object 
(a 
first-­‐class 
concept 
in 
Scala), 
with 
a 
main 
rouEne 
(as 
in 
Java). 
The 
methods 
are 
colored 
yellow 
again. 
Note 
this 
Eme 
how 
dense 
with 
meaning 
they 
are 
this 
Eme. 
}
val sc = new SparkContext( 
"local", "Inverted Index") 
sc.textFile("data/crawl") 
.map { line => 
val array = line.split("t", 2) 
(array(0), array(1)) 
} 
.flatMap { 
case (path, text) => 
text.split("""W+""") map { 
word => (word, path) 
} 
} 
.map { 
case (w, p) => ((w, p), 1) 
} 
.reduceByKey { 
case (n1, n2) => n1 + n2 
} 
.map { 
Tuesday, September 30, 14 
You 
being 
the 
workflow 
by 
declaring 
a 
SparkContext 
(in 
“local” 
mode, 
in 
this 
case). 
The 
rest 
of 
the 
program 
is 
a 
sequence 
of 
funcEon 
calls, 
analogous 
to 
“pipes” 
we 
connect 
together 
to 
perform 
the 
data 
flow. 
Next 
we 
read 
one 
or 
more 
text 
files. 
If 
“data/crawl” 
has 
1 
or 
more 
Hadoop-­‐style 
“part-­‐NNNNN” 
files, 
Spark 
will 
process 
all 
of 
them 
(in 
parallel 
if 
running 
a 
distributed 
configuraEon; 
they 
will 
be 
processed 
synchronously 
in 
local 
mode).
sc.textFile("data/crawl") 
.map { line => 
val array = line.split("t", 2) 
(array(0), array(1)) 
} 
.flatMap { 
case (path, text) => 
text.split("""W+""") map { 
word => (word, path) 
} 
} 
.map { 
case (w, p) => ((w, p), 1) 
} 
.reduceByKey { 
case (n1, n2) => n1 + n2 
} 
.map { 
case ((w, p), n) => (w, (p, n)) 
} 
.groupBy { 
case (w, (p, n)) => w 
Tuesday, September 30, 14 
sc.textFile 
returns 
an 
RDD 
with 
a 
string 
for 
each 
line 
of 
input 
text. 
So, 
the 
first 
thing 
we 
do 
is 
map 
over 
these 
strings 
to 
extract 
the 
original 
document 
id 
(i.e., 
file 
name), 
followed 
by 
the 
text 
in 
the 
document, 
all 
on 
one 
line. 
We 
assume 
tab 
is 
the 
separator. 
“(array(0), 
array(1))” 
returns 
a 
two-­‐element 
“tuple”. 
Think 
of 
the 
output 
RDD 
has 
having 
a 
schema 
“String 
fileName, 
String 
text”.
} 
.flatMap { 
case (path, text) => 
text.split("""W+""") map { 
word => (word, path) 
} 
Beautiful, 
} 
.map { 
case (w, p) => ((w, p), 1) 
} 
.reduceByKey { 
case (n1, n2) => n1 + n2 
} 
.map { 
case ((w, p), n) => (w, (p, n)) 
} 
.groupBy { 
case (w, (p, n)) => w 
} 
.map { 
case (w, seq) => 
no? 
Tuesday, September 30, 14 
flatMap 
maps 
over 
each 
of 
these 
2-­‐element 
tuples. 
We 
split 
the 
text 
into 
words 
on 
non-­‐alphanumeric 
characters, 
then 
output 
collecEons 
of 
word 
(our 
ulEmate, 
final 
“key”) 
and 
the 
path. 
Each 
line 
is 
converted 
to 
a 
collecEon 
of 
(word,path) 
pairs, 
so 
flatMap 
converts 
the 
collecEon 
of 
collecEons 
into 
one 
long 
“flat” 
collecEon 
of 
(word,path) 
pairs. 
Then 
we 
map 
over 
these 
pairs 
and 
add 
a 
single 
count 
of 
1.
Tuesday, September 30, 14 
Another 
example 
of 
a 
beauEful 
and 
profound 
DSL, 
in 
this 
case 
from 
the 
world 
of 
Physics: 
Maxwell’s 
equaEons: 
hOp://upload.wikimedia.org/ 
wikipedia/commons/c/c4/Maxwell'sEquaEons.svg
case (w, p) => ((w, p), 1) 
} 
.reduceByKey { 
case (n1, n2) => n1 + n2 
} 
.map { 
case ((w, p), n) => (w, (p, n)) 
} 
.groupBy { 
case (w, (p, n)) => w 
} 
.map { 
case (w, seq) => 
(word1, 
(path1, 
n1)) 
(word2, 
(path2, 
n2)) 
... 
val seq2 = seq map { 
case (_, (p, n)) => (p, n) 
} 
(w, seq2.mkString(", ")) 
} 
.saveAsTextFile(argz.outpath) 
sc.stop() 
} 
Tuesday, September 30, 14 
reduceByKey 
does 
an 
implicit 
“group 
by” 
to 
bring 
together 
all 
occurrences 
of 
the 
same 
(word, 
path) 
and 
then 
sums 
up 
their 
counts. 
Note 
the 
input 
to 
the 
next 
map 
is 
now 
((word, 
path), 
n), 
where 
n 
is 
now 
>= 
1. 
We 
transform 
these 
tuples 
into 
the 
form 
we 
actually 
want, 
(word, 
(path, 
n)).
} 
.groupBy { 
case (w, (p, n)) => w 
} 
.map { 
case (w, seq) => 
(word, 
seq((word, 
(path1, 
n1)), 
(word, 
(path2, 
n2)), 
...)) 
val seq2 = seq map { 
case (_, (p, n)) => (p, n) 
} 
(w, seq2.mkString(", ")) 
} 
.saveAsTextFile(argz.outpath) 
sc.stop() 
} 
} 
(word, 
“(path1, 
n1), 
(path2, 
n2), 
...”) 
Tuesday, September 30, 14 
Now 
we 
do 
an 
explicit 
group 
by 
to 
bring 
all 
the 
same 
words 
together. 
The 
output 
will 
be 
(word, 
(word, 
(path1, 
n1)), 
(word, 
(path2, 
n2)), 
...). 
The 
last 
map 
removes 
the 
redundant 
“word” 
values 
in 
the 
sequences 
of 
the 
previous 
output. 
It 
outputs 
the 
sequence 
as 
a 
final 
string 
of 
comma-­‐ 
separated 
(path,n) 
pairs. 
We 
finish 
by 
saving 
the 
output 
as 
text 
file(s) 
and 
stopping 
the 
workflow.
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
object InvertedIndex { 
def main(args: Array[String]) = { 
val sc = new SparkContext( 
"local", "Inverted Index") 
sc.textFile("data/crawl") 
.map { line => 
val array = line.split("t", 2) 
(array(0), array(1)) 
} 
.flatMap { 
case (path, text) => 
text.split("""W+""") map { 
word => (word, path) 
} 
} 
.map { 
case (w, p) => ((w, p), 1) 
} 
.reduceByKey { 
case (n1, n2) => n1 + n2 
} 
.map { 
case ((w, p), n) => (w, (p, n)) 
} 
.groupBy { 
case (w, (p, n)) => w 
} 
.map { 
case (w, seq) => 
val seq2 = seq map { 
case (_, (p, n)) => (p, n) 
} 
(w, seq2.mkString(", ")) 
} 
.saveAsTextFile(argz.outpath) 
sc.stop() 
} 
} 
Altogether 
Tuesday, September 30, 14 
The 
whole 
shebang 
(12pt. 
font, 
this 
Eme)
sc.textFile("data/crawl") 
.map { line => 
val array = line.split("t", 2) 
(array(0), array(1)) 
} 
.flatMap { 
case (path, text) => 
text.split("""W+""") map { 
word => (word, path) 
} 
Concise 
Combinators! 
} 
.map { 
case (w, p) => ((w, p), 1) 
} 
.reduceByKey{ 
case (n1, n2) => n1 + n2 
} 
.map { 
case ((w, p), n) => (w, (p, n)) 
} 
.groupBy { 
Tuesday, September 30, 14 
I’ve 
shortened 
the 
original 
code 
a 
bit, 
e.g., 
using 
* 
import 
statements 
instead 
of 
separate 
imports 
for 
each 
class. 
I’m 
not 
going 
to 
explain 
every 
line 
... 
nor 
most 
lines. 
Everything 
is 
in 
one 
outer 
class. 
We 
start 
with 
a 
main 
rouEne 
that 
sets 
up 
the 
job. 
LoOa 
boilerplate...
The Spark version 
took me ~30 minutes 
to write. 
Tuesday, September 30, 14 
Once 
you 
learn 
the 
core 
primiEves 
I 
used, 
and 
a 
few 
tricks 
for 
manipulaEng 
the 
RDD 
tuples, 
you 
can 
very 
quickly 
build 
complex 
algorithms 
for 
data 
processing! 
The 
Spark 
API 
allowed 
us 
to 
focus 
almost 
exclusively 
on 
the 
“domain” 
of 
data 
transformaEons, 
while 
the 
Java 
MapReduce 
version 
(which 
does 
less), 
forced 
tedious 
aOenEon 
to 
infrastructure 
mechanics.
Use a SQL query when 
you can!! 
Tuesday, September 30, 14
Spark SQL! 
Mix SQL queries with 
the RDD API. 
Tuesday, September 30, 14 
Use the best tool for a particular 
problem.
Spark SQL! 
Create, Read, and Delete 
Hive Tables 
Tuesday, September 30, 14 
Interoperate with Hive, the original Hadoop SQL tool.
Spark SQL! 
Read JSON and 
Infer the Schema 
Tuesday, September 30, 14 
Read strings that are JSON records, infer the schema on the fly. Also, write RDD records as 
JSON.
SparkSQL 
~10-100x the performance of 
Hive, due to in-memory 
caching of RDDs & better 
Spark abstractions. 
Tuesday, September 30, 14
Combine SparkSQL 
queries with Machine 
Learning code. 
Tuesday, September 30, 14 
We’ll 
use 
the 
Spark 
“MLlib” 
in 
the 
example, 
then 
return 
to 
it 
in 
a 
moment.
CREATE TABLE Users( 
userId INT, 
name STRING, 
email STRING, 
age INT, 
latitude DOUBLE, 
longitude DOUBLE, 
subscribed BOOLEAN); 
CREATE TABLE Events( 
userId INT, 
action INT); 
HiveQL 
Schemas 
definiEons. 
Tuesday, September 30, 14 
This 
example 
adapted 
from 
the 
following 
blog 
post 
announcing 
Spark 
SQL: 
hOp://databricks.com/blog/2014/03/26/Spark-­‐SQL-­‐manipulaEng-­‐structured-­‐data-­‐using-­‐Spark.html 
Assume 
we 
have 
these 
Hive/Shark 
tables, 
with 
data 
about 
our 
users 
and 
events 
that 
have 
occurred.
val trainingDataTable = sql(""" 
SELECT e.action, u.age, 
u.latitude, u.longitude 
FROM Users u 
JOIN Events e 
ON u.userId = e.userId""") 
val trainingData = 
trainingDataTable map { row => 
val features = 
Array[Double](row(1), row(2), row(3)) 
LabeledPoint(row(0), features) 
} 
val model = 
new LogisticRegressionWithSGD() 
.run(trainingData) 
val allCandidates = sql(""" 
SELECT userId, age, latitude, longitude 
Tuesday, September 30, 14 
Here 
is 
some 
Spark 
(Scala) 
code 
with 
an 
embedded 
SQL 
query 
that 
joins 
the 
Users 
and 
Events 
tables. 
The 
“””...””” 
string 
allows 
embedded 
line 
feeds. 
The 
“sql” 
funcEon 
returns 
an 
RDD, 
which 
we 
then 
map 
over 
to 
create 
LabeledPoints, 
an 
object 
used 
in 
Spark’s 
MLlib 
(machine 
learning 
library) 
for 
a 
recommendaEon 
engine. 
The 
“label” 
is 
the 
kind 
of 
event 
and 
the 
user’s 
age 
and 
lat/long 
coordinates 
are 
the 
“features” 
used 
for 
making 
recommendaEons. 
(E.g., 
if 
you’re 
25 
and 
near 
a 
certain 
locaEon 
in 
the 
city, 
you 
might 
be 
interested 
a 
nightclub 
near 
by...)
} 
val model = 
new LogisticRegressionWithSGD() 
.run(trainingData) 
val allCandidates = sql(""" 
SELECT userId, age, latitude, longitude 
FROM Users 
WHERE subscribed = FALSE""") 
case class Score( 
userId: Int, score: Double) 
val scores = allCandidates map { row => 
val features = 
Array[Double](row(1), row(2), row(3)) 
Score(row(0), model.predict(features)) 
} 
// Hive table 
scores.registerAsTable("Scores") 
Tuesday, September 30, 14 
Next 
we 
train 
the 
recommendaEon 
engine, 
using 
a 
“logisEc 
regression” 
fit 
to 
the 
training 
data, 
where 
“stochasEc 
gradient 
descent” 
(SGD) 
is 
used 
to 
train 
it. 
(This 
is 
a 
standard 
tool 
set 
for 
recommendaEon 
engines; 
see 
for 
example: 
hOp://www.cs.cmu.edu/~wcohen/10-­‐605/assignments/ 
sgd.pdf)
.run(trainingData) 
val allCandidates = sql(""" 
SELECT userId, age, latitude, longitude 
FROM Users 
WHERE subscribed = FALSE""") 
case class Score( 
userId: Int, score: Double) 
val scores = allCandidates map { row => 
val features = 
Array[Double](row(1), row(2), row(3)) 
Score(row(0), model.predict(features)) 
} 
// Hive table 
scores.registerAsTable("Scores") 
val topCandidates = sql(""" 
SELECT u.name, u.email 
FROM Scores s 
JOIN Users u ON s.userId = u.userId 
Tuesday, September 30, 14 
Now 
run 
a 
query 
against 
all 
users 
who 
aren’t 
already 
subscribed 
to 
noEficaEons.
case class Score( 
userId: Int, score: Double) 
val scores = allCandidates map { row => 
val features = 
Array[Double](row(1), row(2), row(3)) 
Score(row(0), model.predict(features)) 
} 
// Hive table 
scores.registerAsTable("Scores") 
val topCandidates = sql(""" 
SELECT u.name, u.email 
FROM Scores s 
JOIN Users u ON s.userId = u.userId 
ORDER BY score DESC 
LIMIT 100""") Tuesday, September 30, 14 
Declare 
a 
class 
to 
hold 
each 
user’s 
“score” 
as 
produced 
by 
the 
recommendaEon 
engine 
and 
map 
the 
“all” 
query 
results 
to 
Scores. 
Then 
“register” 
the 
scores 
RDD 
as 
a 
“Scores” 
table 
in 
Hive’s 
metadata 
respository. 
This 
is 
equivalent 
to 
running 
a 
“CREATE 
TABLE 
Scores 
...” 
command 
at 
the 
Hive/Shark 
prompt!
} 
// Hive table 
scores.registerAsTable("Scores") 
val topCandidates = sql(""" 
SELECT u.name, u.email 
FROM Scores s 
JOIN Users u ON s.userId = u.userId 
ORDER BY score DESC 
LIMIT 100""") 
Tuesday, September 30, 14 
Finally, 
run 
a 
new 
query 
to 
find 
the 
people 
with 
the 
highest 
scores 
that 
aren’t 
already 
subscribing 
to 
noEficaEons. 
You 
might 
send 
them 
an 
email 
next 
recommending 
that 
they 
subscribe...
Event Stream 
Processing 
Tuesday, September 30, 14
Spark Streaming 
Use the same abstractions 
for near real-time, 
event streaming. 
Tuesday, September 30, 14 
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little 
code.
Tuesday, September 30, 14 
A DSTream (discretized stream) wraps the RDDs for each “batch” of events. You can specify the granularity, such as all events in 1 second 
batches, then your Spark job is passed each batch of data for processing. You can also work with moving windows of batches.
Very similar code... 
Tuesday, September 30, 14
val sc = new SparkContext(...) 
val ssc = new StreamingContext( 
sc, Seconds(1)) 
// A DStream that will listen to server:port 
val lines = 
ssc.socketTextStream(server, port) 
// Word Count... 
val words = lines flatMap { 
line => line.split("""W+""") 
} 
val pairs = words map (word => (word, 1)) 
val wordCounts = 
pairs reduceByKey ((n1, n2) => n1 + n2) 
wordCount.print() // print a few counts... 
ssc.start() 
Tuesday, September 30, 14 
This 
example 
adapted 
from 
the 
following 
page 
on 
the 
Spark 
website: 
hOp://spark.apache.org/docs/0.9.0/streaming-­‐programming-­‐guide.html#a-­‐quick-­‐example 
We 
create 
a 
StreamingContext 
that 
wraps 
a 
SparkContext 
(there 
are 
alternaEve 
ways 
to 
construct 
it...). 
It 
will 
“clump” 
the 
events 
into 
1-­‐second 
intervals. 
Next 
we 
setup 
a 
socket 
to 
stream 
text 
to 
us 
from 
another 
server 
and 
port 
(one 
of 
several 
ways 
to 
ingest 
data).
ssc.socketTextStream(server, port) 
// Word Count... 
val words = lines flatMap { 
line => line.split("""W+""") 
} 
val pairs = words map (word => (word, 1)) 
val wordCounts = 
pairs reduceByKey ((n1, n2) => n1 + n2) 
wordCount.print() // print a few counts... 
ssc.start() 
ssc.awaitTermination() 
Tuesday, September 30, 14 
Now 
the 
“word 
count” 
happens 
over 
each 
interval 
(aggregaEon 
across 
intervals 
is 
also 
possible), 
but 
otherwise 
it 
works 
like 
before. 
Once 
we 
setup 
the 
flow, 
we 
start 
it 
and 
wait 
for 
it 
to 
terminate 
through 
some 
means, 
such 
as 
the 
server 
socket 
closing.
Machine Learning 
Library 
Tuesday, September 30, 14
MLlib 
•Linear regression 
•Binary classification 
•Collaborative filtering 
•Clustering 
•Others... 
Tuesday, September 30, 14 
Not as full-featured as more mature toolkits, but the Mahout project has announced they are going to port their algorithms to Spark, which 
include powerful Mathematics, e.g., Matrix support libraries.
Distributed Graph 
Computing 
Tuesday, September 30, 14 
Some problems are more naturally represented as graphs. 
Extends RDDs to support property graphs with directed edges.
Graphs 
•Examples: 
•Facebook friend networks 
•Twitter followers 
•Page Rank 
Tuesday, September 30, 14
Graphs 
•Graph algorithms 
•Friends of friends 
•Minimum spanning tree 
•Cliques 
Tuesday, September 30, 14
Recap 
Tuesday, September 30, 14
Tuesday, September 30, 14 
Why is Spark so good (and Java MapReduce so bad)? Because fundamentally, data analytics is Mathematics and programming tools inspired 
by Mathematics - like Functional Programming - are ideal tools for working with data. This is why Spark code is so concise, yet powerful. This 
is why it is a great platform for performance optimizations. This is why Spark is a great platform for higher-level tools, like SQL, graphs, etc. 
Interest in FP started growing ~10 years ago as a tool to attack concurrency. I believe that data is now driving FP adoption even faster. I know 
many Java shops that switched to Scala when they adopted tools like Spark and Scalding (https://github.com/twitter/scalding).
Spark 
A flexible, scalable distributed 
compute platform with 
concise, powerful APIs and 
higher-order tools. 
spark.apache.org 
Tuesday, September 30, 14
Why Spark Is 
the Next Top 
(Compute) 
Model 
@deanwampler 
polyglotprogramming.com/talks 
Tuesday, September 30, 14 
Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted. 
Image: The London Eye on one side of the Thames, Parliament on the other.

More Related Content

What's hot

Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...
Lê Anh Đạt
 

What's hot (20)

Artificial Intelligence -- Search Algorithms
Artificial Intelligence-- Search Algorithms Artificial Intelligence-- Search Algorithms
Artificial Intelligence -- Search Algorithms
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Python GUI
Python GUIPython GUI
Python GUI
 
AI_Session 10 Local search in continious space.pptx
AI_Session 10 Local search in continious space.pptxAI_Session 10 Local search in continious space.pptx
AI_Session 10 Local search in continious space.pptx
 
Operating Systems Process Scheduling Algorithms
Operating Systems   Process Scheduling AlgorithmsOperating Systems   Process Scheduling Algorithms
Operating Systems Process Scheduling Algorithms
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Introduction Artificial Intelligence a modern approach by Russel and Norvig 1
Introduction Artificial Intelligence a modern approach by Russel and Norvig 1Introduction Artificial Intelligence a modern approach by Russel and Norvig 1
Introduction Artificial Intelligence a modern approach by Russel and Norvig 1
 
Stuart russell and peter norvig artificial intelligence - a modern approach...
Stuart russell and peter norvig   artificial intelligence - a modern approach...Stuart russell and peter norvig   artificial intelligence - a modern approach...
Stuart russell and peter norvig artificial intelligence - a modern approach...
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Cpu scheduling
Cpu schedulingCpu scheduling
Cpu scheduling
 
Semantic Networks
Semantic NetworksSemantic Networks
Semantic Networks
 
Problem reduction AND OR GRAPH & AO* algorithm.ppt
Problem reduction AND OR GRAPH & AO* algorithm.pptProblem reduction AND OR GRAPH & AO* algorithm.ppt
Problem reduction AND OR GRAPH & AO* algorithm.ppt
 
Servlets
ServletsServlets
Servlets
 
Prolog Programming : Basics
Prolog Programming : BasicsProlog Programming : Basics
Prolog Programming : Basics
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 

Viewers also liked

"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
Lucidworks
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
confluent
 

Viewers also liked (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute model
 
MapReduce and Its Discontents
MapReduce and Its DiscontentsMapReduce and Its Discontents
MapReduce and Its Discontents
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Scala dreaded underscore
Scala dreaded underscoreScala dreaded underscore
Scala dreaded underscore
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
 
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Reactive design: languages, and paradigms
Reactive design: languages, and paradigmsReactive design: languages, and paradigms
Reactive design: languages, and paradigms
 
Cracking the Interview Skills (Coding, Soft Skills, Product Management) Handouts
Cracking the Interview Skills (Coding, Soft Skills, Product Management) HandoutsCracking the Interview Skills (Coding, Soft Skills, Product Management) Handouts
Cracking the Interview Skills (Coding, Soft Skills, Product Management) Handouts
 
Promcon2016
Promcon2016Promcon2016
Promcon2016
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Latent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with SparkLatent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with Spark
 
Apache Spark チュートリアル
Apache Spark チュートリアルApache Spark チュートリアル
Apache Spark チュートリアル
 
Asynchronous API in Java8, how to use CompletableFuture
Asynchronous API in Java8, how to use CompletableFutureAsynchronous API in Java8, how to use CompletableFuture
Asynchronous API in Java8, how to use CompletableFuture
 
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ...
 
Monitoring Kafka w/ Prometheus
Monitoring Kafka w/ PrometheusMonitoring Kafka w/ Prometheus
Monitoring Kafka w/ Prometheus
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams API
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 

Similar to Why Spark Is the Next Top (Compute) Model

Parallel Computing for Econometricians with Amazon Web Services
Parallel Computing for Econometricians with Amazon Web ServicesParallel Computing for Econometricians with Amazon Web Services
Parallel Computing for Econometricians with Amazon Web Services
stephenjbarr
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
pappupassindia
 

Similar to Why Spark Is the Next Top (Compute) Model (20)

Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Parallel Computing for Econometricians with Amazon Web Services
Parallel Computing for Econometricians with Amazon Web ServicesParallel Computing for Econometricians with Amazon Web Services
Parallel Computing for Econometricians with Amazon Web Services
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
lecture3.pdf
lecture3.pdflecture3.pdf
lecture3.pdf
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Why Spark over Hadoop?
Why Spark over Hadoop?Why Spark over Hadoop?
Why Spark over Hadoop?
 
Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 

Recently uploaded

如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
jk0tkvfv
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
yulianti213969
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
ppy8zfkfm
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
Amil baba
 

Recently uploaded (20)

如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 

Why Spark Is the Next Top (Compute) Model

  • 1. Why Spark Is the Next Top (Compute) Model Big Data Everywhere Chicago, Oct. 1, 2014 @deanwampler polyglotprogramming.com/talks Tuesday, September 30, 14 Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted. Image: Detail of the London Eye
  • 2. Dean Wampler dean.wampler@typesafe.com polyglotprogramming.com/talks @deanwampler Tuesday, September 30, 14 About me. You can find this presentation and others on Big Data and Scala at polyglotprogramming.com. Programming Scala, 2nd Edition is forthcoming. photo: Dusk at 30,000 ft above the Central Plains of the U.S. on a Winter’s Day.
  • 3. Or this? Tuesday, September 30, 14 photo: https://twitter.com/john_overholt/status/447431985750106112/photo/1
  • 4. Hadoop circa 2013 Tuesday, September 30, 14 The state of Hadoop as of last year. Image: Detail of the London Eye
  • 5. Hadoop v2.X Cluster node Node Mgr Data Node DiskDiskDiskDiskDisk node Node Mgr Data Node DiskDiskDiskDiskDisk node Node Mgr Data Node DiskDiskDiskDiskDisk master Resource Mgr Name Node master Name Node Tuesday, September 30, 14 Schematic view of a Hadoop 2 cluster. For a more precise definition of the services and what they do, see e.g., http://hadoop.apache.org/ docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html We aren’t interested in great details at this point, but we’ll call out a few useful things to know.
  • 6. Resource and Node Managers Hadoop v2.X Cluster node Node Mgr Data Node DiskDiskDiskDiskDisk node Node Mgr Data Node DiskDiskDiskDiskDisk node Node Mgr Data Node DiskDiskDiskDiskDisk master Resource Mgr Name Node master Name Node Tuesday, September 30, 14 Hadoop 2 uses YARN to manage resources via the master Resource Manager, which includes a pluggable job scheduler and an Applications Manager. It coordinates with the Node Manager on each node to schedule jobs and provide resources. Other services involved, including application-specific Containers and Application Masters are not shown.
  • 7. Name Node and Data Nodes Hadoop v2.X Cluster node Node Mgr Data Node DiskDiskDiskDiskDisk node Node Mgr Data Node DiskDiskDiskDiskDisk node Node Mgr Data Node DiskDiskDiskDiskDisk master Resource Mgr Name Node master Name Node Tuesday, September 30, 14 Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system operations and resiliency when service instances fail. The data node services manage individual blocks for files.
  • 8. MapReduce The classic compute model for Hadoop Tuesday, September 30, 14 Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system operations and resiliency when service instances fail. The data node services manage individual blocks for files.
  • 9. MapReduce 1 map step + 1 reduce step (wash, rinse, repeat) Tuesday, September 30, 14 You get 1 map step (although there is limited support for chaining mappers) and 1 reduce step. If you can’t implement an algorithm in these two steps, you can chain jobs together, but you’ll pay a tax of flushing the entire data set to disk between these jobs.
  • 10. MapReduce Example: Inverted Index Tuesday, September 30, 14 The inverted index is a classic algorithm needed for building search engines.
  • 11. Tuesday, September 30, 14 Before running MapReduce, crawl teh interwebs, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS or one of the more “sophisticated” formats.
  • 12. Tuesday, September 30, 14 Now we’re running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text and outputs key-value pairs (“tuples”)...
  • 13. Map Task (hadoop,(wikipedia.org/hadoop,1)) (provides,(wikipedia.org/hadoop,1)) (mapreduce,(wikipedia.org/hadoop, 1)) (and,(wikipedia.org/hadoop,1)) (hdfs,(wikipedia.org/hadoop, 1)) Tuesday, September 30, 14 ... the keys are each word found and the values are themselves tuples, each URL and the count of the word. In our simplified example, there are typically only single occurrences of each work in each document. The real occurrences are interesting because if a word is mentioned a lot in a document, the chances are higher that you would want to find that document in a search for that word.
  • 15. Tuesday, September 30, 14 The output tuples are sorted by key locally in each map task, then “shuffled” over the cluster network to reduce tasks (each a JVM process, too), where we want all occurrences of a given key to land on the same reduce task.
  • 16. Tuesday, September 30, 14 Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of (URL-count) tuples (pairs).
  • 17. Altogether... Tuesday, September 30, 14 Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of (URL-count) tuples (pairs).
  • 18. What’s not to like? Tuesday, September 30, 14 This seems okay, right? What’s wrong with it?
  • 19. Awkward Most algorithms are much harder to implement in this restrictive map-then-reduce model. Tuesday, September 30, 14 Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/ MapReduceAlgorithms/.
  • 20. Awkward Lack of flexibility inhibits optimizations, too. Tuesday, September 30, 14 The inflexible compute model leads to complex code to improve performance where hacks are used to work around the limitations. Hence, optimizations are hard to implement. The Spark team has commented on this, see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating- structured-data-using-Spark.html
  • 21. Performance Full dump to disk between jobs. Tuesday, September 30, 14 Sequencing jobs wouldn’t be so bad if the “system” was smart enough to cache data in memory. Instead, each job dumps everything to disk, then the next job reads it back in again. This makes iterative algorithms particularly painful.
  • 22. Enter Spark spark.apache.org Tuesday, September 30, 14
  • 23. Cluster Computing Can be run in: •YARN (Hadoop 2) •Mesos (Cluster management) •EC2 •Standalone mode Tuesday, September 30, 14 If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters, but there you have manually allocate resources to the embedded Spark cluster vs Hadoop.) Mesos is a general-purpose cluster resource manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means you run Spark’s built-in support for clustering (or run locally on a single box - e.g., for development). EC2 deployments are usually standalone...
  • 24. Compute Model Fine-grained “combinators” for composing algorithms. Tuesday, September 30, 14 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 25. Compute Model RDDs: Resilient, Distributed Datasets Tuesday, September 30, 14 RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
  • 26. Compute Model Tuesday, September 30, 14 RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
  • 27. Compute Model Written in Scala, with Java and Python APIs. Tuesday, September 30, 14 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 28. Inverted Index in MapReduce (Java). Tuesday, September 30, 14 Let’s see an an actual implementaEon of the inverted index. First, a Hadoop MapReduce (Java) version, adapted from hOps:// developer.yahoo.com/hadoop/tutorial/module4.html#soluEon It’s about 90 lines of code, but I reformaOed to fit beOer. This is also a slightly simpler version that the one I diagrammed. It doesn’t record a count of each word in a document; it just writes (word,doc-­‐Etle) pairs out of the mappers and the final (word,list) output by the reducers just has a list of documentaEons, hence repeats. A second job would be necessary to count the repeats.
  • 29. import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, Tuesday, September 30, 14 I’ve shortened the original code a bit, e.g., using * import statements instead of separate imports for each class. I’m not going to explain every line ... nor most lines. Everything is in one outer class. We start with a main rouEne that sets up the job. LoOa boilerplate... I used yellow for method calls, because methods do the real work!! But noEce that the funcEons in this code don’t really do a whole lot...
  • 30. JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { Tuesday, September 30, 14 boilerplate...
  • 31. LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); Tuesday, September 30, 14 main ends with a try-catch clause to run the job.
  • 32. public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); Tuesday, September 30, 14 This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.
  • 33. Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { Tuesday, September 30, 14 The rest of the LineIndexMapper class and map method.
  • 34. public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } Tuesday, September 30, 14 The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
  • 35. Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Tuesday, September 30, 14 EOF
  • 36. Altogether import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Tuesday, September 30, 14 The whole shebang (6pt. font)
  • 37. Inverted Index in Spark (Scala). Tuesday, September 30, 14 This code is approximately 45 lines, but it does more than the previous Java example, it implements the original inverted index algorithm I diagrammed where word counts are computed and included in the data.
  • 38. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""W+""") map { word => (word, path) } Tuesday, September 30, 14 It starts with imports, then declares a singleton object (a first-­‐class concept in Scala), with a main rouEne (as in Java). The methods are colored yellow again. Note this Eme how dense with meaning they are this Eme. }
  • 39. val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { Tuesday, September 30, 14 You being the workflow by declaring a SparkContext (in “local” mode, in this case). The rest of the program is a sequence of funcEon calls, analogous to “pipes” we connect together to perform the data flow. Next we read one or more text files. If “data/crawl” has 1 or more Hadoop-­‐style “part-­‐NNNNN” files, Spark will process all of them (in parallel if running a distributed configuraEon; they will be processed synchronously in local mode).
  • 40. sc.textFile("data/crawl") .map { line => val array = line.split("t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w Tuesday, September 30, 14 sc.textFile returns an RDD with a string for each line of input text. So, the first thing we do is map over these strings to extract the original document id (i.e., file name), followed by the text in the document, all on one line. We assume tab is the separator. “(array(0), array(1))” returns a two-­‐element “tuple”. Think of the output RDD has having a schema “String fileName, String text”.
  • 41. } .flatMap { case (path, text) => text.split("""W+""") map { word => (word, path) } Beautiful, } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => no? Tuesday, September 30, 14 flatMap maps over each of these 2-­‐element tuples. We split the text into words on non-­‐alphanumeric characters, then output collecEons of word (our ulEmate, final “key”) and the path. Each line is converted to a collecEon of (word,path) pairs, so flatMap converts the collecEon of collecEons into one long “flat” collecEon of (word,path) pairs. Then we map over these pairs and add a single count of 1.
  • 42. Tuesday, September 30, 14 Another example of a beauEful and profound DSL, in this case from the world of Physics: Maxwell’s equaEons: hOp://upload.wikimedia.org/ wikipedia/commons/c/c4/Maxwell'sEquaEons.svg
  • 43. case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => (word1, (path1, n1)) (word2, (path2, n2)) ... val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } Tuesday, September 30, 14 reduceByKey does an implicit “group by” to bring together all occurrences of the same (word, path) and then sums up their counts. Note the input to the next map is now ((word, path), n), where n is now >= 1. We transform these tuples into the form we actually want, (word, (path, n)).
  • 44. } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => (word, seq((word, (path1, n1)), (word, (path2, n2)), ...)) val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } (word, “(path1, n1), (path2, n2), ...”) Tuesday, September 30, 14 Now we do an explicit group by to bring all the same words together. The output will be (word, (word, (path1, n1)), (word, (path2, n2)), ...). The last map removes the redundant “word” values in the sequences of the previous output. It outputs the sequence as a final string of comma-­‐ separated (path,n) pairs. We finish by saving the output as text file(s) and stopping the workflow.
  • 45. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } Altogether Tuesday, September 30, 14 The whole shebang (12pt. font, this Eme)
  • 46. sc.textFile("data/crawl") .map { line => val array = line.split("t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""W+""") map { word => (word, path) } Concise Combinators! } .map { case (w, p) => ((w, p), 1) } .reduceByKey{ case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { Tuesday, September 30, 14 I’ve shortened the original code a bit, e.g., using * import statements instead of separate imports for each class. I’m not going to explain every line ... nor most lines. Everything is in one outer class. We start with a main rouEne that sets up the job. LoOa boilerplate...
  • 47. The Spark version took me ~30 minutes to write. Tuesday, September 30, 14 Once you learn the core primiEves I used, and a few tricks for manipulaEng the RDD tuples, you can very quickly build complex algorithms for data processing! The Spark API allowed us to focus almost exclusively on the “domain” of data transformaEons, while the Java MapReduce version (which does less), forced tedious aOenEon to infrastructure mechanics.
  • 48. Use a SQL query when you can!! Tuesday, September 30, 14
  • 49. Spark SQL! Mix SQL queries with the RDD API. Tuesday, September 30, 14 Use the best tool for a particular problem.
  • 50. Spark SQL! Create, Read, and Delete Hive Tables Tuesday, September 30, 14 Interoperate with Hive, the original Hadoop SQL tool.
  • 51. Spark SQL! Read JSON and Infer the Schema Tuesday, September 30, 14 Read strings that are JSON records, infer the schema on the fly. Also, write RDD records as JSON.
  • 52. SparkSQL ~10-100x the performance of Hive, due to in-memory caching of RDDs & better Spark abstractions. Tuesday, September 30, 14
  • 53. Combine SparkSQL queries with Machine Learning code. Tuesday, September 30, 14 We’ll use the Spark “MLlib” in the example, then return to it in a moment.
  • 54. CREATE TABLE Users( userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN); CREATE TABLE Events( userId INT, action INT); HiveQL Schemas definiEons. Tuesday, September 30, 14 This example adapted from the following blog post announcing Spark SQL: hOp://databricks.com/blog/2014/03/26/Spark-­‐SQL-­‐manipulaEng-­‐structured-­‐data-­‐using-­‐Spark.html Assume we have these Hive/Shark tables, with data about our users and events that have occurred.
  • 55. val trainingDataTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") val trainingData = trainingDataTable map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude Tuesday, September 30, 14 Here is some Spark (Scala) code with an embedded SQL query that joins the Users and Events tables. The “””...””” string allows embedded line feeds. The “sql” funcEon returns an RDD, which we then map over to create LabeledPoints, an object used in Spark’s MLlib (machine learning library) for a recommendaEon engine. The “label” is the kind of event and the user’s age and lat/long coordinates are the “features” used for making recommendaEons. (E.g., if you’re 25 and near a certain locaEon in the city, you might be interested a nightclub near by...)
  • 56. } val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude FROM Users WHERE subscribed = FALSE""") case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // Hive table scores.registerAsTable("Scores") Tuesday, September 30, 14 Next we train the recommendaEon engine, using a “logisEc regression” fit to the training data, where “stochasEc gradient descent” (SGD) is used to train it. (This is a standard tool set for recommendaEon engines; see for example: hOp://www.cs.cmu.edu/~wcohen/10-­‐605/assignments/ sgd.pdf)
  • 57. .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude FROM Users WHERE subscribed = FALSE""") case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // Hive table scores.registerAsTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId Tuesday, September 30, 14 Now run a query against all users who aren’t already subscribed to noEficaEons.
  • 58. case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // Hive table scores.registerAsTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Tuesday, September 30, 14 Declare a class to hold each user’s “score” as produced by the recommendaEon engine and map the “all” query results to Scores. Then “register” the scores RDD as a “Scores” table in Hive’s metadata respository. This is equivalent to running a “CREATE TABLE Scores ...” command at the Hive/Shark prompt!
  • 59. } // Hive table scores.registerAsTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Tuesday, September 30, 14 Finally, run a new query to find the people with the highest scores that aren’t already subscribing to noEficaEons. You might send them an email next recommending that they subscribe...
  • 60. Event Stream Processing Tuesday, September 30, 14
  • 61. Spark Streaming Use the same abstractions for near real-time, event streaming. Tuesday, September 30, 14 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 62. Tuesday, September 30, 14 A DSTream (discretized stream) wraps the RDDs for each “batch” of events. You can specify the granularity, such as all events in 1 second batches, then your Spark job is passed each batch of data for processing. You can also work with moving windows of batches.
  • 63. Very similar code... Tuesday, September 30, 14
  • 64. val sc = new SparkContext(...) val ssc = new StreamingContext( sc, Seconds(1)) // A DStream that will listen to server:port val lines = ssc.socketTextStream(server, port) // Word Count... val words = lines flatMap { line => line.split("""W+""") } val pairs = words map (word => (word, 1)) val wordCounts = pairs reduceByKey ((n1, n2) => n1 + n2) wordCount.print() // print a few counts... ssc.start() Tuesday, September 30, 14 This example adapted from the following page on the Spark website: hOp://spark.apache.org/docs/0.9.0/streaming-­‐programming-­‐guide.html#a-­‐quick-­‐example We create a StreamingContext that wraps a SparkContext (there are alternaEve ways to construct it...). It will “clump” the events into 1-­‐second intervals. Next we setup a socket to stream text to us from another server and port (one of several ways to ingest data).
  • 65. ssc.socketTextStream(server, port) // Word Count... val words = lines flatMap { line => line.split("""W+""") } val pairs = words map (word => (word, 1)) val wordCounts = pairs reduceByKey ((n1, n2) => n1 + n2) wordCount.print() // print a few counts... ssc.start() ssc.awaitTermination() Tuesday, September 30, 14 Now the “word count” happens over each interval (aggregaEon across intervals is also possible), but otherwise it works like before. Once we setup the flow, we start it and wait for it to terminate through some means, such as the server socket closing.
  • 66. Machine Learning Library Tuesday, September 30, 14
  • 67. MLlib •Linear regression •Binary classification •Collaborative filtering •Clustering •Others... Tuesday, September 30, 14 Not as full-featured as more mature toolkits, but the Mahout project has announced they are going to port their algorithms to Spark, which include powerful Mathematics, e.g., Matrix support libraries.
  • 68. Distributed Graph Computing Tuesday, September 30, 14 Some problems are more naturally represented as graphs. Extends RDDs to support property graphs with directed edges.
  • 69. Graphs •Examples: •Facebook friend networks •Twitter followers •Page Rank Tuesday, September 30, 14
  • 70. Graphs •Graph algorithms •Friends of friends •Minimum spanning tree •Cliques Tuesday, September 30, 14
  • 72. Tuesday, September 30, 14 Why is Spark so good (and Java MapReduce so bad)? Because fundamentally, data analytics is Mathematics and programming tools inspired by Mathematics - like Functional Programming - are ideal tools for working with data. This is why Spark code is so concise, yet powerful. This is why it is a great platform for performance optimizations. This is why Spark is a great platform for higher-level tools, like SQL, graphs, etc. Interest in FP started growing ~10 years ago as a tool to attack concurrency. I believe that data is now driving FP adoption even faster. I know many Java shops that switched to Scala when they adopted tools like Spark and Scalding (https://github.com/twitter/scalding).
  • 73. Spark A flexible, scalable distributed compute platform with concise, powerful APIs and higher-order tools. spark.apache.org Tuesday, September 30, 14
  • 74. Why Spark Is the Next Top (Compute) Model @deanwampler polyglotprogramming.com/talks Tuesday, September 30, 14 Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted. Image: The London Eye on one side of the Thames, Parliament on the other.