Why Spark Is
the Next Top
(Compute)
Model
Philly ETE 2014
April 22-23, 2014
@deanwampler
polyglotprogramming.com/talks
Thursday, May 1, 14
Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted.
Image: Detail of the London Eye
Dean Wampler
dean.wampler@typesafe.com
polyglotprogramming.com/talks
@deanwampler
Thursday, May 1, 14
About me. You can find this presentation and others on Big Data and Scala at polyglotprogramming.com.
Programming Scala, 2nd Edition is forthcoming.
photo: Dusk at 30,000 ft above the Central Plains of the U.S. on a Winter’s Day.
Or
this?
Thursday, May 1, 14
photo: https://twitter.com/john_overholt/status/447431985750106112/photo/1
Hadoop
circa 2013
Thursday, May 1, 14
The state of Hadoop as of last year.
Image: Detail of the London Eye
Hadoop v2.X Cluster
[Diagram: worker nodes, each with several disks, a Node Manager, and a Data Node; master nodes running the Resource Manager and federated Name Nodes]
Thursday, May 1, 14
Schematic view of a Hadoop 2 cluster. For a more precise definition of the services and what they do, see e.g., http://hadoop.apache.org/
docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html We aren’t interested in great details at this point, but we’ll call out a few useful things to
know.
Hadoop v2.X Cluster
[Same cluster diagram, highlighting the Resource and Node Managers]
Thursday, May 1, 14
Hadoop 2 uses YARN to manage resources via the master Resource Manager, which includes a pluggable job scheduler and an Applications
Manager. It coordinates with the Node Manager on each node to schedule jobs and provide resources. Other services involved, including
application-specific Containers and Application Masters are not shown.
Hadoop v2.X Cluster
[Same cluster diagram, highlighting the Name Node and Data Nodes]
Thursday, May 1, 14
Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system
operations and resiliency when service instances fail. The data node services manage individual blocks for files.
MapReduce
The classic compute model
for Hadoop
Thursday, May 1, 14
1 map step + 1 reduce step
(wash, rinse, repeat)
MapReduce
Thursday, May 1, 14
You get 1 map step (although there is limited support for chaining mappers) and 1 reduce step. If you can’t implement an algorithm in these
two steps, you can chain jobs together, but you’ll pay a tax of flushing the entire data set to disk between these jobs.
Example:
Inverted Index
MapReduce
Thursday, May 1, 14
The inverted index is a classic algorithm needed for building search engines.
Thursday, May 1, 14
Before running MapReduce, crawl teh interwebs, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS
or one of the more “sophisticated” formats.
Thursday, May 1, 14
Now we’re running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text and
outputs key-value pairs (“tuples”)...
Map Task
(hadoop,(wikipedia.org/hadoop,1))
(mapreduce,(wikipedia.org/hadoop, 1))
(hdfs,(wikipedia.org/hadoop, 1))
(provides,(wikipedia.org/hadoop,1))
(and,(wikipedia.org/hadoop,1))
Thursday, May 1, 14
... the keys are each word found and the values are themselves tuples, each URL and the count of the word. In our simplified example, there are typically only single occurrences of each word in each document. The real occurrences are interesting because if a word is mentioned a lot in a document, the chances are higher that you would want to find that document in a search for that word.
Thursday, May 1, 14
Thursday, May 1, 14
The output tuples are sorted by key locally in each map task, then “shuffled” over the cluster network to reduce tasks (each a JVM process,
too), where we want all occurrences of a given key to land on the same reduce task.
Thursday, May 1, 14
Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of
(URL-count) tuples (pairs).
Altogether...
Thursday, May 1, 14
What’s
not to like?
Thursday, May 1, 14
This seems okay, right? What’s wrong with it?
Most algorithms are
much harder to implement
in this restrictive
map-then-reduce model.
Awkward
Thursday, May 1, 14
Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/
MapReduceAlgorithms/.
Lack of flexibility inhibits
optimizations, too.
Awkward
Thursday, May 1, 14
The inflexible compute model leads to complex code, where hacks are used to work around the limitations and improve performance. Hence, optimizations are hard to implement. The Spark team has commented on this; see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
Full dump to disk
between jobs.
Performance
Thursday, May 1, 14
Sequencing jobs wouldn’t be so bad if the “system” was smart enough to cache data in memory. Instead, each job dumps everything to disk,
then the next job reads it back in again. This makes iterative algorithms particularly painful.
Enter
Spark
spark.apache.org
Thursday, May 1, 14
Can be run in:
•YARN (Hadoop 2)
•Mesos (Cluster management)
•EC2
•Standalone mode
Cluster Computing
Thursday, May 1, 14
If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters, but there you have to manually allocate resources to the embedded Spark cluster vs. Hadoop.) Mesos is a general-purpose cluster resource manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means you run Spark's built-in support for clustering (or run locally on a single box, e.g., for development). EC2 deployments are usually standalone...
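To make those options concrete, here is a minimal, hedged sketch of how the cluster manager is selected when constructing a SparkContext; the host names are placeholders, not from the talk.
import org.apache.spark.SparkContext
// Possible master URLs (Spark 0.9/1.x era):
//   "local[4]"            - run locally with 4 worker threads (development)
//   "spark://master:7077" - Spark standalone cluster (typical for EC2)
//   "mesos://master:5050" - Mesos cluster
//   "yarn-client"         - Hadoop 2 / YARN (usually supplied via configuration)
val sc = new SparkContext("local[4]", "Inverted Index")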
Fine-grained “combinators”
for composing algorithms.
Compute Model
Thursday, May 1, 14
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little
code.
RDDs:
Resilient,
Distributed
Datasets
Compute Model
Thursday, May 1, 14
RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which
means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce
jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
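As a hedged illustration of that caching point (not code from the talk), marking an RDD with cache() keeps its shards in memory after the first action, so later passes skip the disk:
// Assumes a SparkContext named sc, as in the examples later in this deck.
val words = sc.textFile("data/crawl")
  .flatMap(line => line.split("""\W+"""))
  .cache()                                          // keep the computed shards in memory
val total   = words.count()                         // first action computes and caches
val hadoops = words.filter(_ == "hadoop").count()   // reuses the cached data; no re-read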
Written in Scala,
with Java and Python APIs.
Compute Model
Thursday, May 1, 14
Inverted Index
in MapReduce
(Java).
Thursday, May 1, 14
Let's see an actual implementation of the inverted index. First, a Hadoop MapReduce (Java) version, adapted from https://developer.yahoo.com/hadoop/tutorial/module4.html#solution. It's about 90 lines of code, but I reformatted it to fit better.
This is also a slightly simpler version than the one I diagrammed. It doesn't record a count of each word in a document; it just writes (word, doc-title) pairs out of the mappers, and the final (word, list) output by the reducers just has a list of documents, hence repeats. A second job would be necessary to count the repeats.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
Thursday, May 1, 14
I've shortened the original code a bit, e.g., using * import statements instead of separate imports for each class.
I'm not going to explain every line ... nor most lines.
Everything is in one outer class. We start with a main routine that sets up the job. Lotta boilerplate...
I used yellow for method calls, because methods do the real work! But notice that the functions in this code don't really do a whole lot...
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
Thursday, May 1, 14
boilerplate...
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
Thursday, May 1, 14
main ends with a try-catch clause to run the
job.
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
Thursday, May 1, 14
This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name)
tuples.
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Thursday, May 1, 14
The rest of the LineIndexMapper class and map
method.
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
Thursday, May 1, 14
The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is
stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
Thursday, May 1, 14
EOF
Altogether
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
Thursday, May 1, 14
The  whole  shebang  (6pt.  font)
Inverted Index
in Spark
(Scala).
Thursday, May 1, 14
This code is approximately 45 lines, but it does more than the previous Java example: it implements the original inverted index algorithm I diagrammed, where word counts are computed and included in the data.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
Thursday, May 1, 14
It starts with imports, then declares a singleton object (a first-class concept in Scala), with a main routine (as in Java).
The methods are colored yellow again. Note how dense with meaning they are this time.
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
Thursday, May 1, 14
You begin the workflow by declaring a SparkContext (in "local" mode, in this case). The rest of the program is a sequence of function calls, analogous to "pipes" we connect together to perform the data flow.
Next we read one or more text files. If "data/crawl" has 1 or more Hadoop-style "part-NNNNN" files, Spark will process all of them (in parallel if running a distributed configuration; they will be processed synchronously in local mode).
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
Thursday, May 1, 14
sc.textFile returns an RDD with a string for each line of input text. So, the first thing we do is map over these strings to extract the original document id (i.e., file name), followed by the text in the document, all on one line. We assume tab is the separator. "(array(0), array(1))" returns a two-element "tuple". Think of the output RDD as having the schema "String fileName, String text".
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
}
.map {
Beautiful,
no?
Thursday, May 1, 14
flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final "key") and the path. Each line is converted to a collection of (word, path) pairs, so flatMap converts the collection of collections into one long "flat" collection of (word, path) pairs.
Then we map over these pairs and add a single count of 1.
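A tiny local illustration of that flattening, using plain Scala collections and made-up data (not from the talk):
val docs = Seq(("doc1", "hadoop provides hdfs"), ("doc2", "spark uses rdds"))
val pairs = docs.flatMap { case (path, text) =>
  text.split("""\W+""").map(word => (word, path))   // per-line collection of (word, path)
}
// pairs: Seq((hadoop,doc1), (provides,doc1), (hdfs,doc1), (spark,doc2), (uses,doc2), (rdds,doc2))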
Thursday, May 1, 14
Another example of a beautiful and profound DSL, in this case from the world of Physics: Maxwell's equations: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
(word1,  (path1,  n1))
(word2,  (path2,  n2))
...
Thursday, May 1, 14
reduceByKey does an implicit "group by" to bring together all occurrences of the same (word, path) and then sums up their counts.
Note the input to the next map is now ((word, path), n), where n is now >= 1. We transform these tuples into the form we actually want, (word, (path, n)).
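A small hedged sketch of that reduceByKey step on made-up data (assumes the SparkContext sc and the SparkContext._ import from the example):
val counted = sc.parallelize(Seq(
    (("hadoop", "doc1"), 1), (("hadoop", "doc1"), 1), (("hdfs", "doc1"), 1)))
  .reduceByKey((n1, n2) => n1 + n2)   // group by the (word, path) key and sum the 1s
// counted contains (("hadoop","doc1"),2) and (("hdfs","doc1"),1)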
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
(word,  seq((word,  (path1,  n1)),  (word,  (path2,  n2)),  ...))
(word,  “(path1,  n1),  (path2,  n2),  ...”)
Thursday, May 1, 14
Now we do an explicit group by to bring all the same words together. The output will be (word, seq((word, (path1, n1)), (word, (path2, n2)), ...)).
The last map removes the redundant "word" values in the sequences of the previous output. It outputs the sequence as a final string of comma-separated (path, n) pairs.
We finish by saving the output as text file(s) and stopping the workflow.
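The same final steps on a small in-memory example (plain Scala collections, made-up data) to show the shape transformation:
val wordPathCounts = Seq(("hadoop", ("doc1", 2)), ("hadoop", ("doc2", 1)), ("hdfs", ("doc1", 1)))
val index = wordPathCounts
  .groupBy { case (w, _) => w }                       // word -> Seq((word, (path, n)), ...)
  .map { case (w, seq) =>
    (w, seq.map { case (_, (p, n)) => (p, n) }.mkString(", "))
  }
// index: Map(hadoop -> "(doc1,2), (doc2,1)", hdfs -> "(doc1,1)")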
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
Altogether
Thursday, May 1, 14
The whole shebang (12pt. font, this time)
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey{
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
Powerful
Combinators!
Thursday, May 1, 14
The Spark version
took me ~30 minutes
to write.
Thursday, May 1, 14
Once you learn the core primitives I used, and a few tricks for manipulating the RDD tuples, you can very quickly build complex algorithms for data processing!
The Spark API allowed us to focus almost exclusively on the "domain" of data transformations, while the Java MapReduce version (which does less) forced tedious attention to infrastructure mechanics.
Shark:
Hive (SQL query tool)
ported to Spark.
SQL!
Thursday, May 1, 14
Use SQL when you can!
Use a SQL query when
you can!!
Thursday, May 1, 14
CREATE EXTERNAL TABLE stocks (
symbol STRING,
ymd STRING,
price_open STRING,
price_close STRING,
shares_traded INT)
LOCATION 'hdfs://data/stocks';
-- Year-over-year average closing price.
SELECT year(ymd), avg(price_close)
FROM stocks
WHERE symbol = 'AAPL'
GROUP BY year(ymd);
Hive  Query  
Language.
Thursday, May 1, 14
Hive and the Spark port, Shark, let you use SQL to query and manipulate structured and semistructured data. By default, tab-delimited text files will be assumed for the files found in "data/stocks". Alternatives can be configured as part of the table metadata.
~10-100x the performance of
Hive, due to in-memory
caching of RDDs & better
Spark abstractions.
Shark
Thursday, May 1, 14
Spark SQL:
Next generation
SQL query tool/API.
Did I mention SQL?
Thursday, May 1, 14
A new query optimizer. It will become the basis for Shark internals, replacing messy Hive code that is hard to reason about, extend, and debug.
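For orientation, here is a hedged, self-contained sketch of the Spark SQL pattern the next few slides rely on: create a context whose sql(...) calls return RDDs, register data as a table, and query it. This uses the Spark 1.0 alpha API, and the User case class is made up, a small subset of the Users table shown below.
import org.apache.spark.sql.SQLContext

case class User(userId: Int, name: String, age: Int)

val sqlContext = new SQLContext(sc)   // sc is the usual SparkContext
import sqlContext._                   // brings sql(...) and createSchemaRDD into scope

val users = sc.parallelize(Seq(User(1, "alice", 25), User(2, "bob", 17)))
users.registerAsTable("Users")        // SQL queries can now refer to "Users"

sql("SELECT name FROM Users WHERE age >= 18").collect().foreach(println)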
Combine Shark SQL
queries with Machine
Learning code.
Thursday, May 1, 14
We'll use the Spark "MLlib" in the example, then return to it in a moment.
CREATE TABLE Users(
userId INT,
name STRING,
email STRING,
age INT,
latitude DOUBLE,
longitude DOUBLE,
subscribed BOOLEAN);
CREATE TABLE Events(
userId INT,
action INT);
Hive/Shark table
definitions
(not Scala).
Thursday, May 1, 14
This example adapted from the following blog post announcing Spark SQL:
http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
Assume we have these Hive/Shark tables, with data about our users and events that have occurred.
val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
Spark  Scala.
Thursday, May 1, 14
Here is some Spark (Scala) code with an embedded SQL/Shark query that joins the Users and Events tables. The """...""" string allows embedded line feeds.
The "sql" function returns an RDD, which we then map over to create LabeledPoints, an object used in Spark's MLlib (machine learning library) for a recommendation engine. The "label" is the kind of event and the user's age and lat/long coordinates are the "features" used for making recommendations. (E.g., if you're 25 and near a certain location in the city, you might be interested in a nightclub nearby...)
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// Hive table
scores.registerAsTable("Scores")
Thursday, May 1, 14
Next we train the recommendation engine, using a "logistic regression" fit to the training data, where "stochastic gradient descent" (SGD) is used to train it. (This is a standard tool set for recommendation engines; see for example: http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf)
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// Hive table
scores.registerAsTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
Thursday, May 1, 14
Now run a query against all users who aren't already subscribed to notifications.
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// Hive table
scores.registerAsTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Thursday, May 1, 14
Declare a class to hold each user's "score" as produced by the recommendation engine and map the "all" query results to Scores.
Then "register" the scores RDD as a "Scores" table in Hive's metadata repository. This is equivalent to running a "CREATE TABLE Scores ..." command at the Hive/Shark prompt!
// Hive table
scores.registerAsTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Thursday, May 1, 14
Finally, run a new query to find the people with the highest scores that aren't already subscribing to notifications. You might send them an email next, recommending that they subscribe...
Spark Streaming:
Use the same abstractions for
real-time, event streaming.
Cluster Computing
Thursday, May 1, 14
[Diagram: a continuous data stream chopped into 1-second batches; each batch is processed by a Spark job iteration (#1, #2, ...)]
Thursday, May 1, 14
You can specify the granularity, such as all events in 1-second windows, and then your Spark job is passed each window of data for processing.
Very similar code...
Thursday, May 1, 14
val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(1))
// A DStream that will listen to server:port
val lines =
ssc.socketTextStream(server, port)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map (word => (word, 1))
val wordCounts =
pairs reduceByKey ((n1, n2) => n1 + n2)
wordCounts.print() // print a few counts...
Thursday, May 1, 14
This example adapted from the following page on the Spark website:
http://spark.apache.org/docs/0.9.0/streaming-programming-guide.html#a-quick-example
We create a StreamingContext that wraps a SparkContext (there are alternative ways to construct it...). It will "clump" the events into 1-second intervals. Next we set up a socket to stream text to us from another server and port (one of several ways to ingest data).
ssc.socketTextStream(server, port)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map (word => (word, 1))
val wordCounts =
pairs reduceByKey ((n1, n2) => n1 + n2)
wordCounts.print() // print a few counts...
ssc.start()
ssc.awaitTermination()
Thursday, May 1, 14
Now the "word count" happens over each interval (aggregation across intervals is also possible), but otherwise it works like before.
Once we set up the flow, we start it and wait for it to terminate through some means, such as the server socket closing.
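Aggregation across intervals would look roughly like this hedged sketch using a sliding window over the pairs DStream from the example; the window and slide durations are made up.
// Assumes the usual streaming imports (e.g., org.apache.spark.streaming.StreamingContext._
// for the pair-DStream operations).
val windowedCounts = pairs.reduceByKeyAndWindow(
  (n1: Int, n2: Int) => n1 + n2,   // combine counts for the same word within the window
  Seconds(30),                     // window length
  Seconds(10))                     // slide interval
windowedCounts.print()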
MLlib:
Machine learning library.
Cluster Computing
Thursday, May 1, 14
Spark implements many machine learning algorithms, although a lot more are needed, compared to more mature tools like Mahout and libraries in Python and R.
•Linear regression
•Binary classification
•Collaborative filtering
•Clustering
•Others...
MLlib
Thursday, May 1, 14
Not as full-featured as more mature toolkits, but the Mahout project has announced they are going to port their algorithms, which include powerful mathematics libraries (e.g., matrix support), to Spark.
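As one hedged example of the clustering support, here is a KMeans sketch; it assumes a slightly newer Spark 1.x MLlib API than the Spark version in this talk, and the data points are made up.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters of 2-D points.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)))

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)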
GraphX:
Graphical models
and algorithms.
Cluster Computing
Thursday, May 1, 14
Some problems are more naturally represented as graphs.
Extends RDDs to support property graphs with directed edges.
Thursday, May 1, 14
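A tiny hedged GraphX sketch of such a property graph; the vertices and edges are made up, not from the talk.
import org.apache.spark.graphx.{Edge, Graph}

val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)            // vertices and edges both carry attributes
graph.inDegrees.collect().foreach(println)   // e.g., how many followers each user has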
Why is Spark so good (and Java MapReduce so bad)? Because fundamentally, data analytics is Mathematics and programming tools inspired
by Mathematics - like Functional Programming - are ideal tools for working with data. This is why Spark code is so concise, yet powerful. This
is why it is a great platform for performance optimizations. This is why Spark is a great platform for higher-level tools, like SQL, graphs, etc.
Interest in FP started growing ~10 years ago as a tool to attack concurrency. I believe that data is now driving FP adoption even faster. I know
many Java shops that switched to Scala when they adopted tools like Spark and Scalding (https://github.com/twitter/scalding).
A flexible, scalable distributed
compute platform with
concise, powerful APIs and
higher-order tools.
spark.apache.org
Spark
Thursday, May 1, 14
Why Spark Is
the Next Top
(Compute)
Model
@deanwampler
polyglotprogramming.com/talks
Thursday, May 1, 14
Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted.
Image: The London Eye on one side of the Thames, Parliament on the other.
  • 12. Thursday, May 1, 14 Now we’re running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text and outputs key-value pairs (“tuples”)...
  • 13. Map Task (hadoop,(wikipedia.org/hadoop,1)) (mapreduce,(wikipedia.org/hadoop, 1)) (hdfs,(wikipedia.org/hadoop, 1)) (provides,(wikipedia.org/hadoop,1)) (and,(wikipedia.org/hadoop,1)) Thursday, May 1, 14 ... the keys are each word found and the values are themselves tuples, the URL and the count of the word in that document. In our simplified example, there are typically only single occurrences of each word in each document. The real counts are interesting because if a word is mentioned a lot in a document, the chances are higher that you would want to find that document in a search for that word.
  • 15. Thursday, May 1, 14 The output tuples are sorted by key locally in each map task, then “shuffled” over the cluster network to reduce tasks (each a JVM process, too), where we want all occurrences of a given key to land on the same reduce task.
  • 16. Thursday, May 1, 14 Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of (URL-count) tuples (pairs).
  • 17. Altogether... Thursday, May 1, 14 Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of (URL-count) tuples (pairs).
  • 18. What’s not to like? Thursday, May 1, 14 This seems okay, right? What’s wrong with it?
  • 19. Most algorithms are much harder to implement in this restrictive map-then-reduce model. Awkward Thursday, May 1, 14 Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/MapReduceAlgorithms/.
  • 20. Lack of flexibility inhibits optimizations, too. Awkward Thursday, May 1, 14 The inflexible compute model leads to complex code: hacks are needed to work around its limitations just to improve performance. Hence, optimizations are hard to implement. The Spark team has commented on this; see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
  • 21. Full dump to disk between jobs. Performance Thursday, May 1, 14 Sequencing jobs wouldn’t be so bad if the “system” was smart enough to cache data in memory. Instead, each job dumps everything to disk, then the next job reads it back in again. This makes iterative algorithms particularly painful.
  • 23. Can be run in: •YARN (Hadoop 2) •Mesos (Cluster management) •EC2 •Standalone mode Cluster Computing Thursday, May 1, 14 If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters, but there you have to manually allocate resources to the embedded Spark cluster vs. Hadoop.) Mesos is a general-purpose cluster resource manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means you run Spark’s built-in support for clustering (or run locally on a single box - e.g., for development). EC2 deployments are usually standalone...
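
For concreteness, the choice of deployment mode mostly shows up as the master setting handed to the SparkContext; the rest of the program is unchanged. A minimal sketch (not from the slides; the host name and port are hypothetical, and the two-argument constructor is the same one used in the examples later in this talk):

import org.apache.spark.SparkContext

// Build a context for whatever cluster we are targeting.
def makeContext(master: String) = new SparkContext(master, "Inverted Index")

// Local mode, e.g., for development and tests:
val sc = makeContext("local")

// A standalone cluster would use a master URL like "spark://master-host:7077";
// Mesos uses a "mesos://host:port" URL; YARN has its own master setting.
// The RDD code that follows is identical in every case.
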
  • 24. Fine-grained “combinators” for composing algorithms. Compute Model Thursday, May 1, 14 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 25. RDDs: Resilient, Distributed Datasets Compute Model Thursday, May 1, 14 RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
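
To make the caching point concrete, here is a minimal sketch (not from the slides; the input path is the hypothetical crawl data used elsewhere in this talk). Without cache(), every action would re-read and re-parse the input from disk; with it, later passes run against the in-memory shards:

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "Caching Sketch")

// Parse the (hypothetical) crawl output once and keep the result in memory.
val docs = sc.textFile("data/crawl")
  .map { line =>
    val array = line.split("\t", 2)
    (array(0), array(1))   // (path, text)
  }
  .cache()                 // mark the RDD for in-memory caching

// Both actions reuse the cached shards instead of re-reading the files.
val docCount  = docs.count()
val largeDocs = docs.filter { case (_, text) => text.length > 10000 }.count()
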
  • 26. Written in Scala, with Java and Python APIs. Compute Model Thursday, May 1, 14 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 27. Inverted Index in MapReduce (Java). Thursday, May 1, 14 Let’s see an actual implementation of the inverted index. First, a Hadoop MapReduce (Java) version, adapted from https://developer.yahoo.com/hadoop/tutorial/module4.html#solution It’s about 90 lines of code, but I reformatted it to fit better. This is also a slightly simpler version than the one I diagrammed. It doesn’t record a count of each word in a document; it just writes (word, doc-title) pairs out of the mappers, and the final (word, list) output by the reducers just has a list of documents, hence repeats. A second job would be necessary to count the repeats.
  • 28. import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); Thursday, May 1, 14 I’ve shortened the original code a bit, e.g., using * import statements instead of separate imports for each class. I’m not going to explain every line ... nor most lines. Everything is in one outer class. We start with a main routine that sets up the job. Lotta boilerplate... I used yellow for method calls, because methods do the real work!! But notice that the functions in this code don’t really do a whole lot...
  • 29. JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); Thursday, May 1, 14 boilerplate...
  • 30. LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = Thursday, May 1, 14 main ends with a try-catch clause to run the job.
  • 31. public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); Thursday, May 1, 14 This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.
  • 32. FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Thursday, May 1, 14 The rest of the LineIndexMapper class and map method.
  • 33. public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); Thursday, May 1, 14 The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
  • 34. boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Thursday, May 1, 14 EOF
  • 35. Altogether import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; public class LineIndexer { public static void main(String[] args) { JobClient client = new JobClient(); JobConf conf = new JobConf(LineIndexer.class); conf.setJobName("LineIndexer"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(Text.class); FileInputFormat.addInputPath(conf, new Path("input")); FileOutputFormat.setOutputPath(conf, new Path("output")); conf.setMapperClass( LineIndexMapper.class); conf.setReducerClass( LineIndexReducer.class); client.setConf(conf); try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } } public static class LineIndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { private final static Text word = new Text(); private final static Text location = new Text(); public void map( LongWritable key, Text val, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { FileSplit fileSplit = (FileSplit)reporter.getInputSplit(); String fileName = fileSplit.getPath().getName(); location.set(fileName); String line = val.toString(); StringTokenizer itr = new StringTokenizer(line.toLowerCase()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, location); } } } public static class LineIndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { boolean first = true; StringBuilder toReturn = new StringBuilder(); while (values.hasNext()) { if (!first) toReturn.append(", "); first=false; toReturn.append( values.next().toString()); } output.collect(key, new Text(toReturn.toString())); } } } Thursday, May 1, 14 The  whole  shebang  (6pt.  font)
  • 36. Inverted Index in Spark (Scala). Thursday, May 1, 14 This code is approximately 45 lines, but it does more than the previous Java example: it implements the original inverted index algorithm I diagrammed, where word counts are computed and included in the data.
  • 37. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } Thursday, May 1, 14 It starts with imports, then declares a singleton object (a first-class concept in Scala), with a main routine (as in Java). The methods are colored yellow again. Note how dense with meaning they are this time.
  • 38. val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } Thursday, May 1, 14 You begin the workflow by declaring a SparkContext (in "local" mode, in this case). The rest of the program is a sequence of function calls, analogous to "pipes" we connect together to perform the data flow. Next we read one or more text files. If "data/crawl" has 1 or more Hadoop-style "part-NNNNN" files, Spark will process all of them (in parallel if running a distributed configuration; they will be processed synchronously in local mode).
  • 39. .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { Thursday, May 1, 14 sc.textFile returns an RDD with a string for each line of input text. So, the first thing we do is map over these strings to extract the original document id (i.e., file name), followed by the text in the document, all on one line. We assume tab is the separator. "(array(0), array(1))" returns a two-element "tuple". Think of the output RDD as having a schema "String fileName, String text".
  • 40. } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { Beautiful, no? Thursday, May 1, 14 flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final "key") and the path. Each line is converted to a collection of (word, path) pairs, so flatMap converts the collection of collections into one long "flat" collection of (word, path) pairs. Then we map over these pairs and add a single count of 1.
  • 41. Thursday, May 1, 14 Another example of a beautiful and profound DSL, in this case from the world of Physics: Maxwell’s equations: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
  • 42. } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() (word1,  (path1,  n1)) (word2,  (path2,  n2)) ... Thursday, May 1, 14 reduceByKey  does  an  implicit  “group  by”  to  bring  together  all  occurrences  of  the  same  (word,  path)  and  then  sums  up  their  counts. Note  the  input  to  the  next  map  is  now  ((word,  path),  n),  where  n  is  now  >=  1.  We  transform  these  tuples  into  the  form  we  actually  want,  (word,   (path,  n)).
  • 43. } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } (word, seq((word, (path1, n1)), (word, (path2, n2)), ...)) (word, "(path1, n1), (path2, n2), ...") Thursday, May 1, 14 Now we do an explicit group by to bring all the same words together. The output will be (word, seq((word, (path1, n1)), (word, (path2, n2)), ...)). The last map removes the redundant "word" values in the sequences of the previous output. It outputs the sequence as a final string of comma-separated (path, n) pairs. We finish by saving the output as text file(s) and stopping the workflow.
  • 44. import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object InvertedIndex { def main(args: Array[String]) = { val sc = new SparkContext( "local", "Inverted Index") sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } .groupBy { case (w, (p, n)) => w } .map { case (w, seq) => val seq2 = seq map { case (_, (p, n)) => (p, n) } (w, seq2.mkString(", ")) } .saveAsTextFile(argz.outpath) sc.stop() } } Altogether Thursday, May 1, 14 The whole shebang (12pt. font, this time)
  • 45. sc.textFile("data/crawl") .map { line => val array = line.split("\t", 2) (array(0), array(1)) } .flatMap { case (path, text) => text.split("""\W+""") map { word => (word, path) } } .map { case (w, p) => ((w, p), 1) } .reduceByKey { case (n1, n2) => n1 + n2 } .map { case ((w, p), n) => (w, (p, n)) } Powerful Combinators! Thursday, May 1, 14 I’ve shortened the original code a bit, e.g., using * import statements instead of separate imports for each class. I’m not going to explain every line ... nor most lines. Everything is in one outer class. We start with a main routine that sets up the job. Lotta boilerplate...
  • 46. The Spark version took me ~30 minutes to write. Thursday, May 1, 14 Once you learn the core primitives I used, and a few tricks for manipulating the RDD tuples, you can very quickly build complex algorithms for data processing! The Spark API allowed us to focus almost exclusively on the "domain" of data transformations, while the Java MapReduce version (which does less) forced tedious attention to infrastructure mechanics.
  • 47. Shark: Hive (SQL query tool) ported to Spark. SQL! Thursday, May 1, 14 Use SQL when you can!
  • 48. Use a SQL query when you can!! Thursday, May 1, 14
  • 49. CREATE EXTERNAL TABLE stocks ( symbol STRING, ymd STRING, price_open STRING, price_close STRING, shares_traded INT) LOCATION "hdfs://data/stocks"; -- Year-over-year average closing price. SELECT year(ymd), avg(price_close) FROM stocks WHERE symbol = 'AAPL' GROUP BY year(ymd); Hive Query Language. Thursday, May 1, 14 Hive and the Spark port, Shark, let you use SQL to query and manipulate structured and semistructured data. By default, tab-delimited text files will be assumed for the files found in "data/stocks". Alternatives can be configured as part of the table metadata.
  • 50. ~10-100x the performance of Hive, due to in-memory caching of RDDs & better Spark abstractions. Shark Thursday, May 1, 14
  • 51. Spark SQL: Next generation SQL query tool/API. Did I mention SQL? Thursday, May 1, 14 A new query optimizer. It will become the basis for Shark internals, replacing messy Hive code that is hard to reason about, extend, and debug.
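
As a flavor of how it is used, here is a minimal sketch, assuming the Spark 1.0-era SQL API (SQLContext, the createSchemaRDD implicit for RDDs of case-class instances, registerAsTable, and sql). The Record case class, the table name, and the input path are hypothetical, not from the slides:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Record(word: String, path: String, hits: Int)

val sc = new SparkContext("local", "Spark SQL Sketch")
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicitly converts RDD[Record] to a queryable SchemaRDD

// Hypothetical inverted-index output reloaded as structured records.
val records = sc.textFile("data/inverted-index")
  .map(_.split(","))
  .map(a => Record(a(0).trim, a(1).trim, a(2).trim.toInt))

records.registerAsTable("inverted_index")

// SQL and regular RDD code can now be mixed freely.
val top = sqlContext.sql(
  "SELECT word, path, hits FROM inverted_index ORDER BY hits DESC LIMIT 10")
top.collect().foreach(println)
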
  • 52. Combine Shark SQL queries with Machine Learning code. Thursday, May 1, 14 We’ll  use  the  Spark  “MLlib”  in  the  example,  then  return  to  it  in  a  moment.
  • 53. CREATE TABLE Users( userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN); CREATE TABLE Events( userId INT, action INT); Hive/Shark table definitions (not Scala). Thursday, May 1, 14 This example adapted from the following blog post announcing Spark SQL: http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html Assume we have these Hive/Shark tables, with data about our users and events that have occurred.
  • 54. val trainingDataTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") val trainingData = trainingDataTable map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" Spark Scala. Thursday, May 1, 14 Here is some Spark (Scala) code with an embedded SQL/Shark query that joins the Users and Events tables. The """...""" string allows embedded line feeds. The "sql" function returns an RDD, which we then map over to create LabeledPoints, an object used in Spark’s MLlib (machine learning library) for a recommendation engine. The "label" is the kind of event, and the user’s age and lat/long coordinates are the "features" used for making recommendations. (E.g., if you’re 25 and near a certain location in the city, you might be interested in a nightclub nearby...)
  • 55. val model = new LogisticRegressionWithSGD() .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude FROM Users WHERE subscribed = FALSE""") case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // Hive table scores.registerAsTable("Scores") Thursday, May 1, 14 Next we train the recommendation engine, using a "logistic regression" fit to the training data, where "stochastic gradient descent" (SGD) is used to train it. (This is a standard tool set for recommendation engines; see for example: http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf)
  • 56. .run(trainingData) val allCandidates = sql(""" SELECT userId, age, latitude, longitude FROM Users WHERE subscribed = FALSE""") case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // Hive table scores.registerAsTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s Thursday, May 1, 14 Now run a query against all users who aren’t already subscribed to notifications.
  • 57. case class Score( userId: Int, score: Double) val scores = allCandidates map { row => val features = Array[Double](row(1), row(2), row(3)) Score(row(0), model.predict(features)) } // Hive table scores.registerAsTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Thursday, May 1, 14 Declare a class to hold each user’s "score" as produced by the recommendation engine and map the "all" query results to Scores. Then "register" the scores RDD as a "Scores" table in Hive’s metadata repository. This is equivalent to running a "CREATE TABLE Scores ..." command at the Hive/Shark prompt!
  • 58. // Hive table scores.registerAsTable("Scores") val topCandidates = sql(""" SELECT u.name, u.email FROM Scores s JOIN Users u ON s.userId = u.userId ORDER BY score DESC LIMIT 100""") Thursday, May 1, 14 Finally, run a new query to find the people with the highest scores that aren’t already subscribing to notifications. You might send them an email next recommending that they subscribe...
  • 59. Spark Streaming: Use the same abstractions for real-time, event streaming. Cluster Computing Thursday, May 1, 14 Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little code.
  • 60. Spark Job Iteration #1 1 second Data Stream… Spark Job Iteration #2 Thursday, May 1, 14 You can specify the granularity, such as all events in 1 second windows, then your Spark job is passed each window of data for processing.
  • 62. val sc = new SparkContext(...) val ssc = new StreamingContext( sc, Seconds(1)) // A DStream that will listen to server:port val lines = ssc.socketTextStream(server, port) // Word Count... val words = lines flatMap { line => line.split("""\W+""") } val pairs = words map (word => (word, 1)) val wordCounts = pairs reduceByKey ((n1, n2) => n1 + n2) wordCounts.print() // print a few counts... Thursday, May 1, 14 This example adapted from the following page on the Spark website: http://spark.apache.org/docs/0.9.0/streaming-programming-guide.html#a-quick-example We create a StreamingContext that wraps a SparkContext (there are alternative ways to construct it...). It will "clump" the events into 1-second intervals. Next we set up a socket to stream text to us from another server and port (one of several ways to ingest data).
  • 63. ssc.socketTextStream(server, port) // Word Count... val words = lines flatMap { line => line.split("""\W+""") } val pairs = words map (word => (word, 1)) val wordCounts = pairs reduceByKey ((n1, n2) => n1 + n2) wordCounts.print() // print a few counts... ssc.start() ssc.awaitTermination() Thursday, May 1, 14 Now the "word count" happens over each interval (aggregation across intervals is also possible), but otherwise it works like before. Once we set up the flow, we start it and wait for it to terminate through some means, such as the server socket closing.
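
The aggregation across intervals mentioned in those notes is done with windowed operations. A minimal sketch using reduceByKeyAndWindow (not from the slides; the server, port, 30-second window, and 1-second slide are arbitrary choices):

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // implicits for pair DStream operations

val sc  = new SparkContext("local[2]", "Windowed Word Count")   // 2 threads: one for the receiver, one for processing
val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines flatMap (line => line.split("""\W+""")) map (word => (word, 1))

// Word counts over a sliding 30-second window, recomputed every second.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (n1: Int, n2: Int) => n1 + n2, Seconds(30), Seconds(1))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()
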
  • 64. MLlib: Machine learning library. Cluster Computing Thursday, May 1, 14 Spark implements many machine learning algorithms, although a lot more are needed compared to more mature tools like Mahout and libraries in Python and R.
  • 65. •Linear regression •Binary classification •Collaborative filtering •Clustering •Others... MLlib Thursday, May 1, 14 Not as full-featured as more mature toolkits, but the Mahout project has announced they are going to port their algorithms to Spark, which include powerful Mathematics, e.g., Matrix support libraries.
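
For a flavor of the MLlib API, here is a minimal clustering sketch, assuming the Spark 1.x MLlib KMeans interface, which takes an RDD of Vectors. The input path, k = 10, and 20 iterations are arbitrary, hypothetical choices:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local", "KMeans Sketch")

// Hypothetical input: one space-separated numeric feature vector per line.
val data = sc.textFile("data/features")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()   // iterative algorithm: keep the training data in memory

val model = KMeans.train(data, 10, 20)   // k = 10 clusters, 20 iterations

// Cost = sum of squared distances from each point to its nearest center.
println("Within-set sum of squared errors: " + model.computeCost(data))
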
  • 66. GraphX: Graphical models and algorithms. Cluster Computing Thursday, May 1, 14 Some problems are more naturally represented as graphs. Extends RDDs to support property graphs with directed edges.
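
A minimal sketch of a GraphX property graph, assuming the Spark 1.x GraphX API; the vertices and edges are made-up examples (pages and the links between them):

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext("local", "GraphX Sketch")

// Vertices: (id, property); here the property is a page URL.
val pages = sc.parallelize(Seq(
  (1L, "wikipedia.org/hadoop"),
  (2L, "wikipedia.org/spark"),
  (3L, "wikipedia.org/hdfs")))

// Directed edges with a property; here, a link count between pages.
val links = sc.parallelize(Seq(
  Edge(1L, 3L, 1),
  Edge(2L, 1L, 2),
  Edge(2L, 3L, 1)))

val graph = Graph(pages, links)

// A simple structural query: how many links point at each page?
graph.inDegrees.collect().foreach(println)
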
  • 67. Thursday, May 1, 14 Why is Spark so good (and Java MapReduce so bad)? Because fundamentally, data analytics is Mathematics and programming tools inspired by Mathematics - like Functional Programming - are ideal tools for working with data. This is why Spark code is so concise, yet powerful. This is why it is a great platform for performance optimizations. This is why Spark is a great platform for higher-level tools, like SQL, graphs, etc. Interest in FP started growing ~10 years ago as a tool to attack concurrency. I believe that data is now driving FP adoption even faster. I know many Java shops that switched to Scala when they adopted tools like Spark and Scalding (https://github.com/twitter/scalding).
  • 68. A flexible, scalable distributed compute platform with concise, powerful APIs and higher-order tools. spark.apache.org Spark Thursday, May 1, 14
  • 69. Why Spark Is the Next Top (Compute) Model @deanwampler polyglotprogramming.com/talks Thursday, May 1, 14 Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted. Image: The London Eye on one side of the Thames, Parliament on the other.