
Scala and big data in ICM. Scoobi, Scalding, Spark, Stratosphere. Scalar 2014

The Tale of the Glorious Lambdas & the Were-Clusterz


  1. THE TALE OF THE GLORIOUS LAMBDAS & THE WERE-CLUSTERZ. Mateusz Fedoryszak (matfed@icm.edu.pl), Michał Oniszczuk (micon@icm.edu.pl)
  2. More than the weather forecast.
  3. MUCH MORE…
  4. WE SPY ON SCIENTISTS
  5. RAW DATA
  6. COMMON MAP OF ACADEMIA
  7. HADOOP. How to read millions of papers?
  8. IN ONE PICTURE: Map Reduce
  9. WORD COUNT IS THE NEW HELLO WORLD
  10. WORD COUNT IN VANILLA MAP-REDUCE

     package org.myorg;

     import java.io.IOException;
     import java.util.*;

     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.conf.*;
     import org.apache.hadoop.io.*;
     import org.apache.hadoop.mapreduce.*;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

     public class WordCount {

       public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();

         public void map(LongWritable key, Text value, Context context)
             throws IOException, InterruptedException {
           String line = value.toString();
           StringTokenizer tokenizer = new StringTokenizer(line);
           while (tokenizer.hasMoreTokens()) {
             word.set(tokenizer.nextToken());
             context.write(word, one);
           }
         }
       }

       public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
         public void reduce(Text key, Iterable<IntWritable> values, Context context)
             throws IOException, InterruptedException {
           int sum = 0;
           for (IntWritable val : values) {
             sum += val.get();
           }
           context.write(key, new IntWritable(sum));
         }
       }

       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "wordcount");
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         job.setMapperClass(Map.class);
         job.setReducerClass(Reduce.class);
         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         job.waitForCompletion(true);
       }
     }
  11. WHAT SHOULD A WORD COUNT LOOK LIKE?

     val lines = List("ala ma kota", "kot ma ale")

     val words = lines.flatMap(_.split(" "))
     val groups = words.groupBy(identity)
     val counts = groups.map(x => (x._1, x._2.length))

     counts.foreach(println)
  12. SCOOBI, SCALDING. Map-Reduce the right way: with lambdas.
  13. WORD COUNT IN PURE SCALA

     val lines = List("ala ma kota", "kot ma ale")

     val words = lines.flatMap(_.split(" "))
     val groups = words.groupBy(identity)
     val counts = groups.map(x => (x._1, x._2.size))

     counts.foreach(println)
  14. WORD COUNT IN SCOOBI

     val lines = fromTextFile("hdfs://in/...")

     val words = lines.mapFlatten(_.split(" "))
     val groups = words.groupBy(identity)
     val counts = groups.map(x => (x._1, x._2.size))

     counts
       .toTextFile("hdfs://out/...", overwrite=true)
       .persist()
  15. BEHIND THE SCENES

     val lines = fromTextFile("hdfs://in/...")

     val words = lines.mapFlatten(_.split(" "))
     val groups = words.groupBy(identity)
     val counts = groups.map(x => (x._1, x._2.length))

     counts
       .toTextFile("hdfs://out/...", overwrite=true)
       .persist()

     (diagram: the flatMap, groupBy, and map steps are compiled into a chain of MapReduce phases: map + reduce, map + reduce, map + reduce)
  16. SCOOBI SNACKS

     – Joins, group-by, etc. baked in (see the sketch below)
     – Static type checking with custom data types and IO
     – One lang to rule them all (and it's THE lang)
     – Easy local testing
     – REPL
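     A minimal sketch of the first snack, in the DList style of the previous slides. Only fromTextFile, map, and persist appear in the deck itself; the join call, the input layout, and the field names are assumptions for illustration, so check the Scoobi docs for the exact API.

       import com.nicta.scoobi.Scoobi._

       // Hypothetical tab-separated inputs: "paperId<TAB>title" and "paperId<TAB>author".
       val titles: DList[(Int, String)] = fromTextFile("hdfs://titles/...")
         .map { line => val Array(id, title) = line.split("\t"); (id.toInt, title) }
       val authors: DList[(Int, String)] = fromTextFile("hdfs://authors/...")
         .map { line => val Array(id, author) = line.split("\t"); (id.toInt, author) }

       // `join` is assumed to be the baked-in equi-join on the key;
       // the result pairs each paper id with its (title, author).
       val joined: DList[(Int, (String, String))] = titles.join(authors)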
  17. WHICH ONE IS THE FRAMEWORK?

     Scoobi               | Scalding
     ---------------------|------------------------------------
     Pure Scala           | Cascading wrapper
     Developed by NICTA   | Developed by Twitter
     Strongly typed API   | Field-based and strongly typed API
                          | Has cooler logo
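     For a taste of the Scalding column, here is the canonical field-based word count, essentially the example from Scalding's own documentation (input and output paths come from command-line args; this is a sketch, not code from the talk):

       import com.twitter.scalding._

       class WordCountJob(args: Args) extends Job(args) {
         // Read lines, emit one 'word field per token, count rows per word.
         TextLine(args("input"))
           .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
           .groupBy('word) { _.size }
           .write(Tsv(args("output")))
       }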
  18. THE NEW BIG DATA ZOO. Most slides that follow are by Matei Zaharia from the Spark team.
  19. SPARK IDEA
  20. MAPREDUCE PROBLEMS…

     (diagram) Iterative jobs: every iteration reads its input from HDFS and writes its output back to HDFS (HDFS read, HDFS write, HDFS read, HDFS write, …). Interactive use: each of query 1, 2, 3, … re-reads the same input from HDFS to produce its result.
  21. … SOLVED WITH SPARK

     (diagram) Iterations pass data through distributed memory instead of HDFS; for interactive use, the input is loaded once (one-time processing) and queries 1, 2, 3, … run against the in-memory copy.
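     A minimal sketch of the pattern the diagram describes, against the standard RDD API (the data and the update step are invented stand-ins for a real iterative algorithm such as PageRank or k-means): load once, cache, iterate in memory.

       import org.apache.spark.SparkContext

       object IterativeSketch {
         def main(args: Array[String]): Unit = {
           val sc = new SparkContext("local[2]", "iterative-sketch")
           // Load (here: generate) the working set once and pin it in memory.
           val points = sc.parallelize(1 to 1000000).map(_ * 0.001).cache()
           var w = 0.0
           for (_ <- 1 to 10) {
             // Each pass scans the cached RDD; HDFS is not touched between iterations.
             w = points.map(x => x * w + 1.0).reduce(_ + _) / points.count()
           }
           println(w)
           sc.stop()
         }
       }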
  22. RESILIENT DISTRIBUTED DATASETS (RDDS)

     Restricted form of distributed shared memory
     » Partitioned data
     » Higher-level operations (map, filter, join, …)
     » No side-effects

     Efficient fault recovery using lineage
     » List of operations
     » Recompute lost partitions on failure
     » No cost if nothing fails
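     The lineage bullets are directly observable in the API: every RDD can print the chain of operations Spark would replay to rebuild a lost partition. A small sketch for the spark-shell, where sc is predefined (the input path is a placeholder):

       import org.apache.spark.SparkContext._  // pair-RDD operations (needed on older Spark versions)

       val lines = sc.textFile("hdfs://in/...")            // base RDD
       val words = lines.flatMap(_.split(" "))             // derived via flatMap
       val counts = words.map((_, 1)).reduceByKey(_ + _)   // derived via map + reduceByKey

       // Prints the lineage used for fault recovery; a lost partition is
       // recomputed by replaying these steps on the relevant input blocks.
       println(counts.toDebugString)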
  23. API

     Scala, Python, Java + REPL

     map, reduce, filter, groupBy, join, …
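     The listed operations on toy data, REPL-style (runs in the spark-shell; the datasets are invented for illustration):

       val users  = sc.parallelize(Seq((1, "ada"), (2, "bob")))
       val visits = sc.parallelize(Seq((1, "home"), (1, "docs"), (2, "home")))

       users.filter { case (_, name) => name.length >= 3 }  // filter
       users.join(visits)       // join:    RDD[(Int, (String, String))]
       visits.groupByKey()      // groupBy: RDD[(Int, Iterable[String])]
       visits.map(_ => 1).reduce(_ + _)   // map + reduce: 3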
  24. SPARK EXAMPLES
  25. EXAMPLE: LOG MINING. Load error messages from a log into memory, then interactively search for various patterns. (diagram: one Master and three Workers)
  26. EXAMPLE: LOG MINING (slides 26-32 build this up step by step)

     lines = spark.textFile("hdfs://...")
     errors = lines.filter(_.startsWith("ERROR"))
     messages = errors.map(_.split('\t')(2))
     cachedMsgs = messages.cache()

     cachedMsgs.filter(_.contains("foo")).count
     cachedMsgs.filter(_.contains("bar")).count

     (diagram: the Master turns each count into tasks for the three Workers, each reading one input Block; the Workers return results and keep the filtered messages in Caches 1-3, so later counts hit memory)

     1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
  33. PAGERANK PERFORMANCE. Time per iteration: Hadoop 170.75 s, Spark 23.01 s.
  34. SPARK LIBRARIES
  35. SPARK'S ZOO: Spark, Spark Streaming (real-time), GraphX (graph), Shark (SQL), MLlib (machine learning), BlinkDB, …
  36. ALL IN ONE

     val points = sc.runSql[Double, Double](
       "select latitude, longitude from historic_tweets")

     val model = KMeans.train(points, 10)

     sc.twitterStream(...)
       .map(t => (model.closestCenter(t.location), 1))
       .reduceByWindow("5s", _ + _)
  37. SPARK CONCLUSION

     • In-memory processing
     • Libraries
     • Increasingly popular

     (logos: Spark, Spark Streaming, GraphX, Shark, MLlib, BlinkDB, …)
  38. USEFUL LINKS

     • spark.apache.org
     • spark-summit.org (videos & online hands-on tutorials)
  39. Stratosphere: like Spark but less popular and less mature.
  40. CONCLUSION

     • Big data today is where relational databases were in the 1980s
     • Scala goes well with big data
  41. THANK YOU! Q&A
  42. SCALABILITY. Iteration time (s) by number of machines:

     Logistic Regression:
       machines        25    50    100
       Hadoop          184   111   76
       HadoopBinMem    116   80    62
       Spark           15    6     3

     K-Means:
       machines        25    50    100
       Hadoop          274   157   106
       HadoopBinMem    197   121   87
       Spark           143   61    33
  43. INSUFFICIENT RAM. Iteration time (s) vs percent of working set in memory:

       in memory:   0%     25%    50%    75%    100%
       time (s):    68.8   58.1   40.7   29.7   11.5
  44. PERFORMANCE (three benchmark charts)

     • SQL, response time (s): Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem)
     • Graph, response time (min): Hadoop, Giraph, GraphX
     • Streaming, throughput (MB/s/node): Storm, Spark Streaming
