Scala and big data in ICM. Scoobie, Scalding, Spark, Stratosphere. Scalar 2014

The Tale of the Glorious Lambdas & the Were-Clusterz

Published in: Data & Analytics

Transcript

  • 1. THE TALE OF THE GLORIOUS LAMBDAS & THE WERE-CLUSTERZ. Mateusz Fedoryszak (matfed@icm.edu.pl), Michał Oniszczuk (micon@icm.edu.pl)
  • 2. More than the weather forecast.
  • 3. MUCH MORE…
  • 4. WE SPY ON SCIENTISTS
  • 5. RAW DATA
  • 6. COMMON MAP OF ACADEMIA
  • 7. HADOOP How to read millions of papers?
  • 8. IN ONE PICTURE Map Reduce
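    One way to read the picture: map turns each input record into key-value pairs, and reduce folds together all values that share a key. A conceptual sketch of that contract as plain Scala function types (not Hadoop's actual API):

        // The MapReduce contract as plain Scala function types (a conceptual
        // sketch, not Hadoop's API):
        type Mapper[K1, V1, K2, V2]  = (K1, V1) => Seq[(K2, V2)]
        type Reducer[K2, V2, K3, V3] = (K2, Seq[V2]) => Seq[(K3, V3)]

        // Word count in those terms: the mapper emits (word, 1) pairs,
        // the reducer sums the ones for each word.
        val mapper: Mapper[Long, String, String, Int]  =
          (_, line) => line.split(" ").map((_, 1)).toSeq
        val reducer: Reducer[String, Int, String, Int] =
          (word, ones) => Seq((word, ones.sum))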
  • 9. WORD COUNT IS THE NEW HELLO WORLD
  • 10. WORD COUNT IN VANILLA MAP-REDUCE

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
          }
        }
  • 11. WHAT SHOULD A WORD COUNT LOOK LIKE?

        val lines = List("ala ma kota", "kot ma ale")

        val words  = lines.flatMap(_.split(" "))
        val groups = words.groupBy(identity)
        val counts = groups.map(x => (x._1, x._2.length))

        counts.foreach(println)
  • 12. SCOOBI, SCALDING Map–Reduce the right way — with lambdas.
  • 13. WORD COUNT IN PURE SCALA

        val lines = List("ala ma kota", "kot ma ale")

        val words  = lines.flatMap(_.split(" "))
        val groups = words.groupBy(identity)
        val counts = groups.map(x => (x._1, x._2.size))

        counts.foreach(println)
  • 14. WORD COUNT IN SCOOBI

        val lines = fromTextFile("hdfs://in/...")

        val words  = lines.mapFlatten(_.split(" "))
        val groups = words.groupBy(identity)
        val counts = groups.map(x => (x._1, x._2.size))

        counts
          .toTextFile("hdfs://out/...", overwrite = true)
          .persist()
  • 15. BEHIND THE SCENES

        val lines = fromTextFile("hdfs://in/...")

        val words  = lines.mapFlatten(_.split(" "))
        val groups = words.groupBy(identity)
        val counts = groups.map(x => (x._1, x._2.length))

        counts
          .toTextFile("hdfs://out/...", overwrite = true)
          .persist()

        [Diagram: the flatMap / groupBy / map pipeline is compiled into a chain of map-reduce stages]
  • 16. SCOOBI SNACKS
        – Joins, group-by, etc. baked in
        – Static type checking with custom data types and IO
        – One lang to rule them all (and it's THE lang)
        – Easy local testing
        – REPL
  • 17. WHICH ONE IS THE FRAMEWORK?

        Scoobi               | Scalding
        Pure Scala           | Cascading wrapper
        Developed by NICTA   | Developed by Twitter
        Strongly typed API   | Field-based and strongly typed API
        Has cooler logo
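    For a side-by-side feel, here is a word count sketched with Scalding's typed API, mirroring the Scoobi version above. The job class name and the "input"/"output" argument keys are illustrative, not taken from the deck:

        import com.twitter.scalding._

        class WordCountJob(args: Args) extends Job(args) {
          TypedPipe.from(TextLine(args("input")))   // one String per line
            .flatMap(_.split("""\s+"""))            // split lines into words
            .groupBy(identity)                      // group equal words together
            .size                                   // count each group
            .write(TypedTsv[(String, Long)](args("output")))
        }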
  • 18. THE NEW BIG DATA ZOO Most slides are by Matei Zaharia from the Spark team
  • 19. SPARK IDEA
  • 20. MAPREDUCE PROBLEMS…
        [Diagram: iterative jobs (iter. 1, iter. 2, …) write to and re-read from HDFS between every step, and each query (query 1-3) re-reads the input from HDFS to produce its result]
  • 21. … SOLVED WITH SPARK
        [Diagram: the same workloads with Spark: iterations share data through distributed memory, and the input is read once for one-time processing of many queries]
  • 22. RESILIENT DISTRIBUTED DATASETS (RDDS)
        Restricted form of distributed shared memory
        » Partitioned data
        » Higher-level operations (map, filter, join, …)
        » No side-effects
        Efficient fault recovery using lineage
        » List of operations
        » Recompute lost partitions on failure
        » No cost if nothing fails
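    A small illustration of both halves of that slide in the Spark shell: cache() keeps the partitions in memory, and toDebugString prints the lineage (the list of operations) that Spark would replay to recompute a lost partition. The paths are placeholders; the snippet is not from the deck:

        // In the Spark shell, `sc` is the SparkContext provided by the REPL.
        val errors = sc.textFile("hdfs://in/...")      // partitioned data on HDFS
          .filter(_.startsWith("ERROR"))               // higher-level op, no side effects
          .cache()                                     // keep the partitions in memory

        errors.count()                                 // first action materialises the cache

        // The lineage Spark keeps for fault recovery:
        println(errors.toDebugString)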
  • 23. API: Scala, Python, Java + REPL
        Operations: map, reduce, filter, groupBy, join, …
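    To tie this API back to the earlier word counts, the same job on RDDs might be sketched as below (paths are placeholders; not from the deck):

        val lines  = sc.textFile("hdfs://in/...")
        val counts = lines
          .flatMap(_.split(" "))        // words
          .map(word => (word, 1))       // pair each word with a count of 1
          .reduceByKey(_ + _)           // sum the counts per word

        counts.saveAsTextFile("hdfs://out/...")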
  • 24. SPARK EXAMPLES
  • 25. EXAMPLE: LOG MINING
        Load error messages from a log into memory, then interactively search for various patterns.
        [Diagram: a Master node and three Worker nodes]
  • 26. EXAMPLE: LOG MINING
        Load error messages from a log into memory, then interactively search for various patterns.

        lines      = spark.textFile("hdfs://...")
        errors     = lines.filter(_.startsWith("ERROR"))
        messages   = errors.map(_.split('\t')(2))
        cachedMsgs = messages.cache()

        [Diagram: a Master node and three Worker nodes]
  • 27. EXAMPLE: LOG MINING (same code; the diagram builds up): the first interactive query runs on the cached data:

        cachedMsgs.filter(_.contains("foo")).count

  • 28. EXAMPLE: LOG MINING (cont.): the input file is split into Block 1-3 across the Workers.
  • 29. EXAMPLE: LOG MINING (cont.): the Master ships tasks for the query to the Workers.
  • 30. EXAMPLE: LOG MINING (cont.): each Worker caches its partition of messages (Cache 1-3).
  • 31. EXAMPLE: LOG MINING (cont.): the Workers return results to the Master.
  • 32. EXAMPLE: LOG MINING (result): a second query, cachedMsgs.filter(_.contains("bar")).count, runs on the same cached data. Scanning 1 TB of data takes 5-7 s (vs. 170 s for on-disk data).
  • 33. PAGERANK PERFORMANCE
        [Chart: time per iteration (s): Hadoop 170.75, Spark 23.01]
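    For context on why the gap is so large: PageRank is iterative, so Spark can cache the link structure once and reuse it on every iteration instead of re-reading it from HDFS. A rough sketch on RDDs, loosely following the well-known Spark PageRank example; the input path, the 10 iterations, and the 0.15/0.85 damping constants are illustrative, not from the deck:

        // adjacency list: each line is "page neighbour"
        val links = sc.textFile("hdfs://links/...")
          .map { line => val p = line.split("""\s+"""); (p(0), p(1)) }
          .groupByKey()
          .cache()                                   // reused every iteration

        var ranks = links.mapValues(_ => 1.0)

        for (_ <- 1 to 10) {
          val contribs = links.join(ranks).values.flatMap { case (neighbours, rank) =>
            neighbours.map(n => (n, rank / neighbours.size))
          }
          ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
        }

        ranks.saveAsTextFile("hdfs://ranks/...")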
  • 34. SPARK LIBRARIES
  • 35. SPARK'S ZOO
        Spark core, plus: Spark Streaming (real-time), GraphX (graph), Shark (SQL), MLlib (machine learning), BlinkDB, …
  • 36. ALL IN ONE

        val points = sc.runSql[Double, Double](
          "select latitude, longitude from historic_tweets")

        val model = KMeans.train(points, 10)

        sc.twitterStream(...)
          .map(t => (model.closestCenter(t.location), 1))
          .reduceByWindow("5s", _ + _)
  • 37. SPARK CONCLUSION
        • In-memory processing
        • Libraries: Spark Streaming, GraphX, Shark, MLlib, BlinkDB, …
        • Increasingly popular
  • 38. USEFUL LINKS
        • spark.apache.org
        • spark-summit.org (videos & online hands-on tutorials)
  • 39. Like Spark but less popular and less mature
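    The deck's title also lists Stratosphere, which this slide presumably refers to; Stratosphere later became Apache Flink. For a flavour of that line of APIs, here is a word count sketched with the Flink-style Scala DataSet API rather than the original 2014 Stratosphere API (paths are placeholders; not from the deck):

        import org.apache.flink.api.scala._

        object WordCount {
          def main(args: Array[String]): Unit = {
            val env = ExecutionEnvironment.getExecutionEnvironment

            val counts = env.readTextFile("hdfs://in/...")
              .flatMap(_.toLowerCase.split("""\s+"""))
              .map((_, 1))
              .groupBy(0)   // group by the word field
              .sum(1)       // sum the counts

            counts.writeAsCsv("hdfs://out/...")
            env.execute("wordcount")
          }
        }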
  • 40. CONCLUSION
        • Big data tooling today is roughly where RDBMSs were in the '80s
        • Scala goes well with big data
  • 41. THANK YOU! Q&A
  • 42. SCALABILITY
        [Charts: iteration time (s) vs. number of machines (25, 50, 100) for Logistic Regression and K-Means, comparing Hadoop, HadoopBinMem, and Spark; Spark has the lowest times at every cluster size]
  • 43. INSUFFICIENT RAM
        [Chart: iteration time (s) vs. percent of working set in memory: roughly 68.8 s with 0% in memory, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully in memory]
  • 44. PERFORMANCE
        [Charts: SQL response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem); graph response time (min) for Hadoop, Giraph, GraphX; streaming throughput (MB/s/node) for Storm vs. Spark Streaming]
