Apache Spark 101 [in 50 min]

"Apache Spark™ is a fast and general engine for large-scale data processing."" Above statement is taken from Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100 % true, tell still little to the wondering noob.
Why take interest in Apache Spark? Apache Spark promise being up to 100x faster than Hadoop MapReduce in certain scenarios. It provide comprehensible programming model (familiar to everyone who is used to functional programming) and vast ecosystem of tools.
In my talk I will try to reveal secrets of Apache Spark for the very beginners.
We will do first quick introduction to the set of problems commonly known as BigData: what they try to solve, what are their obstacles and challenges and how those can be addressed. We will quickly take a pick on MapReduce: theory and implementation. We will then move to Apache Spark. We will see what was the main factor that drove its creators to introduce yet another large-scala processing engine. We will see how it works, what are its main advantages. Presentation will be mix of slides and code examples.

"Apache Spark™ is a fast and general engine for large-scale data processing."" Above statement is taken from Apache Spark welcome page. It's one of those definitions that, while describing the product in one sentence and being 100 % true, tell still little to the wondering noob.
Why take interest in Apache Spark? Apache Spark promise being up to 100x faster than Hadoop MapReduce in certain scenarios. It provide comprehensible programming model (familiar to everyone who is used to functional programming) and vast ecosystem of tools.
In my talk I will try to reveal secrets of Apache Spark for the very beginners.
We will do first quick introduction to the set of problems commonly known as BigData: what they try to solve, what are their obstacles and challenges and how those can be addressed. We will quickly take a pick on MapReduce: theory and implementation. We will then move to Apache Spark. We will see what was the main factor that drove its creators to introduce yet another large-scala processing engine. We will see how it works, what are its main advantages. Presentation will be mix of slides and code examples.

More Related Content

Related Books

Free with a 14 day trial from Scribd

See all

Related Audiobooks

Free with a 14 day trial from Scribd

See all

Apache Spark 101 [in 50 min]

  1. 1. twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark 101 Large Scale Data Processing by Paweł Szulc
  2. 2. twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark 101 Large Scale Data Processing by Paweł Szulc email: paul.szulc@gmail.com
  3. 3. twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark 101 Large Scale Data Processing by Paweł Szulc email: paul.szulc@gmail.com blog: http://www.rabbitonweb.com
  4. 4. twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark 101 Large Scale Data Processing by Paweł Szulc (@rabbitonweb) email: paul.szulc@gmail.com blog: http://www.rabbitonweb.com
  5. 5. twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark 101 Large Scale Data Processing by Paweł Szulc (@rabbitonweb) [@ApacheSpark] email: paul.szulc@gmail.com blog: http://www.rabbitonweb.com
  6. 6. twitter: @rabbitonweb, email: paul.szulc@gmail.com Apache Spark 101 Large Scale Data Processing by Paweł Szulc (@rabbitonweb) [@ApacheSpark] email: paul.szulc@gmail.com blog: http://www.rabbitonweb.com IN 50 MINUTES
  7. 7. twitter: @rabbitonweb, email: paul.szulc@gmail.com Why?
  8. 8. twitter: @rabbitonweb, email: paul.szulc@gmail.com Why? buzzword: Big Data
  9. 9. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like...
  10. 10. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex:
  11. 11. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it,
  12. 12. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it,
  13. 13. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it,
  14. 14. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is like... “Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it”
  15. 15. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about...
  16. 16. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :)
  17. 17. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :) ● It is said that 2.5 exabytes (2.5×10^18) of data is being created around the world every single day
  18. 18. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... “Every two days, we generate as much information as we did from the dawn of civilization until 2003” -- Eric Schmidt Former CEO Google
  19. 19. twitter: @rabbitonweb, email: paul.szulc@gmail.com source: http://papyrus.greenville.edu/2014/03/selfiesteem/
  20. 20. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :) ● It is said that 2.5 exabytes (2.5×10^18) of data is being created around the world every single day
  21. 21. twitter: @rabbitonweb, email: paul.szulc@gmail.com Big Data is all about... ● well, the data :) ● It is said that 2.5 exabytes (2.5×10^18) of data is being created around the world every single day ● It's a volume at which you can no longer use standard tools and methods of evaluation
  22. 22. twitter: @rabbitonweb, email: paul.szulc@gmail.com Challenges of Big Data ● The gathering ● Processing and discovery ● Present it to business ● Hardware and network failures
  23. 23. twitter: @rabbitonweb, email: paul.szulc@gmail.com What was before?
  24. 24. twitter: @rabbitonweb, email: paul.szulc@gmail.com To the rescue MAP REDUCE
  25. 25. twitter: @rabbitonweb, email: paul.szulc@gmail.com To the rescue MAP REDUCE “'MapReduce' is a framework for processing parallelizable problems across huge datasets using a cluster, taking into consideration scalability and fault-tolerance”
  26. 26. twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (1) MapReduce consists of a sequence of two phases:
  27. 27. twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (1) MapReduce consists of a sequence of two phases: 1. Map
  28. 28. twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (1) MapReduce consists of a sequence of two phases: 1. Map 2. Reduce
  29. 29. twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (1) MapReduce consists of a sequence of two phases: 1. Map 2. Reduce
  30. 30. twitter: @rabbitonweb, email: paul.szulc@gmail.com MapReduce - phases (2) MapReduce consists of a sequence of two phases: 1. Map 2. Reduce
  31. 31. twitter: @rabbitonweb, email: paul.szulc@gmail.com Map Reduce - key/value “In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values.
  32. 32. twitter: @rabbitonweb, email: paul.szulc@gmail.com Map Reduce - key/value “In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value.”
  33. 33. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word Count ● The “Hello World” of the Big Data world. ● For an initial input of multiple lines, extract all words with their number of occurrences To be or not to be Let it be Be me It must be Let it be be 6 to 2 let 2 or 1 not 1 must 1 me 1
  34. 34. twitter: @rabbitonweb, email: paul.szulc@gmail.com Input To be or not to be Let it be Be me It must be Let it be
  35. 35. twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me
  36. 36. twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1
  37. 37. twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1
  38. 38. twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling Reducing To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1 be 6 to 2 or 1 not 1 let 2 must 1 me 1
  39. 39. twitter: @rabbitonweb, email: paul.szulc@gmail.com Input Splitting Mapping Shuffling Reducing Final result To be or not to be Let it be Be me It must be Let it be To be or not to be Let it be It must be Let it be Be me to 1 be 1 or 1 not 1 to 1 be 1 let 1 it 1 be 1 be 1 me 1 let 1 it 1 be 1 it 1 must 1 be 1 be 1 be 1 be 1 be 1 be 1 be 1 to 1 to 1 or 1 not 1 let 1 let 1 must 1 me 1 be 6 to 2 or 1 not 1 let 2 must 1 me 1 be 6 to 2 let 2 or 1 not 1 must 1 me 1
  40. 40. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count - pseudo-code function map(String name, String document): for each word w in document: emit (w, 1)
  41. 41. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count - pseudo-code function map(String name, String document): for each word w in document: emit (w, 1) function reduce(String word, Iterator partialCounts): sum = 0 for each pc in partialCounts: sum += ParseInt(pc) emit (word, sum)
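The pseudo-code above can be exercised without any cluster. Below is a minimal, single-machine Scala sketch of the same idea (purely illustrative; the object and method names are made up, and the shuffle step is simulated with a local groupBy):

  object WordCountPhases {
    // map phase: emit a (word, 1) pair for every word in the document
    def mapPhase(document: String): Seq[(String, Int)] =
      document.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty).map(w => (w, 1))

    // reduce phase: sum the partial counts that arrived for one key
    def reducePhase(word: String, partialCounts: Seq[Int]): (String, Int) =
      (word, partialCounts.sum)

    def main(args: Array[String]): Unit = {
      val lines    = Seq("To be or not to be", "Let it be", "Be me", "It must be", "Let it be")
      val mapped   = lines.flatMap(mapPhase)                 // map phase
      val shuffled = mapped.groupBy(_._1)                    // simulated shuffle: group pairs by key
      val reduced  = shuffled.map { case (w, pairs) => reducePhase(w, pairs.map(_._2)) }
      reduced.toSeq.sortBy(-_._2).foreach(println)           // e.g. (be,6), (it,3), (to,2), ...
    }
  }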
  42. 42. twitter: @rabbitonweb, email: paul.szulc@gmail.com Why?
  43. 43. twitter: @rabbitonweb, email: paul.szulc@gmail.com Why Apache Spark? We have had an open-source MapReduce implementation (Hadoop) running successfully for the last 10 years. Why bother?
  44. 44. twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers
  45. 45. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count - revisited function map(String name, String document): for each word w in document: emit (w, 1) function reduce(String word, Iterator partialCounts): sum = 0 for each pc in partialCounts: sum += ParseInt(pc) emit (word, sum)
  46. 46. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count: Hadoop implementation
  public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) { sum += val.get(); }
        context.write(key, new IntWritable(sum));
      }
    }
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "wordcount");
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setInputFormatClass(TextInputFormat.class);
  47. 47. twitter: @rabbitonweb, email: paul.szulc@gmail.com Hadoop addressing the issue
  48. 48. twitter: @rabbitonweb, email: paul.szulc@gmail.com Hadoop addressing the issue ● Hive - SQL on Hadoop Cluster
  49. 49. twitter: @rabbitonweb, email: paul.szulc@gmail.com Hadoop addressing the issue ● Hive - SQL on Hadoop Cluster, ● Declarative language
  50. 50. twitter: @rabbitonweb, email: paul.szulc@gmail.com Hadoop addressing the issue ● Hive - SQL on Hadoop Cluster, ● Declarative language ● But…
  51. 51. twitter: @rabbitonweb, email: paul.szulc@gmail.com Declarative? select count(distinct user_id) from logs;
  52. 52. twitter: @rabbitonweb, email: paul.szulc@gmail.com Declarative? select count(distinct user_id) from logs;
  53. 53. twitter: @rabbitonweb, email: paul.szulc@gmail.com Declarative? select count(distinct user_id) from logs; select count(*) from (select distinct user_id from logs);
  54. 54. twitter: @rabbitonweb, email: paul.szulc@gmail.com Declarative? select count(distinct user_id) from logs; select count(*) from (select distinct user_id from logs);
  55. 55. twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers
  56. 56. twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers 2. It suffers from a number of performance issues
  57. 57. twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Map-Reduce pair combination
  58. 58. twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Map-Reduce pair combination ● Output saved to the file
  59. 59. twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Map-Reduce pair combination ● Output saved to the file ● Iterative algorithms go through IO path again and again
  60. 60. twitter: @rabbitonweb, email: paul.szulc@gmail.com Performance issues ● Map-Reduce pair combination ● Output saved to the file ● Iterative algorithms go through IO path again and again ● Poor API (key, value), even basic join requires expensive code
  61. 61. twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers 2. It suffers from a number of performance issues
  62. 62. twitter: @rabbitonweb, email: paul.szulc@gmail.com Problems with Map Reduce 1. MapReduce provides a difficult programming model for developers 2. It suffers from a number of performance issues 3. While batch-mode analysis is still important, reacting to events as they arrive has become more important (it lacks support for “almost” real-time processing)
  63. 63. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark to the rescue
  64. 64. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark to the rescue 1. Intuitive programming model
  65. 65. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again
  66. 66. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines Scala solution
  67. 67. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) Scala solution
  68. 68. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq Scala solution
  69. 69. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq .groupBy(word => word) Scala solution
  70. 70. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq .groupBy(word => word) .map { case (word, group) => (word, group.size) } Scala solution
  71. 71. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq .groupBy(word => word) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) Scala solution Spark solution (in Scala language)
  72. 72. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq .groupBy(word => word) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) Scala solution Spark solution (in Scala language)
  73. 73. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq .groupBy(word => word) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)) Scala solution Spark solution (in Scala language)
  74. 74. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq .groupBy(word => word) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)) .groupBy(word => word) Scala solution Spark solution (in Scala language)
  75. 75. twitter: @rabbitonweb, email: paul.szulc@gmail.com Word count once again val wc = scala.io.Source.fromFile(args(0)).getLines .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)).toSeq .groupBy(word => word) .map { case (word, group) => (word, group.size) } val wc = new SparkContext(“local”, “Word Count”).textFile(args(0)) .map(line => line.toLowerCase) .flatMap(line => line.split(“ ”)) .groupBy(word => word) .map { case (word, group) => (word, group.size) } Scala solution Spark solution (in Scala language)
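For reference, the more idiomatic Spark formulation keys each word and uses reduceByKey, which pre-aggregates counts on every partition before the shuffle. This is a sketch rather than the speaker's code; it assumes a local master and an input path passed as args(0):

  import org.apache.spark.{SparkConf, SparkContext}

  object SparkWordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setMaster("local[*]").setAppName("Word Count")
      val sc   = new SparkContext(conf)
      val counts = sc.textFile(args(0))
        .flatMap(_.toLowerCase.split(" "))
        .filter(_.nonEmpty)
        .map(word => (word, 1))
        .reduceByKey(_ + _)            // partial sums are combined locally before shuffling
      counts.collect().foreach(println)
      sc.stop()
    }
  }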
  76. 76. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark to the rescue 1. Intuitive programming model
  77. 77. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark to the rescue 1. Intuitive programming model 2. Performance boost
  78. 78. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle
  79. 79. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map
  80. 80. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy
  81. 81. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map
  82. 82. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey
  83. 83. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey task
  84. 84. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey task
  85. 85. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey task
  86. 86. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey task Wait for calculations on all partitions before moving on
  87. 87. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey task
  88. 88. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey task task
  89. 89. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey
  90. 90. twitter: @rabbitonweb, email: paul.szulc@gmail.com stage1 Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey
  91. 91. twitter: @rabbitonweb, email: paul.szulc@gmail.com stage1 stage2 Spark performance - not tied to map-reduce cycle map groupBy map reduceByKey
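One way to see those stage boundaries is to print an RDD's lineage. In the sketch below (which assumes an existing SparkContext named sc and a placeholder file "logs.txt"), the narrow transformations are pipelined into a single stage, while reduceByKey introduces a shuffle and therefore a new stage:

  val counts = sc.textFile("logs.txt")
    .map(_.toLowerCase)              // narrow: stays in the same stage
    .flatMap(_.split(" "))           // narrow: still the same stage
    .map(w => (w, 1))
    .reduceByKey(_ + _)              // wide: shuffle -> new stage
  println(counts.toDebugString)      // the lineage shows a ShuffledRDD where the new stage begins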
  92. 92. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - shuffle optimization map groupBy
  93. 93. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - shuffle optimization map groupBy
  94. 94. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - shuffle optimization map groupBy join
  95. 95. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - shuffle optimization map groupBy join
  96. 96. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - shuffle optimization map groupBy join Optimization: shuffle avoided if data is already partitioned
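A sketch of that optimization (the file names and the comma-separated key layout are hypothetical): pre-partitioning one side with a known partitioner and persisting it lets a later join reuse that partitioning instead of shuffling the same data again.

  import org.apache.spark.HashPartitioner

  val users = sc.textFile("users.txt")
    .map(line => (line.split(",")(0), line))    // key by user id
    .partitionBy(new HashPartitioner(8))        // fix the partitioning up front
    .persist()                                  // keep the partitioned copy around

  val events = sc.textFile("events.txt")
    .map(line => (line.split(",")(0), line))

  // users is already hash-partitioned, so the join only needs to shuffle events
  val joined = users.join(events)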
  97. 97. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - caching
  98. 98. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - caching
  99. 99. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - caching
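What caching buys is easiest to see in an iterative job. A sketch (the input path and the computation are made up): without persist(), every pass of the loop would re-read and re-parse the file.

  import org.apache.spark.storage.StorageLevel

  val points = sc.textFile("points.txt")
    .map(_.split(",").map(_.toDouble))
    .persist(StorageLevel.MEMORY_ONLY)      // equivalent to .cache()

  var result = 0.0
  for (i <- 1 to 10) {
    // each iteration reuses the cached partitions instead of hitting the storage again
    result = points.map(p => p.sum * i).reduce(_ + _)
  }
  println(result)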
  100. 100. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - vs Hadoop (1) “Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.”
  101. 101. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - vs Hadoop (3) “(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records).
  102. 102. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - vs Hadoop (3) “(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes.
  103. 103. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - vs Hadoop (3) “(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes.
  104. 104. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - vs Hadoop (3) “(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines.
  105. 105. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark performance - vs Hadoop (3) “(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All (...) without using Spark’s in-memory cache.”
  106. 106. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark to the rescue 1. Intuitive programming model 2. Performance boost
  107. 107. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark to the rescue 1. Intuitive programming model 2. Performance boost 3. Spark Streaming
  108. 108. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark to the rescue 1. Intuitive programming model 2. Performance boost 3. Spark Streaming ○ but also: graphs, machine learning and SQL
  109. 109. twitter: @rabbitonweb, email: paul.szulc@gmail.com How?
  110. 110. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Cluster (Standalone, Yarn, Mesos)
  111. 111. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos)
  112. 112. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) SPARK API: 1. Scala 2. Java 3. Python
  113. 113. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) SPARK API: 1. Scala 2. Java 3. Python Master
  114. 114. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master
  115. 115. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt”
  116. 116. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master)
  117. 117. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf)
  118. 118. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) Executor 1 Executor 2 Executor 3
  119. 119. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) Executor 1 Executor 2 Executor 3 HDFS, GlusterFS, locality
  120. 120. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  121. 121. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1T2T3
  122. 122. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1T2T3
  123. 123. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2T3
  124. 124. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  125. 125. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  126. 126. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  127. 127. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  128. 128. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  129. 129. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  130. 130. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  131. 131. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  132. 132. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1T2T3
  133. 133. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1T2T3
  134. 134. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2T3
  135. 135. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  136. 136. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  137. 137. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  138. 138. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  139. 139. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 T1 T2 T3
  140. 140. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 Executor 3 (DEAD) T1 T2
  141. 141. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 T1 T2
  142. 142. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 T1 T2 T3
  143. 143. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 T1 T2 T3
  144. 144. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 T1 T2 T3
  145. 145. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 T1 T2 T3
  146. 146. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 T1 T2 T3
  147. 147. twitter: @rabbitonweb, email: paul.szulc@gmail.com The Big Picture Driver Program Cluster (Standalone, Yarn, Mesos) Master val master = “spark://host:pt” val conf = new SparkConf() .setMaster(master) val sc = new SparkContext (conf) val logs = sc.textFile(“logs.txt”) println(logs.count()) Executor 1 Executor 2 T1 T2 T3
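Pieced together, the driver program sketched across these frames looks roughly like this (the master URL and the file name are the placeholders from the slides; the object and app names are made up):

  import org.apache.spark.{SparkConf, SparkContext}

  object BigPicture {
    def main(args: Array[String]): Unit = {
      val master = "spark://host:pt"                    // placeholder master URL from the slides
      val conf   = new SparkConf().setMaster(master).setAppName("Big Picture")
      val sc     = new SparkContext(conf)               // the driver connects to the master
      val logs   = sc.textFile("logs.txt")              // partitions are read close to the data
      println(logs.count())                             // action: tasks run on the executors
      sc.stop()
    }
  }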
  148. 148. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition
  149. 149. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset
  150. 150. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset Resilient - if data is lost, data can be recreated
  151. 151. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset Resilient - if data is lost, data can be recreated Distributed - stored in nodes among the cluster
  152. 152. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - the definition RDD stands for resilient distributed dataset Resilient - if data is lost, data can be recreated Distributed - stored in nodes among the cluster Dataset - initial data comes from a file or can be created programmatically
  153. 153. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example
  154. 154. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("hdfs://logs.txt")
  155. 155. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("hdfs://logs.txt") From Hadoop Distributed File System
  156. 156. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("hdfs://logs.txt") From Hadoop Distributed File System This is the RDD
  157. 157. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("/home/rabbit/logs.txt") From local file system (must be available on executors) This is the RDD
  158. 158. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.parallelize(List(1, 2, 3, 4)) Programmatically from a collection of elements This is the RDD
  159. 159. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt")
  160. 160. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase)
  161. 161. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) Creates a new RDD
  162. 162. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”))
  163. 163. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) And yet another RDD
  164. 164. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) And yet another RDD Performance Alert?!?!
  165. 165. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - Operations 1. Transformations a. Map b. Filter c. FlatMap d. Sample e. Union f. Intersect g. Distinct h. GroupByKey i. …. 2. Actions a. Reduce b. Collect c. Count d. First e. Take(n) f. TakeSample g. SaveAsTextFile h. ….
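A few of the listed operations in one place, as a toy sketch (assumes an existing SparkContext named sc; the output path is made up):

  val a = sc.parallelize(Seq(1, 2, 3, 4))
  val b = sc.parallelize(Seq(3, 4, 5, 6))
  val combined = a.union(b).distinct()        // transformations: lazy, nothing runs yet
  println(combined.count())                   // action: 6
  println(combined.take(3).mkString(", "))    // action: any three elements
  combined.saveAsTextFile("/tmp/combined")    // action: writes one part-file per partition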
  166. 166. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”))
  167. 167. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) val numberOfErrors = errors.count
  168. 168. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) val numberOfErrors = errors.count This will trigger the computation
  169. 169. twitter: @rabbitonweb, email: paul.szulc@gmail.com RDD - example val logs = sc.textFile("logs.txt") val lcLogs = logs.map(_.toLowerCase) val errors = lcLogs.filter(_.contains(“error”)) val numberOfErrors = errors.count This will trigger the computation This will be the calculated value (a Long)
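One more detail worth knowing (the file name is a placeholder): every action re-evaluates the lineage from the source unless the RDD is cached, so running several actions on errors benefits from cache():

  val errors = sc.textFile("logs.txt")
    .map(_.toLowerCase)
    .filter(_.contains("error"))
    .cache()                                  // keep the filtered lines after the first action

  val numberOfErrors = errors.count()         // 1st action: reads the file, fills the cache
  val firstFive      = errors.take(5)         // 2nd action: served from the cached partitions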
  170. 170. twitter: @rabbitonweb, email: paul.szulc@gmail.com DEMO (1) with Spark REPL
  171. 171. twitter: @rabbitonweb, email: paul.szulc@gmail.com Spark Stack
  172. 172. Spark Stack
  173. 173. Why Spark Streaming
  174. 174. Why Spark Streaming A need to process data in almost real-time ● monitoring ● web logs analysis ● fraud detection ● online ads
  175. 175. Why Spark Streaming A need to process data in almost real-time ● monitoring ● web logs analysis ● fraud detection ● online ads Problem: no framework to do both batch & stream processing
  176. 176. How does Spark Streaming work? Spark Streaming live streamed data
  177. 177. How does Spark Streaming work? Spark Streaming RDD RDD RDD live streamed data small RDDs
  178. 178. How does Spark Streaming work? Spark Streaming RDD RDD RDD Spark Core live streamed data small RDDs output data
  179. 179. Spark Streaming - Usage val ssc = new StreamingContext(conf, Seconds(1)) Similar to SparkContext, we need to have an entry point for the new API
  180. 180. Spark Streaming - Usage val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) DStream is created (think of it as streamed RDD)
  181. 181. Spark Streaming - Usage val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) Exact same API as for RDD
  182. 182. Spark Streaming - Usage val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) ssc.start()
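One detail the snippet leaves out: an output operation such as print() has to be registered before start(), and the driver then blocks on awaitTermination(). A sketch of the usual ending:

  wordCounts.print()        // output operation, registered before start()
  ssc.start()               // start receiving data and processing batches
  ssc.awaitTermination()    // keep the driver alive until the streaming context is stopped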
  183. 183. twitter: @rabbitonweb, email: paul.szulc@gmail.com Q&A … if I manage
  184. 184. twitter: @rabbitonweb, email: paul.szulc@gmail.com Q&A paul.szulc@gmail.com, @rabbitonweb http://www.rabbitonweb.com … if I manage
  185. 185. twitter: @rabbitonweb, email: paul.szulc@gmail.com Thank you very much!
