Apache Spark & Hadoop


http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It has disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community's development of the surrounding open source components has pushed Hadoop to where it is now.

That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.

This talk gives an introduction to the Spark stack, explains how Spark achieves lightning-fast results, and shows how it complements Apache Hadoop.

Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is a Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large-scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.


Apache Spark & Hadoop

  1. Apache Spark. Keys Botzum, Senior Principal Technologist, MapR Technologies. June 2014
  2. Agenda •  MapReduce •  Apache Spark •  How Spark Works •  Fault Tolerance and Performance •  Examples •  Spark and More
  3. MapR: Best Product, Best Business & Best Customers. Top Ranked. Exponential Growth. 500+ Customers. Cloud Leaders. 3X bookings Q1 '13 – Q1 '14. 80% of accounts expand 3X. 90% software licenses. <1% lifetime churn. >$1B in incremental revenue generated by 1 customer.
  4. Review: MapReduce
  5. MapReduce: A Programming Model •  MapReduce: Simplified Data Processing on Large Clusters (published 2004) •  Parallel and Distributed Algorithm: •  Data Locality •  Fault Tolerance •  Linear Scalability
  6. MapReduce Basics •  Assumes a scalable distributed file system that shards data •  Map –  Loading of the data and defining a set of keys •  Reduce –  Collects the organized key-based data to process and output •  Performance can be tweaked based on known details of your source files and cluster shape (size, total number) (see the toy sketch below)
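To make the two phases concrete, here is a toy, single-process sketch of the MapReduce model in plain Python (not Hadoop code; the word-count logic mirrors the real example on slide 18):

    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # Map: emit (key, value) pairs; for word count, (word, 1).
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def shuffle(pairs):
        # Shuffle: group values by key (the framework does this automatically).
        for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
            yield key, [value for _, value in group]

    def reduce_phase(grouped):
        # Reduce: collapse each key's values to a single result.
        for key, values in grouped:
            yield key, sum(values)

    print(dict(reduce_phase(shuffle(map_phase(["the quick fox", "the fox"])))))
    # {'fox': 2, 'quick': 1, 'the': 2}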
  7. MapReduce Processing Model •  Define mappers •  Shuffling is automatic •  Define reducers •  For complex work, chain jobs together
  8. MapReduce: The Good •  Built-in fault tolerance •  Optimized IO path •  Scalable •  Developer focuses on Map/Reduce, not infrastructure •  simple? API
  9. MapReduce: The Bad •  Optimized for disk IO –  Doesn't leverage memory well –  Iterative algorithms go through the disk IO path again and again •  Primitive API –  Developers have to build on a very simple abstraction –  Key/Value in/out –  Even basic things like join require extensive code •  The result is often many files that need to be combined appropriately
  10. Apache Spark
  11. Apache Spark •  spark.apache.org •  github.com/apache/spark •  user@spark.apache.org •  Originally developed in 2009 in UC Berkeley's AMP Lab •  Fully open sourced in 2010 – now at the Apache Software Foundation, with a commercial vendor developing and supporting it
  12. Spark: Easy and Fast Big Data •  Easy to Develop –  Rich APIs in Java, Scala, Python –  Interactive shell –  2-5× less code •  Fast to Run –  General execution graphs –  In-memory storage
  13. Resilient Distributed Datasets (RDD) •  Spark revolves around RDDs •  Fault-tolerant, read-only collections of elements that can be operated on in parallel •  Cached in memory or on disk •  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  14. RDD Operations - Expressive •  Transformations –  Creation of a new RDD from an existing one •  map, filter, distinct, union, sample, groupByKey, join, reduce, etc. •  Actions –  Return a value after running a computation •  collect, count, first, takeSample, foreach, etc. •  Check the documentation for a complete list (and see the sketch below): http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
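A minimal PySpark sketch of that split, assuming an existing SparkContext `sc` and made-up data: transformations are lazy and only describe a new RDD, while actions trigger the actual computation.

    nums = sc.parallelize([1, 2, 3, 4, 5])     # create an RDD from a local collection

    evens = nums.filter(lambda n: n % 2 == 0)  # transformation: nothing runs yet
    doubled = evens.map(lambda n: n * 2)       # another lazy transformation

    print(doubled.collect())                   # action: runs the job, prints [4, 8]
    print(doubled.count())                     # action: prints 2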
  15. Easy: Clean API •  Resilient Distributed Datasets •  Collections of objects spread across a cluster, stored in RAM or on disk •  Built through parallel transformations •  Automatically rebuilt on failure •  Operations •  Transformations (e.g. map, filter, groupBy) •  Actions (e.g. count, collect, save) •  Write programs in terms of transformations on distributed datasets
  16. Easy: Expressive API •  map •  reduce
  17. Easy: Expressive API •  map •  filter •  groupBy •  sort •  union •  join •  leftOuterJoin •  rightOuterJoin •  reduce •  count •  fold •  reduceByKey •  groupByKey •  cogroup •  cross •  zip •  sample •  take •  first •  partitionBy •  mapWith •  pipe •  save •  ... (a few of these in action below)
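A few of the pair-RDD operations above in action (a PySpark sketch; `sc` is an existing SparkContext and the keys and values are made up):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    other = sc.parallelize([("a", "x"), ("c", "y")])

    pairs.reduceByKey(lambda a, b: a + b).collect()  # [('a', 4), ('b', 2)] (order may vary)
    pairs.groupByKey().mapValues(list).collect()     # [('a', [1, 3]), ('b', [2])]
    pairs.join(other).collect()                      # [('a', (1, 'x')), ('a', (3, 'x'))]
    pairs.leftOuterJoin(other).collect()             # 'b' keeps its value, paired with None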
  18. Easy: Example – Word Count. Hadoop MapReduce:
      public static class WordCountMapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }
      public static class WordCountReduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
      Spark:
      val spark = new SparkContext(master, appName, [sparkHome], [jars])
      val file = spark.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")
  19. Easy: Example – Word Count (repeats the comparison on the previous slide)
  20. Easy: Works Well With Hadoop •  Data Compatibility •  Access your existing Hadoop data •  Use the same data formats •  Adheres to data locality for efficient processing •  Deployment Models •  "Standalone" deployment •  YARN-based deployment •  Mesos-based deployment •  Deploy on an existing Hadoop cluster or side-by-side
  21. Easy: User-Driven Roadmap •  Language support –  Improved Python support –  SparkR –  Java 8 –  Integrated Schema and SQL support in Spark's APIs •  Better ML –  Sparse Data Support –  Model Evaluation Framework –  Performance Testing
  22. Example: Logistic Regression
      data = spark.textFile(...).map(readPoint).cache()
      w = numpy.random.rand(D)
      for i in range(iterations):
          gradient = data \
              .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
              .reduce(lambda x, y: x + y)
          w -= gradient
      print "Final w: %s" % w
  23. Fast: Logistic Regression Performance. [Chart: running time (s) vs. number of iterations (1–30). Hadoop takes about 110 s per iteration; Spark takes 80 s for the first iteration and 1 s for each further iteration.]
  24. Easy: Multi-language Support
      Python:
      lines = sc.textFile(...)
      lines.filter(lambda s: "ERROR" in s).count()
      Scala:
      val lines = sc.textFile(...)
      lines.filter(x => x.contains("ERROR")).count()
      Java:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        Boolean call(String s) { return s.contains("error"); }
      }).count();
  25. Easy: Interactive Shell. Scala-based shell:
      % /opt/mapr/spark/spark-0.9.1/bin/spark-shell
      scala> val logs = sc.textFile("hdfs:///user/keys/logdata")
      scala> logs.count()
      ...
      res0: Long = 232681
      scala> logs.filter(l => l.contains("ERROR")).count()
      ...
      res1: Long = 205
      A Python-based shell is available as well: pyspark
  26. Fault Tolerance and Performance
  27. Fast: Using RAM, Operator Graphs •  In-memory Caching •  Data partitions read from RAM instead of disk •  Operator Graphs •  Scheduling Optimizations •  Fault Tolerance. [Diagram: an operator graph over RDDs A–F built from map, join, filter, and groupBy, split into stages 1–3, with cached partitions marked.]
  28. Directed Acyclic Graph (DAG) •  Directed –  Only in a single direction •  Acyclic –  No looping •  This supports fault tolerance (see the lineage sketch below)
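Spark exposes the graph it records: every RDD can print its lineage, the chain of parent RDDs it would replay to rebuild a lost partition. A PySpark sketch (the path and field layout are placeholders):

    lines = sc.textFile("hdfs:///path/to/logs")
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    counts = errors.map(lambda s: (s.split("\t")[0], 1)).reduceByKey(lambda a, b: a + b)

    # Prints the DAG of parent RDDs, stage by stage.
    print(counts.toDebugString())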
  29. Easy: Fault Recovery. RDDs track lineage information that can be used to efficiently recompute lost data:
      msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                     .map(lambda s: s.split("\t")[2])
      [Diagram: HDFS File → Filtered RDD via filter(func = startswith(...)) → Mapped RDD via map(func = split(...)).]
  30. RDD Persistence / Caching •  Variety of storage levels –  memory_only (default), memory_and_disk, etc. •  API calls –  persist(StorageLevel) –  cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) •  Considerations –  Read from disk vs. recompute (memory_and_disk) –  Total memory storage size (memory_only_ser) –  Replicate to a second node for faster fault recovery (memory_only_2) •  Think about this option if supporting a time-sensitive client •  http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence (example below)
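For example, to keep a dataset in memory but spill to local disk when it does not fit (a PySpark sketch; the path is a placeholder):

    from pyspark import StorageLevel

    data = sc.textFile("hdfs:///path/to/data")
    data.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk rather than recompute

    data.count()  # first action computes the RDD and populates the cache
    data.count()  # later actions reuse the cached partitions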
  31. PageRank Performance. [Chart: iteration time (s) on 30 and 60 machines. Hadoop: 171 s and 80 s; Spark: 23 s and 14 s.]
  32. Other Iterative Algorithms. [Chart: time per iteration (s). Logistic Regression: Hadoop 110, Spark 0.96. K-Means Clustering: Hadoop 155, Spark 4.1.]
  33. Fast: Scaling Down. [Chart: execution time (s) vs. % of working set in cache: 69 s with cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, 12 s fully cached.]
  34. Comparison to Storm •  Higher throughput than Storm –  Spark Streaming: 670k records/sec/node –  Storm: 115k records/sec/node –  Commercial systems: 100-500k records/sec/node. [Charts: throughput per node (MB/s) vs. record size (100–1000 bytes) for WordCount and Grep; Spark outperforms Storm in both.] (A minimal streaming sketch follows below.)
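For flavor, a minimal Spark Streaming word count (a sketch; note that the Python streaming API shipped in Spark releases later than the 0.9.1 used elsewhere in this deck, and the socket source on localhost:9999 is an assumption):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 1)  # 1-second micro-batches over an existing SparkContext

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda l: l.split(" "))
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print each batch's counts

    ssc.start()
    ssc.awaitTermination()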
  35. How Spark Works
  36. Working With RDDs
  37. Working With RDDs
      textFile = sc.textFile("SomeFile.txt")  # produces an RDD
  38. Working With RDDs. [Diagram: a chain of RDDs linked by transformations.]
      textFile = sc.textFile("SomeFile.txt")
      linesWithSpark = textFile.filter(lambda line: "Spark" in line)
  39. Working With RDDs. [Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value.]
      textFile = sc.textFile("SomeFile.txt")
      linesWithSpark = textFile.filter(lambda line: "Spark" in line)
      linesWithSpark.count()  # 74
      linesWithSpark.first()  # '# Apache Spark'
  40. Example: Log Mining
  41. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
  42. [Diagram: a driver program coordinating three workers.]
  43. lines = spark.textFile("hdfs://...")
  44. lines is the base RDD.
  45. errors = lines.filter(lambda s: s.startswith("ERROR"))
  46. errors is a transformed RDD.
  47. messages = errors.map(lambda s: s.split("\t")[2]); messages.cache(); messages.filter(lambda s: "mysql" in s).count()
  48. count() is an action.
  49. [Diagram: the input file's blocks 1–3 live on the three workers.]
  50. [Diagram: the driver ships tasks to the workers.]
  51. [Diagram: each worker reads its HDFS block.]
  52. [Diagram: each worker processes its block and caches the data (caches 1–3).]
  53. [Diagram: the workers send results back to the driver.]
  54. messages.filter(lambda s: "php" in s).count()
  55. [Diagram: the driver ships tasks for the new query.]
  56. [Diagram: the workers process from cache; no HDFS read is needed.]
  57. [Diagram: the workers send results back to the driver.]
  58. Cache your data ⇒ faster results. Full-text search of Wikipedia: 60 GB on 20 EC2 machines; 0.5 s from cache vs. 20 s on disk. (A consolidated version of this example follows below.)
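Pulling the frames together, a consolidated sketch of the example (PySpark; the HDFS path is a placeholder):

    from pyspark import SparkContext

    sc = SparkContext(appName="LogMining")

    lines = sc.textFile("hdfs:///path/to/logs")
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()  # the first action materializes the cache

    print(messages.filter(lambda s: "mysql" in s).count())  # reads HDFS, fills the cache
    print(messages.filter(lambda s: "php" in s).count())    # served from cache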
  59. Example: Page Rank
  60. Example: PageRank •  Good example of a more complex algorithm –  Multiple stages of map & reduce •  Benefits from Spark's in-memory caching –  Multiple iterations over the same data
  61. Basic Idea. Give pages ranks (scores) based on links to them: •  Links from many pages ⇒ high rank •  Link from a high-rank page ⇒ high rank. Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
  62. Algorithm: 1. Start each page at a rank of 1. 2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors. 3. Set each page's rank to 0.15 + 0.85 × contribs. [Diagram: four pages, all at rank 1.0.]
  63. Algorithm (same steps). [Diagram: the pages send contributions of 1, 0.5, 0.5, 0.5, 1, and 0.5 along their links.]
  64. Algorithm (same steps). [Diagram: ranks after the update: 0.58, 1.0, 1.85, 0.58.]
  65. Algorithm (same steps). [Diagram: the pages, now at ranks 0.58, 1.0, 1.85, and 0.58, send the next round of contributions.]
  66. Algorithm (same steps). [Diagram: ranks after further iterations: 0.39, 1.72, 1.31, 0.58, ...]
  67. Algorithm (same steps). [Diagram: final state: ranks 0.46, 1.37, 1.44, 0.73.]
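To check one update by hand, step 3 of the algorithm is

    r_p \leftarrow 0.15 + 0.85 \sum_{q \to p} \frac{r_q}{|\mathrm{neighbors}_q|}

so a page whose only incoming contribution is 0.5 gets 0.15 + 0.85 × 0.5 = 0.575 ≈ 0.58, and a page receiving 1 + 0.5 + 0.5 = 2 gets 0.15 + 0.85 × 2 = 1.85, matching the values in the frames above.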
  68. Scala Implementation
      val links = // load RDD of (url, neighbors) pairs
      var ranks = // give each url rank of 1.0
      for (i <- 1 to ITERATIONS) {
        val contribs = links.join(ranks).values.flatMap {
          case (urls, rank) =>
            urls.map(dest => (dest, rank / urls.size))
        }
        ranks = contribs.reduceByKey(_ + _)
                        .mapValues(0.15 + 0.85 * _)
      }
      ranks.saveAsTextFile(...)
      https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
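A hypothetical PySpark port of the same loop (a sketch, not the official example linked above; the "url neighbor" input format, the paths, and the iteration count are assumptions):

    from pyspark import SparkContext

    sc = SparkContext(appName="PageRankSketch")

    # (url, neighbor) pairs, grouped to (url, [neighbors]); reused every iteration, so cache it.
    links = (sc.textFile("hdfs:///path/to/links.txt")
               .map(lambda line: tuple(line.split()))
               .groupByKey()
               .cache())
    ranks = links.mapValues(lambda _: 1.0)  # start every page at rank 1.0

    def contributions(url_neighbors_rank):
        # Each page splits its rank evenly among its neighbors.
        url, (neighbors, rank) = url_neighbors_rank
        neighbors = list(neighbors)
        for dest in neighbors:
            yield (dest, rank / len(neighbors))

    for _ in range(10):  # ITERATIONS
        contribs = links.join(ranks).flatMap(contributions)
        ranks = (contribs.reduceByKey(lambda a, b: a + b)
                         .mapValues(lambda s: 0.15 + 0.85 * s))

    ranks.saveAsTextFile("hdfs:///path/to/output")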
  69. Spark and More
  70. Easy: Unified Platform. Spark SQL (SQL), Spark Streaming (streaming), MLLib (machine learning), and GraphX (graph computation), all on Spark (the general execution engine). Continued innovation bringing new functionality, e.g.: •  BlinkDB (approximate queries) •  SparkR (R wrapper for Spark) •  Tachyon (off-heap RDD caching)
  71. Spark on MapR •  Certified Spark Distribution •  Fully supported and packaged by MapR in partnership with Databricks –  mapr-spark package with Spark, Shark, Spark Streaming today –  Spark-python, GraphX, and MLLib soon •  YARN integration –  Spark can then allocate resources from the cluster when needed
  72. References •  Based on slides from Pat McDonough •  Spark web site: http://spark.apache.org/ •  Spark on MapR: –  http://www.mapr.com/products/apache-spark –  http://doc.mapr.com/display/MapR/Installing+Spark+and+Shark
  73. Q&A. Engage with us! @mapr, maprtech, kbotzum@mapr.com, MapR, maprtech, mapr-technologies