
Real-time Analytics with Cassandra, Spark, and Shark


  1. Real-time Analytics with Cassandra, Spark and Shark
  2. Who is this guy
     • Staff Engineer, Compute and Data Services, Ooyala
     • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
     • Scala/Akka guy
     • Very excited by open source, big data projects - will share some today
     • @evanfchan
  3. Agenda
     • Ooyala and Cassandra
     • What problem are we trying to solve?
     • Spark and Shark
     • Our Spark/Cassandra Architecture
     • Demo
  4. Cassandra at Ooyala: who Ooyala is, and how we use Cassandra
  5. OOYALA: Powering personalized video experiences across all screens.
  6. Company overview
     • Founded in 2007; commercially launched in 2009
     • 230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
     • Global footprint: 200M unique users in 110+ countries, across more than 6,000 websites
     • Over 1 billion videos played per month and 2 billion analytic events per day
     • 25% of U.S. online viewers watch video powered by Ooyala
  7. Trusted video partner: strategic partners and customers [logo slide]
  8. We are a large Cassandra user
     • 11 clusters ranging in size from 3 to 36 nodes
     • Total of 28 TB of data managed over ~85 nodes
     • Over 2 billion C* column writes per day
     • Powers all of our analytics infrastructure
     • Much, much bigger cluster coming soon
  9. What problem are we trying to solve? Lots of data, complex queries, answered really quickly... but how??
 10. From mountains of useless data...
 11. To nuggets of truth...
     • Quickly
     • Painlessly
     • At scale?
 12. Today: Precomputed aggregates
     • Video metrics computed along several high-cardinality dimensions
     • Very fast lookups, but inflexible and hard to change
     • Most computed aggregates are never read
     • What if we need more dynamic queries?
       – Top content for mobile users in France
       – Engagement curves for users who watched recommendations
       – Data mining, trends, machine learning
 13. The static - dynamic continuum
     100% Precomputation:
     • Super fast lookups
     • Inflexible, wasteful
     • Best for the 80% most common queries
     100% Dynamic:
     • Always compute results from raw data
     • Flexible but slow
 14. Where we want to be: partly dynamic
     • Pre-aggregate the most common queries
     • Flexible, fast dynamic queries
     • Easily generate many materialized views
 15. Industry Trends
     • Fast execution frameworks
       – Impala
     • In-memory databases
       – VoltDB, Druid
     • Streaming and real-time
     • Higher-level, productive data frameworks
       – Cascading, Hive, Pig
 16. Why Spark and Shark? "Lightning-fast in-memory cluster computing"
 17. Introduction to Spark
     • In-memory distributed computing framework
     • Created by the UC Berkeley AMP Lab in 2010
     • Targets problems that MapReduce is bad at:
       – Iterative algorithms (machine learning)
       – Interactive data mining
     • More general purpose than Hadoop MR
     • Active contributions from ~15 companies
 18. [Diagram] Hadoop chains jobs through disk: HDFS -> Map -> Reduce -> HDFS -> Map -> Reduce.
     Spark builds an arbitrary DAG: Data Source (+ Source 2) -> map() -> join() -> cache() -> transform.
     (A Scala sketch of the Spark side follows below.)
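     A minimal sketch of that DAG, using the Spark 0.7-era API that appears later
     in the deck (paths and field layouts are illustrative):

       import spark.SparkContext
       import SparkContext._  // implicit conversions enabling pair-RDD ops like join()

       object DagExample {
         def main(args: Array[String]) {
           val sc = new SparkContext("local", "DagExample")

           // Two data sources, keyed by a shared id (illustrative CSV layout)
           val source1 = sc.textFile("hdfs://.../events").map(l => (l.split(",")(0), l))
           val source2 = sc.textFile("hdfs://.../metadata").map(l => (l.split(",")(0), l))

           // join() then cache(): the joined dataset stays in memory...
           val joined = source1.join(source2).cache()

           // ...so further transforms reuse it without re-reading the source
           println(joined.count())
           println(joined.filter(_._2._1.contains("play")).count())
         }
       }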
 19. Throughput: Memory is king
     [Bar chart: records/sec for C* cold cache vs. C* warm cache vs. Spark RDD; y-axis 0 to 150,000]
     6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
 20. Developers love it
     • "I wrote my first aggregation job in 30 minutes"
     • High-level "distributed collections" API
     • No Hadoop cruft
     • Full power of Scala, Java, Python
     • Interactive REPL shell
     • EASY testing!!
     • Low latency - quick development cycles
 21. Spark word count example: the same job in Hadoop MapReduce Java and in Spark Scala.
     (A fully runnable version of the Spark snippet follows below.)

     Hadoop MapReduce version (~60 lines of Java):

       package org.myorg;

       import java.io.IOException;
       import java.util.*;

       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.conf.*;
       import org.apache.hadoop.io.*;
       import org.apache.hadoop.mapreduce.*;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
       import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

       public class WordCount {

         public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(LongWritable key, Text value, Context context)
               throws IOException, InterruptedException {
             String line = value.toString();
             StringTokenizer tokenizer = new StringTokenizer(line);
             while (tokenizer.hasMoreTokens()) {
               word.set(tokenizer.nextToken());
               context.write(word, one);
             }
           }
         }

         public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
           public void reduce(Text key, Iterable<IntWritable> values, Context context)
               throws IOException, InterruptedException {
             int sum = 0;
             for (IntWritable val : values) {
               sum += val.get();
             }
             context.write(key, new IntWritable(sum));
           }
         }

         public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           Job job = new Job(conf, "wordcount");

           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           job.setMapperClass(Map.class);
           job.setReducerClass(Reduce.class);
           job.setInputFormatClass(TextInputFormat.class);
           job.setOutputFormatClass(TextOutputFormat.class);

           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));

           job.waitForCompletion(true);
         }
       }

     Spark version (four lines of Scala; "spark" here is the SparkContext):

       val file = spark.textFile("hdfs://...")
       file.flatMap(line => line.split(" "))
           .map(word => (word, 1))
           .reduceByKey(_ + _)
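     For completeness, a runnable version of the Spark snippet with job setup and
     an output action added (Spark 0.7-era API; a "local" master is assumed for testing):

       import spark.SparkContext
       import SparkContext._  // brings in reduceByKey and friends for (K, V) RDDs

       object SparkWordCount {
         def main(args: Array[String]) {
           val sc   = new SparkContext("local", "SparkWordCount")
           val file = sc.textFile(args(0))   // any local or HDFS path

           val counts = file.flatMap(line => line.split(" "))
                            .map(word => (word, 1))
                            .reduceByKey(_ + _)

           counts.saveAsTextFile(args(1))    // an action triggers the lazy computation
         }
       }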
 22. The Spark Ecosystem
     • Spark
     • Tachyon - in-memory caching DFS
     • Bagel - Pregel on Spark
     • HIVE on Spark (Shark)
     • Spark Streaming - discretized stream processing
 23. Shark - HIVE on Spark
     • 100% HiveQL compatible
     • 10-100x faster than HIVE; answers in seconds
     • Reuse UDFs, SerDes, StorageHandlers
     • Can use DSE / CassandraFS for the Metastore
     • Easy Scala/Java integration via Spark - easier than writing UDFs (see the sketch below)
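     A sketch of that Scala integration; the SharkContext entry points (sql,
     sql2rdd) are assumed from Shark 0.7-era usage, and the table is hypothetical:

       import shark.{SharkContext, SharkEnv}

       object SharkIntegration {
         def main(args: Array[String]) {
           // Bootstrap assumed from Shark 0.7-era examples; treat as a sketch
           val sc: SharkContext = SharkEnv.initWithSharkContext("SharkIntegration")

           // Plain HiveQL, executed on Spark ("events" is a hypothetical table)
           sc.sql("CREATE TABLE IF NOT EXISTS plays_by_provider AS " +
                  "SELECT providerid, count(*) AS plays FROM events GROUP BY providerid")

           // sql2rdd hands results back as an RDD of rows, so post-processing
           // happens in ordinary Scala instead of a Hive UDF
           val rows = sc.sql2rdd("SELECT * FROM plays_by_provider")
           println(rows.count())
         }
       }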
 24. Our new analytics architecture: how we integrate Cassandra and Spark/Shark
 25. From raw events to fast queries
     • Ingestion: raw events land in the C* event store
     • Spark jobs turn the event store into materialized views (View 1, View 2, View 3)
     • Spark serves predefined queries over the views; Shark serves ad-hoc HiveQL
     (A sketch of the view-building step follows below.)
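     One cached pass over the raw events can feed several materialized views; a
     sketch with an illustrative record layout (the real job reads C* through our
     InputFormat rather than text files):

       import spark.SparkContext
       import SparkContext._

       object MaterializedViews {
         def main(args: Array[String]) {
           val sc = new SparkContext("local", "MaterializedViews")

           // Illustrative stand-in for the event store: "userId,videoId,type" lines
           val events = sc.textFile(args(0)).map { line =>
             val Array(user, video, kind) = line.split(",")
             (user, video.toInt, kind.toInt)
           }.cache()  // cached once, reused by every view below

           // View 1: plays per video; View 2: events per user
           val playsPerVideo = events.map(e => (e._2, 1)).reduceByKey(_ + _)
           val eventsPerUser = events.map(e => (e._1, 1)).reduceByKey(_ + _)

           playsPerVideo.saveAsTextFile(args(1))
           eventsPerUser.saveAsTextFile(args(2))
         }
       }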
 26. Our Spark/Shark/Cassandra Stack
     • Every node runs Cassandra, our InputFormat and SerDe, a Spark Worker, and Shark
     • A Spark Master and a Job Server coordinate jobs across the nodes
 27. Event Store Cassandra schema (modeled in Scala below)
     Event CF - row key "2013-04-05T00:00Z#id1", columns t0..t4:
       {event0: a0} {event1: a1} {event2: a2} {event3: a3} {event4: a4}
     EventAttr CF - same row key, columns indexing attributes back to event columns:
       ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0
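     The same layout expressed as Scala types, purely to make the two column
     families concrete (type names are ours, not from the deck):

       // Row key for both CFs: a time bucket plus an id, e.g. "2013-04-05T00:00Z#id1"
       case class EventRowKey(bucket: String, id: String) {
         override def toString = bucket + "#" + id
       }

       // Event CF: columns t0, t1, ... each hold one event's attribute map
       case class EventColumn(seq: Int, attrs: Map[String, String])

       // EventAttr CF: under the same row key, columns like "videoId:45678:t1"
       // point an attribute/value pair back to the event column containing it
       case class EventAttrColumn(attr: String, value: String, eventRef: String)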
 28. Unpacking raw events (sketch below)
     Wide rows in C*:
       2013-04-05T00:00Z#id1 -> t0: {video: 10, type: 5}, t1: {video: 11, type: 1}
       2013-04-05T00:00Z#id2 -> t0: {video: 20, type: 5}, t1: {video: 25, type: 9}
     Unpacked to flat (UserID, Video, Type) records:
       (id1, 10, 5), (id1, 11, 1), (id2, 20, 5), (id2, 25, 9)
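     A sketch of the unpacking step, with the slide's sample data wired in (the
     wide-row representation is simplified to plain Scala collections):

       object UnpackEvents {
         // One wide row: row key -> ordered event maps from columns t0, t1, ...
         def unpack(rowKey: String, events: Seq[Map[String, Int]]) = {
           val userId = rowKey.split("#")(1)   // "2013-04-05T00:00Z#id1" -> "id1"
           events.map(e => (userId, e("video"), e("type")))
         }

         def main(args: Array[String]) {
           val row1 = "2013-04-05T00:00Z#id1" ->
             Seq(Map("video" -> 10, "type" -> 5), Map("video" -> 11, "type" -> 1))
           val row2 = "2013-04-05T00:00Z#id2" ->
             Seq(Map("video" -> 20, "type" -> 5), Map("video" -> 25, "type" -> 9))

           // Yields (id1,10,5), (id1,11,1), (id2,20,5), (id2,25,9) -- the flat table
           Seq(row1, row2).flatMap { case (key, events) => unpack(key, events) }
                          .foreach(println)
         }
       }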
 29. Tips for InputFormat Development
     • Know which target platforms you are developing for
       – Which API to write against? New? Old? Both?
     • Be prepared to spend time tuning your split computation
       – Low-latency jobs require fast splits
     • Consider sorting row keys by token for data locality
     • Implement predicate pushdown for HIVE SerDes
       – Use your indexes to reduce the size of the dataset
     (A skeleton of the InputFormat shape follows below.)
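     For reference, the shape of a Hadoop "new API" InputFormat, with the tips
     above as comments; the bodies are stubbed since split logic is driver-specific:

       import java.util.{List => JList}
       import org.apache.hadoop.io.Text
       import org.apache.hadoop.mapreduce.{InputFormat, InputSplit, JobContext,
                                           RecordReader, TaskAttemptContext}

       // Skeleton only -- the class name is ours, and real key/value types
       // would match whatever the SerDe expects
       class CassandraEventInputFormat extends InputFormat[Text, Text] {

         // The hot path for low-latency jobs: compute splits cheaply, and group
         // row keys by C* token so each split lands next to its replica
         override def getSplits(context: JobContext): JList[InputSplit] =
           throw new UnsupportedOperationException("sketch only")

         override def createRecordReader(split: InputSplit,
                                         context: TaskAttemptContext): RecordReader[Text, Text] =
           throw new UnsupportedOperationException("sketch only")
       }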
 30. Example: OLAP processing (sketch below)
     • C* events (e.g. 2013-04-05T00:00Z#id1 -> {video: 10, type: 5}) are read by Spark workers
     • Spark builds cached, materialized OLAP aggregates from them
     • A union across the cached aggregates serves queries such as:
       – Query 1: Plays by Provider
       – Query 2: Top content for mobile
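     A sketch of the union step: cached per-bucket aggregates combined and
     queried in memory (the aggregate record layout is illustrative):

       import spark.SparkContext
       import SparkContext._

       object OlapUnionQuery {
         // Illustrative aggregate record: "providerId,playCount"
         def parseAgg(line: String) = {
           val Array(provider, plays) = line.split(",")
           (provider, plays.toLong)
         }

         def main(args: Array[String]) {
           val sc = new SparkContext("local", "OlapUnionQuery")

           // Two cached materialized views, e.g. one per time bucket
           val bucket1 = sc.textFile(args(0)).map(parseAgg).cache()
           val bucket2 = sc.textFile(args(1)).map(parseAgg).cache()

           // Query 1, "plays by provider", over the union of the cached views
           val playsByProvider = bucket1.union(bucket2).reduceByKey(_ + _)
           playsByProvider.collect().foreach(println)
         }
       }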
 31. Performance numbers (6-node C*/DSE 1.1.9 cluster, Spark 0.7.0)
     • Spark, C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds
     • Spark, C* -> OLAP aggregates, warmed cache: 20-30 seconds
     • OLAP aggregate query via Spark (56k records): 60 ms
 32. OLAP Workflow (a hypothetical interface sketch follows)
     • The REST Job Server submits an Aggregation Job to the Spark executors
     • The job reads events from Cassandra and caches the aggregated Dataset
     • Query Jobs, also submitted via REST, run against the cached Dataset and return Results
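     The job server shown was internal at the time, so the contract below is
     illustrative, not its actual API; the key idea is that jobs share one
     long-lived SparkContext, which is what lets a query job see the dataset an
     aggregation job cached:

       import spark.SparkContext

       // Illustrative contract: every REST-submitted job runs against the same
       // long-lived SparkContext, so cached RDDs survive across jobs
       trait SparkJob {
         def run(sc: SparkContext, args: Map[String, String]): Any
       }

       class AggregationJob extends SparkJob {
         def run(sc: SparkContext, args: Map[String, String]): Any = {
           // ...read C* events, aggregate, cache() the dataset...
           "dataset-id"   // handle returned to the REST caller for later query jobs
         }
       }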
 33. Fault Tolerance
     • The cached dataset lives in the Java heap only - what if the process dies?
     • Spark lineage - automatic recomputation from source, but this is expensive!
     • Can also replicate the cached dataset to survive single-node failures (see the sketch below)
     • Persist materialized views back to C*, then load them into cache - now the recovery path is much faster
     • Persistence also lets multiple processes hold the cached dataset
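     The replication point in code: persisting with a "_2" storage level keeps a
     second in-memory copy on another node (Spark 0.7-era API; path illustrative):

       import spark.SparkContext
       import spark.storage.StorageLevel

       object ReplicatedCache {
         def main(args: Array[String]) {
           val sc    = new SparkContext("local", "ReplicatedCache")
           val views = sc.textFile(args(0))

           // Like cache(), but replicated: a single executor death no longer
           // forces a full recomputation from the raw events
           views.persist(StorageLevel.MEMORY_ONLY_2)

           println(views.count())
         }
       }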
 34. Demo time
 35. Shark Demo
     • Local Shark node, 1 core, on a MacBook Pro
     • How to create a table from C* using our InputFormat
     • Creating a cached Shark table
     • Running fast queries
 36. Creating a Shark Table from InputFormat [screenshot]
 37. Creating a cached table [screenshot]
 38. Querying cached table [screenshot] (a reconstruction of these three steps follows below)
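     The three demo slides above survive only as titles, so this is a hedged
     reconstruction: the storage-handler class, table names, and the _cached
     naming convention are assumed from Shark 0.7-era conventions, not taken
     from the deck:

       import shark.SharkContext

       object SharkDemoSketch {
         def run(sc: SharkContext) {
           // 1. Expose the C* event store to HiveQL via an InputFormat/SerDe
           sc.sql("CREATE EXTERNAL TABLE events (userid STRING, video INT, type INT) " +
                  "STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'")

           // 2. Shark convention: a table whose name ends in _cached lives in memory
           sc.sql("CREATE TABLE events_cached AS SELECT * FROM events")

           // 3. Fast queries run against the in-memory table
           sc.sql("SELECT video, count(*) FROM events_cached GROUP BY video")
         }
       }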
 39. THANK YOU
     • @evanfchan
     • ev@ooyala.com
     • WE ARE HIRING!!
 40. Spark: Under the hood
     [Diagram] One executor process per node: the Driver coordinates Map -> Dataset -> Reduce -> Map stages across the executors.
