Real-time Analytics with Cassandra, Spark, and Shark

"Real-time Analytics with Cassandra, Spark, and Shark", my talk at the Cassandra Summit 2013 conference in San Francisco


  1. Real-time Analytics with Cassandra, Spark and Shark
  2. Who is this guy
     • Staff Engineer, Compute and Data Services, Ooyala
     • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
     • Scala/Akka guy
     • Very excited by open source, big data projects - share some today
     • @evanfchan
  3-8. Agenda
     • Ooyala and Cassandra
     • What problem are we trying to solve?
     • Spark and Shark
     • Our Spark/Cassandra Architecture
     • Demo
  9. Cassandra at Ooyala - Who is Ooyala, and how we use Cassandra
  10. OOYALA - Powering personalized video experiences across all screens.
  11. Company overview
     • Founded in 2007; commercially launched in 2009
     • 230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
     • Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
     • Over 1 billion videos played per month and 2 billion analytic events per day
     • 25% of U.S. online viewers watch video powered by Ooyala
  12. Trusted video partner: strategic partners and customers (logo slide)
  13-18. We are a large Cassandra user
     • 11 clusters ranging in size from 3 to 36 nodes
     • Total of 28TB of data managed over ~85 nodes
     • Over 2 billion C* column writes per day
     • Powers all of our analytics infrastructure
     • Much, much bigger cluster coming soon
  19. What problem are we trying to solve? Lots of data, complex queries, answered really quickly... but how?
  20. From mountains of useless data...
  21-24. To nuggets of truth...
     • Quickly
     • Painlessly
     • At scale?
  25-29. Today: Precomputed aggregates
     • Video metrics computed along several high-cardinality dimensions
     • Very fast lookups, but inflexible and hard to change
     • Most computed aggregates are never read
     • What if we need more dynamic queries?
       - Top content for mobile users in France
       - Engagement curves for users who watched recommendations
       - Data mining, trends, machine learning
  30-31. The static-dynamic continuum
     • 100% precomputation: super fast lookups; inflexible and wasteful; best for the 80% most common queries
     • 100% dynamic: always compute results from raw data; flexible but slow
  32. Where we want to be: partly dynamic
     • Pre-aggregate the most common queries
     • Flexible, fast dynamic queries
     • Easily generate many materialized views
  33-37. Industry Trends
     • Fast execution frameworks - Impala
     • In-memory databases - VoltDB, Druid
     • Streaming and real-time
     • Higher-level, productive data frameworks - Cascading, Hive, Pig
  38. Why Spark and Shark? “Lightning-fast in-memory cluster computing”
  39-44. Introduction to Spark
     • In-memory distributed computing framework
     • Created by UC Berkeley AMP Lab in 2010
     • Targeted problems that MR is bad at:
       - Iterative algorithms (machine learning)
       - Interactive data mining
     • More general purpose than Hadoop MR
     • Active contributions from ~15 companies
  45-49. [Diagram] Hadoop vs. Spark execution
     • Hadoop: chained MapReduce jobs, writing to and reading from HDFS between stages
     • Spark: a data source and a second source flow through map(), join(), and further transforms, with cache() keeping intermediate data in memory
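
     A minimal Scala sketch of the pipeline those labels suggest (two sources, map(), join(), cache(), then further transforms); the paths and record layouts are assumptions for illustration, not the talk's actual job:

     import org.apache.spark.SparkContext
     import org.apache.spark.SparkContext._   // pair-RDD operations such as join() and reduceByKey()

     object DagSketch {
       def main(args: Array[String]) {
         val sc = new SparkContext("local[2]", "dag-sketch")

         // Two sources, e.g. play events and video metadata (hypothetical layouts)
         val plays = sc.textFile("hdfs://.../plays").map { line =>
           val Array(videoId, userId) = line.split(","); (videoId, userId)
         }
         val titles = sc.textFile("hdfs://.../videos").map { line =>
           val Array(videoId, title) = line.split(","); (videoId, title)
         }

         // join() the two sources, then cache() so later transforms reuse memory instead of re-reading
         val joined = plays.join(titles).cache()

         // Further transforms run against the cached dataset, not the original files
         val playsPerTitle = joined.map { case (_, (_, title)) => (title, 1) }.reduceByKey(_ + _)
         playsPerTitle.take(10).foreach(println)
         sc.stop()
       }
     }
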
  50-54. Throughput: Memory is king
     [Bar chart comparing throughput (axis 0-150,000) for C* with a cold cache, C* with a warm cache, and a Spark RDD; 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0]
  55-62. Developers love it
     • “I wrote my first aggregation job in 30 minutes”
     • High-level “distributed collections” API
     • No Hadoop cruft
     • Full power of Scala, Java, Python
     • Interactive REPL shell
     • EASY testing!! (see the sketch below)
     • Low latency - quick development cycles
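
     A rough illustration of the “distributed collections” feel and of why testing is easy: a local master runs the whole job in-process, so a unit test needs no cluster. The event layout below is made up for the example:

     import org.apache.spark.SparkContext
     import org.apache.spark.SparkContext._
     import org.apache.spark.rdd.RDD

     object PlaysByCountry {
       // A pure function over an RDD: easy to exercise from a test with a tiny in-memory dataset
       def playsByCountry(events: RDD[(String, String)]): RDD[(String, Int)] =
         events.map { case (country, _) => (country, 1) }.reduceByKey(_ + _)

       def main(args: Array[String]) {
         // "local[2]" runs driver and executors in one JVM - no cluster needed for development or tests
         val sc = new SparkContext("local[2]", "plays-by-country")
         val events = sc.parallelize(Seq(("FR", "video1"), ("FR", "video2"), ("US", "video1")))
         playsByCountry(events).collect().foreach(println)   // (FR,2), (US,1)
         sc.stop()
       }
     }
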
  63-64. Spark word count example
     The Hadoop MapReduce version:

       package org.myorg;

       import java.io.IOException;
       import java.util.*;

       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.conf.*;
       import org.apache.hadoop.io.*;
       import org.apache.hadoop.mapreduce.*;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
       import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

       public class WordCount {

         public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(LongWritable key, Text value, Context context)
               throws IOException, InterruptedException {
             String line = value.toString();
             StringTokenizer tokenizer = new StringTokenizer(line);
             while (tokenizer.hasMoreTokens()) {
               word.set(tokenizer.nextToken());
               context.write(word, one);
             }
           }
         }

         public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

           public void reduce(Text key, Iterable<IntWritable> values, Context context)
               throws IOException, InterruptedException {
             int sum = 0;
             for (IntWritable val : values) {
               sum += val.get();
             }
             context.write(key, new IntWritable(sum));
           }
         }

         public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();

           Job job = new Job(conf, "wordcount");

           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);

           job.setMapperClass(Map.class);
           job.setReducerClass(Reduce.class);

           job.setInputFormatClass(TextInputFormat.class);
           job.setOutputFormatClass(TextOutputFormat.class);

           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));

           job.waitForCompletion(true);
         }
       }

     The Spark version:

       file = spark.textFile("hdfs://...")
       file.flatMap(line => line.split(" "))
           .map(word => (word, 1))
           .reduceByKey(_ + _)
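
     The slide's Spark snippet ends with a transformation, so nothing executes until an action is invoked. A possible completion, where spark is the SparkContext as on the slide and the output path is left as a placeholder:

     val file = spark.textFile("hdfs://...")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

     // Transformations are lazy; an action such as saveAsTextFile (or collect) actually runs the job
     counts.saveAsTextFile("hdfs://...")
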
  65-68. The Spark Ecosystem
     • Spark
     • Tachyon - in-memory caching DFS
     • Bagel - Pregel on Spark
     • HIVE on Spark (Shark)
     • Spark Streaming - discretized stream processing
  69-74. Shark - HIVE on Spark
     • 100% HiveQL compatible
     • 10-100x faster than HIVE, answers in seconds
     • Reuse UDFs, SerDes, StorageHandlers
     • Can use DSE / CassandraFS for the Metastore
     • Easy Scala/Java integration via Spark - easier than writing UDFs
  75. Our new analytics architecture: how we integrate Cassandra and Spark/Shark
  76-80. [Diagram] From raw events to fast queries
     • Raw events flow through an ingestion layer into the C* event store
     • Spark jobs read the event store and build materialized views (View 1, View 2, View 3)
     • Spark serves predefined queries; Shark serves ad-hoc HiveQL
  81-85. [Diagram] Our Spark/Shark/Cassandra Stack
     • Each node (Node 1, 2, 3) runs Cassandra, our InputFormat and SerDe, a Spark Worker, and Shark
     • A Spark Master and a Job Server sit alongside the cluster
  86-87. Event Store Cassandra schema
     Event CF - one wide row per time bucket and id:
       row key: 2013-04-05T00:00Z#id1
       columns: t0 -> {event0:a0}, t1 -> {event1:a1}, t2 -> {event2:a2}, t3 -> {event3:a3}, t4 -> {event4:a4}
     EventAttr CF - attribute columns for the same row key:
       row key: 2013-04-05T00:00Z#id1
       columns: ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0
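
     A small Scala sketch of how such row keys and attribute column names could be composed; the bucket granularity and separators are simply read off the example values above, so treat them as assumptions:

     import java.text.SimpleDateFormat
     import java.util.{Date, TimeZone}

     object EventKeys {
       // Row keys look like "2013-04-05T00:00Z#id1": a UTC time bucket plus the record id
       private val bucketFormat = {
         val f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm'Z'")
         f.setTimeZone(TimeZone.getTimeZone("UTC"))
         f
       }

       def rowKey(timestamp: Date, id: String): String =
         bucketFormat.format(timestamp) + "#" + id

       // EventAttr CF column names look like "videoId:45678:t1": attribute, value, event slot
       def attrColumn(attr: String, value: String, slot: String): String =
         attr + ":" + value + ":" + slot
     }
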
  88-91. Unpacking raw events
     Wide rows in C*:
       2013-04-05T00:00Z#id1: t0 -> {video: 10, type: 5}, t1 -> {video: 11, type: 1}
       2013-04-05T00:00Z#id2: t0 -> {video: 20, type: 5}, t1 -> {video: 25, type: 9}
     become flat records:
       UserID  Video  Type
       id1     10     5
       id1     11     1
       id2     20     5
       id2     25     9
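
     A rough Scala sketch of that unpacking step, assuming each C* row arrives as a row key plus a sequence of event maps (in the real pipeline the rows come in through the custom InputFormat):

     // One wide row from the event store: row key plus its event columns
     case class EventRow(rowKey: String, events: Seq[Map[String, Int]])

     case class FlatEvent(userId: String, video: Int, eventType: Int)

     object Unpack {
       // "2013-04-05T00:00Z#id1" -> "id1"
       def userIdOf(rowKey: String): String = rowKey.split('#').last

       def unpack(rows: Seq[EventRow]): Seq[FlatEvent] =
         rows.flatMap { row =>
           row.events.map(e => FlatEvent(userIdOf(row.rowKey), e("video"), e("type")))
         }
     }

     // Unpack.unpack(Seq(EventRow("2013-04-05T00:00Z#id1",
     //   Seq(Map("video" -> 10, "type" -> 5), Map("video" -> 11, "type" -> 1)))))
     // yields FlatEvent(id1,10,5) and FlatEvent(id1,11,1), matching the table above
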
  92-96. Tips for InputFormat Development
     • Know which target platforms you are developing for
       - Which API to write against? New? Old? Both?
     • Be prepared to spend time tuning your split computation
       - Low-latency jobs require fast splits
     • Consider sorting row keys by token for data locality (see the sketch below)
     • Implement predicate pushdown for HIVE SerDes
       - Use your indexes to reduce the size of the dataset
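
     A simplified sketch of the locality idea only, not a real Hadoop InputFormat: group row keys by Cassandra token range and remember which replicas own each range, so each split can be scheduled next to its data. The token function and the range/replica lookup here are stand-ins:

     // Not Hadoop's InputFormat API - just the shape of split computation with locality hints
     case class TokenRange(start: Long, end: Long, replicas: Seq[String])
     case class Split(rowKeys: Seq[String], preferredHosts: Seq[String])

     object SplitSketch {
       // Stand-in for the partitioner's token (Cassandra would supply the real one)
       def tokenOf(rowKey: String): Long = rowKey.hashCode.toLong

       def computeSplits(rowKeys: Seq[String], ranges: Seq[TokenRange]): Seq[Split] = {
         // Sorting keys by token keeps each split contiguous on the ring, so a split
         // maps onto one token range and inherits that range's replicas as preferred hosts
         val sorted = rowKeys.sortBy(tokenOf)
         ranges.flatMap { r =>
           val keys = sorted.filter(k => tokenOf(k) >= r.start && tokenOf(k) < r.end)
           if (keys.isEmpty) None else Some(Split(keys, r.replicas))
         }
       }
     }
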
  97-101. Example: OLAP processing
     • C* events (e.g. 2013-04-05T00:00Z#id1 -> {video:10, type:5}) are read by Spark
     • Spark builds cached, materialized views of OLAP aggregates
     • A union of the cached views answers queries such as:
       - Query 1: Plays by Provider
       - Query 2: Top content for mobile
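
     Roughly what the union-then-query step could look like in Spark, assuming each cached view is an RDD of aggregate records; the field names are invented for the sketch:

     import org.apache.spark.SparkContext._
     import org.apache.spark.rdd.RDD

     // One aggregate record per (provider, video, device) bucket - hypothetical layout
     case class Agg(providerId: Int, videoId: Int, device: String, plays: Long)

     object OlapQueries {
       // Query 1: plays by provider, across all cached materialized views
       def playsByProvider(views: Seq[RDD[Agg]]): RDD[(Int, Long)] = {
         val all = views.reduce(_ union _)                    // union of the cached views
         all.map(a => (a.providerId, a.plays)).reduceByKey(_ + _)
       }

       // Query 2: top content for mobile (aggregates are small, so sorting on the driver is fine)
       def topMobileContent(views: Seq[RDD[Agg]], n: Int): Seq[(Int, Long)] = {
         val all = views.reduce(_ union _)
         all.filter(_.device == "mobile")
            .map(a => (a.videoId, a.plays))
            .reduceByKey(_ + _)
            .collect().sortBy(-_._2).take(n).toSeq
       }
     }
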
  102-105. Performance numbers (6-node C*/DSE 1.1.9 cluster, Spark 0.7.0)
     • Spark, C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds
     • C* -> OLAP aggregates, warmed cache: 20-30 seconds
     • OLAP aggregate query via Spark (56k records): 60 ms
  106-111. [Diagram] OLAP workflow
     • A REST Job Server fronts the Spark executors and Cassandra
     • An Aggregate request launches an Aggregation Job, which reads Cassandra and materializes a cached Dataset
     • Query requests launch Query Jobs against that Dataset and return Results; multiple query jobs can share the same dataset (sketched below)
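
     The job-server pattern, reduced to a sketch: an aggregation job publishes a named, cached RDD that later query jobs look up. The trait and registry below are invented for illustration and are not the job server's actual API:

     import org.apache.spark.SparkContext
     import org.apache.spark.SparkContext._
     import org.apache.spark.rdd.RDD
     import scala.collection.concurrent.TrieMap

     // Hypothetical job interface: each REST call would map onto one of these
     trait AnalyticsJob { def run(sc: SparkContext, args: Map[String, String]): Any }

     // Hypothetical registry of named, cached datasets shared between jobs in the same process
     object DatasetRegistry {
       private val datasets = TrieMap.empty[String, RDD[_]]
       def put(name: String, rdd: RDD[_]): Unit = { datasets.put(name, rdd) }
       def get[T](name: String): RDD[T] = datasets(name).asInstanceOf[RDD[T]]
     }

     class AggregationJob extends AnalyticsJob {
       def run(sc: SparkContext, args: Map[String, String]): Any = {
         // The real system reads Cassandra through the custom InputFormat; a text file stands in here
         val events = sc.textFile(args("input"))
         val aggregates = events.map(line => (line.take(10), 1L)).reduceByKey(_ + _).cache()
         DatasetRegistry.put(args("dataset"), aggregates)
         aggregates.count()   // force materialization into the cache
       }
     }

     class QueryJob extends AnalyticsJob {
       def run(sc: SparkContext, args: Map[String, String]): Any =
         DatasetRegistry.get[(String, Long)](args("dataset")).filter(_._1 == args("key")).collect()
     }
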
  112-117. Fault Tolerance
     • Cached dataset lives in the Java heap only - what if the process dies?
     • Spark lineage - automatic recomputation from source, but this is expensive!
     • Can also replicate the cached dataset to survive single-node failures
     • Persist materialized views back to C*, then load into cache - now the recovery path is much faster
     • Persistence also enables multiple processes to hold the cached dataset
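
     Two of those ideas expressed against the ordinary Spark API, as a sketch; the paths are placeholders and the real system writes the views back to Cassandra rather than to files:

     import org.apache.spark.SparkContext
     import org.apache.spark.SparkContext._
     import org.apache.spark.storage.StorageLevel

     object RecoverySketch {
       def build(sc: SparkContext): Unit = {
         val views = sc.textFile("hdfs://.../events")
           .map(line => (line.take(10), 1L))
           .reduceByKey(_ + _)

         // Replicate the cached dataset: two in-memory copies survive a single node failure
         views.persist(StorageLevel.MEMORY_ONLY_2)

         // Persist the materialized view to durable storage (C* in the real system),
         // so recovery is a reload rather than recomputation from raw events via lineage
         views.map { case (k, v) => k + "," + v }.saveAsTextFile("hdfs://.../views")
       }

       def recover(sc: SparkContext) = {
         val reloaded = sc.textFile("hdfs://.../views").map { line =>
           val Array(k, v) = line.split(",")
           (k, v.toLong)
         }
         reloaded.cache()   // back in memory without touching the raw event store
       }
     }
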
  118. Demo time
  119. Shark Demo
     • Local Shark node, 1 core, MBP
     • How to create a table from C* using our InputFormat
     • Creating a cached Shark table
     • Running fast queries (a rough sketch of these steps follows below)
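
     A rough sketch of what those demo steps might look like; the SharkContext sql(...) call is assumed to behave like Shark's 0.7-era Scala API, and the storage handler class and table layout are placeholders, not the actual demo code:

     import shark.SharkContext

     object SharkDemoSketch {
       def run(sc: SharkContext): Unit = {
         // 1) Expose the C* event data as an external Hive table via the custom InputFormat/SerDe
         //    (the storage handler class name below is a placeholder)
         sc.sql("""CREATE EXTERNAL TABLE events (userid STRING, video INT, type INT)
                   STORED BY 'com.example.CassandraEventStorageHandler'""")

         // 2) Materialize it as an in-memory cached Shark table
         sc.sql("CREATE TABLE events_cached AS SELECT * FROM events")

         // 3) Fast queries hit the cached copy rather than Cassandra
         sc.sql("SELECT video, COUNT(*) FROM events_cached GROUP BY video").foreach(println)
       }
     }
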
  120. Creating a Shark Table from InputFormat
  121. Creating a cached table
  122. Querying cached table
  123-127. THANK YOU
     • @evanfchan
     • ev@ooyala.com
     • WE ARE HIRING!!
  128. [Diagram] Spark: under the hood - a Driver coordinates map/reduce stages over cached Datasets on each node; one executor process per node
