Real-time Analytics with Cassandra, Spark, and Shark

Evan Chan
Evan ChanSoftware Engineer at Apple
Real-time Analytics with
Cassandra, Spark and Shark
Tuesday, June 18, 13
Who is this guy
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of C*, Kafka,
Storm, etc.
• Scala/Akka guy
• Very excited by open source, big data projects - share some today
• @evanfchan
Tuesday, June 18, 13
Agenda
Tuesday, June 18, 13
Agenda
• Ooyala and Cassandra
Tuesday, June 18, 13
Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
Tuesday, June 18, 13
Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
Tuesday, June 18, 13
Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
• Our Spark/Cassandra Architecture
Tuesday, June 18, 13
Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
• Our Spark/Cassandra Architecture
• Demo
Tuesday, June 18, 13
Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra
Tuesday, June 18, 13
CONFIDENTIAL—DO NOT DISTRIBUTE
OOYALA
Powering personalized video
experiences across all screens.
5
Tuesday, June 18, 13
CONFIDENTIAL—DO NOT DISTRIBUTE 6CONFIDENTIAL—DO NOT DISTRIBUTE
Founded in 2007
Commercially launch in 2009
230+ employees in Silicon Valley, LA, NYC,
London, Paris, Tokyo, Sydney & Guadalajara
Global footprint, 200M unique users,
110+ countries, and more than 6,000 websites
Over 1 billion videos played per month
and 2 billion analytic events per day
25% of U.S. online viewers watch video
powered by Ooyala
COMPANY OVERVIEW
Tuesday, June 18, 13
CONFIDENTIAL—DO NOT DISTRIBUTE 7
TRUSTED VIDEO PARTNER
STRATEGIC PARTNERS
CUSTOMERS
CONFIDENTIAL—DO NOT DISTRIBUTE
Tuesday, June 18, 13
We are a large Cassandra user
Tuesday, June 18, 13
We are a large Cassandra user
• 11 clusters ranging in size from 3 to 36 nodes
Tuesday, June 18, 13
We are a large Cassandra user
• 11 clusters ranging in size from 3 to 36 nodes
• Total of 28TB of data managed over ~85 nodes
Tuesday, June 18, 13
We are a large Cassandra user
• 11 clusters ranging in size from 3 to 36 nodes
• Total of 28TB of data managed over ~85 nodes
• Over 2 billion C* column writes per day
Tuesday, June 18, 13
We are a large Cassandra user
• 11 clusters ranging in size from 3 to 36 nodes
• Total of 28TB of data managed over ~85 nodes
• Over 2 billion C* column writes per day
• Powers all of our analytics infrastructure
Tuesday, June 18, 13
We are a large Cassandra user
• 11 clusters ranging in size from 3 to 36 nodes
• Total of 28TB of data managed over ~85 nodes
• Over 2 billion C* column writes per day
• Powers all of our analytics infrastructure
• Much much bigger cluster coming soon
Tuesday, June 18, 13
What problem are we trying to
solve?
Lots of data, complex queries, answered really quickly... but how??
Tuesday, June 18, 13
From mountains of useless data...
Tuesday, June 18, 13
To nuggets of truth...
Tuesday, June 18, 13
To nuggets of truth...
• Quickly
Tuesday, June 18, 13
To nuggets of truth...
• Quickly
• Painlessly
Tuesday, June 18, 13
To nuggets of truth...
• Quickly
• Painlessly
• At scale?
Tuesday, June 18, 13
Today: Precomputed aggregates
Tuesday, June 18, 13
Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
Tuesday, June 18, 13
Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
Tuesday, June 18, 13
Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
Tuesday, June 18, 13
Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
– Top content for mobile users in France
– Engagement curves for users who watched recommendations
– Data mining, trends, machine learning
Tuesday, June 18, 13
The static - dynamic continuum
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most
common queries
• Always compute results
from raw data
• Flexible but slow
100% Precomputation 100% Dynamic
Tuesday, June 18, 13
The static - dynamic continuum
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most
common queries
• Always compute results
from raw data
• Flexible but slow
100% Precomputation100% Dynamic
Tuesday, June 18, 13
Where we want to be
Partly dynamic
• Pre-aggregate most
common queries
• Flexible, fast dynamic
queries
• Easily generate many
materialized views
Tuesday, June 18, 13
Industry Trends
Tuesday, June 18, 13
Industry Trends
• Fast execution frameworks
– Impala
Tuesday, June 18, 13
Industry Trends
• Fast execution frameworks
– Impala
• In-memory databases
– VoltDB, Druid
Tuesday, June 18, 13
Industry Trends
• Fast execution frameworks
– Impala
• In-memory databases
– VoltDB, Druid
• Streaming and real-time
Tuesday, June 18, 13
Industry Trends
• Fast execution frameworks
– Impala
• In-memory databases
– VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data frameworks
– Cascading, Hive, Pig
Tuesday, June 18, 13
Why Spark and Shark?
“Lightning-fast in-memory cluster computing”
Tuesday, June 18, 13
Introduction to Spark
Tuesday, June 18, 13
Introduction to Spark
• In-memory distributed computing framework
Tuesday, June 18, 13
Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
Tuesday, June 18, 13
Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
– Iterative algorithms (machine learning)
– Interactive data mining
Tuesday, June 18, 13
Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
– Iterative algorithms (machine learning)
– Interactive data mining
• More general purpose than Hadoop MR
Tuesday, June 18, 13
Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
– Iterative algorithms (machine learning)
– Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~ 15 companies
Tuesday, June 18, 13
HDFS
Map
Reduce
Map
Reduce
Tuesday, June 18, 13
HDFS
Map
Reduce
Map
Reduce
Tuesday, June 18, 13
HDFS
Map
Reduce
Map
Reduce
Data Source
map()
join()
Source 2
Tuesday, June 18, 13
HDFS
Map
Reduce
Map
Reduce
Data Source
map()
join()
Source 2
cache()
Tuesday, June 18, 13
HDFS
Map
Reduce
Map
Reduce
Data Source
map()
join()
Source 2
cache()
transform
Tuesday, June 18, 13
Throughput: Memory is king
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Throughput: Memory is king
0 37500 75000 112500 150000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Throughput: Memory is king
0 37500 75000 112500 150000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Throughput: Memory is king
0 37500 75000 112500 150000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Throughput: Memory is king
0 37500 75000 112500 150000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Developers love it
Tuesday, June 18, 13
Developers love it
• “I wrote my first aggregation job in 30 minutes”
Tuesday, June 18, 13
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
Tuesday, June 18, 13
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
Tuesday, June 18, 13
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
Tuesday, June 18, 13
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
Tuesday, June 18, 13
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
• EASY testing!!
Tuesday, June 18, 13
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
• EASY testing!!
• Low latency - quick development cycles
Tuesday, June 18, 13
Spark word count example
1 package org.myorg;
2
3 import java.io.IOException;
4 import java.util.*;
5
6 import org.apache.hadoop.fs.Path;
7 import org.apache.hadoop.conf.*;
8 import org.apache.hadoop.io.*;
9 import org.apache.hadoop.mapreduce.*;
10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
14
15 public class WordCount {
16
17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
18 private final static IntWritable one = new IntWritable(1);
19 private Text word = new Text();
20
21 public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
22 String line = value.toString();
23 StringTokenizer tokenizer = new StringTokenizer(line);
24 while (tokenizer.hasMoreTokens()) {
25 word.set(tokenizer.nextToken());
26 context.write(word, one);
27 }
28 }
29 }
30
31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
32
33 public void reduce(Text key, Iterable<IntWritable> values, Context context)
34 throws IOException, InterruptedException {
35 int sum = 0;
36 for (IntWritable val : values) {
37 sum += val.get();
38 }
39 context.write(key, new IntWritable(sum));
40 }
41 }
42
43 public static void main(String[] args) throws Exception {
44 Configuration conf = new Configuration();
45
46 Job job = new Job(conf, "wordcount");
47
48 job.setOutputKeyClass(Text.class);
49 job.setOutputValueClass(IntWritable.class);
50
51 job.setMapperClass(Map.class);
52 job.setReducerClass(Reduce.class);
53
54 job.setInputFormatClass(TextInputFormat.class);
55 job.setOutputFormatClass(TextOutputFormat.class);
56
57 FileInputFormat.addInputPath(job, new Path(args[0]));
58 FileOutputFormat.setOutputPath(job, new Path(args[1]));
59
60 job.waitForCompletion(true);
61 }
62
63 }
Tuesday, June 18, 13
Spark word count example
file = spark.textFile("hdfs://...")
 
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
1 package org.myorg;
2
3 import java.io.IOException;
4 import java.util.*;
5
6 import org.apache.hadoop.fs.Path;
7 import org.apache.hadoop.conf.*;
8 import org.apache.hadoop.io.*;
9 import org.apache.hadoop.mapreduce.*;
10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
14
15 public class WordCount {
16
17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
18 private final static IntWritable one = new IntWritable(1);
19 private Text word = new Text();
20
21 public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
22 String line = value.toString();
23 StringTokenizer tokenizer = new StringTokenizer(line);
24 while (tokenizer.hasMoreTokens()) {
25 word.set(tokenizer.nextToken());
26 context.write(word, one);
27 }
28 }
29 }
30
31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
32
33 public void reduce(Text key, Iterable<IntWritable> values, Context context)
34 throws IOException, InterruptedException {
35 int sum = 0;
36 for (IntWritable val : values) {
37 sum += val.get();
38 }
39 context.write(key, new IntWritable(sum));
40 }
41 }
42
43 public static void main(String[] args) throws Exception {
44 Configuration conf = new Configuration();
45
46 Job job = new Job(conf, "wordcount");
47
48 job.setOutputKeyClass(Text.class);
49 job.setOutputValueClass(IntWritable.class);
50
51 job.setMapperClass(Map.class);
52 job.setReducerClass(Reduce.class);
53
54 job.setInputFormatClass(TextInputFormat.class);
55 job.setOutputFormatClass(TextOutputFormat.class);
56
57 FileInputFormat.addInputPath(job, new Path(args[0]));
58 FileOutputFormat.setOutputPath(job, new Path(args[1]));
59
60 job.waitForCompletion(true);
61 }
62
63 }
Tuesday, June 18, 13
The Spark Ecosystem
Spark
Tachyon - in-memory caching DFS
Tuesday, June 18, 13
The Spark Ecosystem
Bagel -
Pregel on
Spark
Spark
Tachyon - in-memory caching DFS
Tuesday, June 18, 13
The Spark Ecosystem
Bagel -
Pregel on
Spark
HIVE on Spark
Spark
Tachyon - in-memory caching DFS
Tuesday, June 18, 13
The Spark Ecosystem
Bagel -
Pregel on
Spark
HIVE on Spark
Spark Streaming -
discretized stream
processing
Spark
Tachyon - in-memory caching DFS
Tuesday, June 18, 13
Shark - HIVE on Spark
Tuesday, June 18, 13
Shark - HIVE on Spark
• 100% HiveQL compatible
Tuesday, June 18, 13
Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
Tuesday, June 18, 13
Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
Tuesday, June 18, 13
Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
Tuesday, June 18, 13
Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
• Easy Scala/Java integration via Spark - easier than
writing UDFs
Tuesday, June 18, 13
Our new analytics architecture
How we integrate Cassandra and Spark/Shark
Tuesday, June 18, 13
From raw events to fast queries
Raw
Events
Raw
Events
Raw
Events
Tuesday, June 18, 13
From raw events to fast queries
Ingestion
C*
event
store
Raw
Events
Raw
Events
Raw
Events
Tuesday, June 18, 13
From raw events to fast queries
Ingestion
C*
event
store
Raw
Events
Raw
Events
Raw
Events
Spark
Spark
Spark
View 1
View 2
View 3
Tuesday, June 18, 13
From raw events to fast queries
Ingestion
C*
event
store
Raw
Events
Raw
Events
Raw
Events
Spark
Spark
Spark
View 1
View 2
View 3
Spark
Predefined
queries
Tuesday, June 18, 13
From raw events to fast queries
Ingestion
C*
event
store
Raw
Events
Raw
Events
Raw
Events
Spark
Spark
Spark
View 1
View 2
View 3
Spark
Shark
Predefined
queries
Ad-hoc
HiveQL
Tuesday, June 18, 13
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
Node2
Cassandra
Node3
Cassandra
Tuesday, June 18, 13
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
Node2
Cassandra
InputFormat
SerDe
Node3
Cassandra
InputFormat
SerDe
Tuesday, June 18, 13
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node2
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node3
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Tuesday, June 18, 13
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node2
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node3
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Spark Master
Tuesday, June 18, 13
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node2
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Node3
Cassandra
InputFormat
SerDe
Spark
Worker
Shark
Spark Master Job Server
Tuesday, June 18, 13
Event Store Cassandra schema
t0 t1 t2 t3 t4
2013-04-05
T00:00Z#id1
{event0:
a0}
{event1:
a1}
{event2:
a2}
{event3:
a3}
{event4:
a4}
Event CF
Tuesday, June 18, 13
Event Store Cassandra schema
t0 t1 t2 t3 t4
2013-04-05
T00:00Z#id1
{event0:
a0}
{event1:
a1}
{event2:
a2}
{event3:
a3}
{event4:
a4}
ipaddr:10.20.30.40:t1 videoId:45678:t1 providerId:500:t0
2013-04-05
T00:00Z#id1
Event CF
EventAttr CF
Tuesday, June 18, 13
Unpacking raw events
t0 t1
2013-04-05
T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05
T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Type
id1 10 5
Tuesday, June 18, 13
Unpacking raw events
t0 t1
2013-04-05
T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05
T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Type
id1 10 5
id1 11 1
Tuesday, June 18, 13
Unpacking raw events
t0 t1
2013-04-05
T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05
T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Type
id1 10 5
id1 11 1
id2 20 5
Tuesday, June 18, 13
Unpacking raw events
t0 t1
2013-04-05
T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05
T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Type
id1 10 5
id1 11 1
id2 20 5
id2 25 9
Tuesday, June 18, 13
Tips for InputFormat Development
Tuesday, June 18, 13
Tips for InputFormat Development
• Know which target platforms you are developing for
– Which API to write against? New? Old? Both?
Tuesday, June 18, 13
Tips for InputFormat Development
• Know which target platforms you are developing for
– Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
– Low latency jobs require fast splits
Tuesday, June 18, 13
Tips for InputFormat Development
• Know which target platforms you are developing for
– Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
– Low latency jobs require fast splits
• Consider sorting row keys by token for data locality
Tuesday, June 18, 13
Tips for InputFormat Development
• Know which target platforms you are developing for
– Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
– Low latency jobs require fast splits
• Consider sorting row keys by token for data locality
• Implement predicate pushdown for HIVE SerDe’s
– Use your indexes to reduce size of dataset
Tuesday, June 18, 13
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:
00Z#id2
{video:
20,
type:5}
C* events
Tuesday, June 18, 13
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:
00Z#id2
{video:
20,
type:5}
C* events
OLAP
Aggregates
OLAP
Aggregates
OLAP
Aggregates
Cached Materialized Views
Spark
Spark
Spark
Tuesday, June 18, 13
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:
00Z#id2
{video:
20,
type:5}
C* events
OLAP
Aggregates
OLAP
Aggregates
OLAP
Aggregates
Cached Materialized Views
Spark
Spark
Spark
Union
Tuesday, June 18, 13
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:
00Z#id2
{video:
20,
type:5}
C* events
OLAP
Aggregates
OLAP
Aggregates
OLAP
Aggregates
Cached Materialized Views
Spark
Spark
Spark
Union
Query 1: Plays
by Provider
Tuesday, June 18, 13
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:
00Z#id2
{video:
20,
type:5}
C* events
OLAP
Aggregates
OLAP
Aggregates
OLAP
Aggregates
Cached Materialized Views
Spark
Spark
Spark
Union
Query 1: Plays
by Provider
Query 2: Top
content for
mobile
Tuesday, June 18, 13
Performance numbers
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Performance numbers
Spark: C* -> OLAP aggregates
cold cache, 1.4 million events
130 seconds
C* -> OLAP aggregates
warmed cache
20-30 seconds
OLAP aggregate query via Spark
(56k records)
60 ms
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Performance numbers
Spark: C* -> OLAP aggregates
cold cache, 1.4 million events
130 seconds
C* -> OLAP aggregates
warmed cache
20-30 seconds
OLAP aggregate query via Spark
(56k records)
60 ms
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
Performance numbers
Spark: C* -> OLAP aggregates
cold cache, 1.4 million events
130 seconds
C* -> OLAP aggregates
warmed cache
20-30 seconds
OLAP aggregate query via Spark
(56k records)
60 ms
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
OLAP WorkFlow
Aggregation JobSpark
Executors
REST Job Server
Aggregate
Tuesday, June 18, 13
OLAP WorkFlow
Aggregation JobSpark
Executors
Cassandra
REST Job Server
Aggregate
Tuesday, June 18, 13
OLAP WorkFlow
DatasetAggregation JobSpark
Executors
Cassandra
REST Job Server
Aggregate
Tuesday, June 18, 13
OLAP WorkFlow
DatasetAggregation Job Query JobSpark
Executors
Cassandra
REST Job Server
Aggregate Query
Tuesday, June 18, 13
OLAP WorkFlow
DatasetAggregation Job Query JobSpark
Executors
Cassandra
REST Job Server
Aggregate Query
Result
Tuesday, June 18, 13
OLAP WorkFlow
DatasetAggregation Job Query JobSpark
Executors
Cassandra
REST Job Server
Query Job
Aggregate Query
Result
Query
Result
Tuesday, June 18, 13
Fault Tolerance
Tuesday, June 18, 13
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
Tuesday, June 18, 13
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is
expensive!
Tuesday, June 18, 13
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is
expensive!
• Can also replicate cached dataset to survive single node failures
Tuesday, June 18, 13
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is
expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now
recovery path is much faster
Tuesday, June 18, 13
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is
expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now
recovery path is much faster
• Persistence also enables multiple processes to hold cached dataset
Tuesday, June 18, 13
Demo time
Tuesday, June 18, 13
Shark Demo
• Local shark node, 1 core, MBP
• How to create a table from C* using our inputformat
• Creating a cached Shark table
• Running fast queries
Tuesday, June 18, 13
Creating a Shark Table from InputFormat
Tuesday, June 18, 13
Creating a cached table
Tuesday, June 18, 13
Querying cached table
Tuesday, June 18, 13
THANK YOU
Tuesday, June 18, 13
THANK YOU
• @evanfchan
Tuesday, June 18, 13
THANK YOU
• @evanfchan
• ev@ooyala.com
Tuesday, June 18, 13
THANK YOU
• @evanfchan
• ev@ooyala.com
Tuesday, June 18, 13
THANK YOU
• @evanfchan
• ev@ooyala.com
• WE ARE HIRING!!
Tuesday, June 18, 13
Spark: Under the hood
Map DatasetReduce Map
Driver Map DatasetReduce Map
Map DatasetReduce Map
One executor process per node
Driver
Tuesday, June 18, 13
1 of 128

Recommended

Performance Troubleshooting Using Apache Spark Metrics by
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
3.5K views47 slides
Real-time Analytics with Presto and Apache Pinot by
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotXiang Fu
560 views6 slides
How Apache Drives Music Recommendations At Spotify by
How Apache Drives Music Recommendations At SpotifyHow Apache Drives Music Recommendations At Spotify
How Apache Drives Music Recommendations At SpotifyJosh Baer
5.8K views37 slides
Integrating Existing C++ Libraries into PySpark with Esther Kundin by
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinDatabricks
4.5K views44 slides
Tag based policies using Apache Atlas and Ranger by
Tag based policies using Apache Atlas and RangerTag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and RangerVimal Sharma
1.7K views17 slides
Hands-on with data visualization in Kibana by
Hands-on with data visualization in KibanaHands-on with data visualization in Kibana
Hands-on with data visualization in KibanaElasticsearch
3.1K views18 slides

More Related Content

What's hot

Introduction to memcached by
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
70.9K views77 slides
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the... by
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
32.8K views36 slides
HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs by
HFile: A Block-Indexed File Format to Store Sorted Key-Value PairsHFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
HFile: A Block-Indexed File Format to Store Sorted Key-Value PairsSchubert Zhang
8.9K views12 slides
Integration of HIve and HBase by
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
25.2K views39 slides
File Format Benchmarks - Avro, JSON, ORC, & Parquet by
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
101.8K views40 slides
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax by
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. SaxIntroducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. SaxDatabricks
1.7K views33 slides

What's hot(20)

HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs by Schubert Zhang
HFile: A Block-Indexed File Format to Store Sorted Key-Value PairsHFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
Schubert Zhang8.9K views
Integration of HIve and HBase by Hortonworks
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
Hortonworks25.2K views
File Format Benchmarks - Avro, JSON, ORC, & Parquet by Owen O'Malley
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley101.8K views
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax by Databricks
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. SaxIntroducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
Databricks1.7K views
Exactly, ownCloud, Archivematica, Arkivum by Jisc RDM
Exactly, ownCloud, Archivematica, ArkivumExactly, ownCloud, Archivematica, Arkivum
Exactly, ownCloud, Archivematica, Arkivum
Jisc RDM1.5K views
How Adobe Does 2 Million Records Per Second Using Apache Spark! by Databricks
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Databricks614 views
Smart Join Algorithms for Fighting Skew at Scale by Databricks
Smart Join Algorithms for Fighting Skew at ScaleSmart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at Scale
Databricks1.6K views
HBase and Hadoop at Adobe by Cosmin Lehene
HBase and Hadoop at AdobeHBase and Hadoop at Adobe
HBase and Hadoop at Adobe
Cosmin Lehene6.2K views
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5 by Cloudera, Inc.
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.23.2K views
How we solved Real-time User Segmentation using HBase by DataWorks Summit
How we solved Real-time User Segmentation using HBaseHow we solved Real-time User Segmentation using HBase
How we solved Real-time User Segmentation using HBase
DataWorks Summit11.4K views
Data Vault Vs Data Lake by Calum Miller
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data Lake
Calum Miller1.7K views
S3 整合性モデルと Hadoop/Spark の話 by Noritaka Sekiyama
S3 整合性モデルと Hadoop/Spark の話S3 整合性モデルと Hadoop/Spark の話
S3 整合性モデルと Hadoop/Spark の話
Noritaka Sekiyama3.2K views
Automate and Optimize Data Warehouse Migration to Snowflake by Impetus Technologies
Automate and Optimize Data Warehouse Migration to SnowflakeAutomate and Optimize Data Warehouse Migration to Snowflake
Automate and Optimize Data Warehouse Migration to Snowflake
Spark and S3 with Ryan Blue by Databricks
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
Databricks3.9K views
Data Modeling & Metadata for Graph Databases by DATAVERSITY
Data Modeling & Metadata for Graph DatabasesData Modeling & Metadata for Graph Databases
Data Modeling & Metadata for Graph Databases
DATAVERSITY3.6K views

Viewers also liked

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
83.3K views77 slides
Distributed Processing of Stream Text Mining by
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningLi Miao
1K views41 slides
Science in text mining by
Science in text miningScience in text mining
Science in text miningTanay Chowdhury
1.5K views19 slides
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305 by
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305mjfrankli
2.7K views38 slides
Advanced Analytics in Hadoop by
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in HadoopAnalyticsWeek
3.3K views44 slides
Talk About Apache Cassandra by
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache CassandraJacky Chu
3.3K views31 slides

Viewers also liked(20)

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by Helena Edelson
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson83.3K views
Distributed Processing of Stream Text Mining by Li Miao
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text Mining
Li Miao1K views
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305 by mjfrankli
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
mjfrankli2.7K views
Advanced Analytics in Hadoop by AnalyticsWeek
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in Hadoop
AnalyticsWeek3.3K views
Talk About Apache Cassandra by Jacky Chu
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
Jacky Chu3.3K views
Apache Cassandra training. Overview and Basics by Oleg Magazov
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
Oleg Magazov2K views
Apache Cassandra by Sperasoft
Apache CassandraApache Cassandra
Apache Cassandra
Sperasoft1.4K views
Apache Cassandra overview by ElifTech
Apache Cassandra overviewApache Cassandra overview
Apache Cassandra overview
ElifTech932 views
Big data analytics with Spark & Cassandra by Matthias Niehoff
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
Matthias Niehoff2.7K views
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict... by Databricks
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Databricks6.4K views
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016 by DataStax
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
DataStax10.7K views
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa... by Spark Summit
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Spark Summit6.7K views
Cassandra at NoSql Matters 2012 by jbellis
Cassandra at NoSql Matters 2012Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012
jbellis5.2M views
Introduction to Cassandra Basics by nickmbailey
Introduction to Cassandra BasicsIntroduction to Cassandra Basics
Introduction to Cassandra Basics
nickmbailey14.7K views
Indexing in Cassandra by Ed Anuff
Indexing in CassandraIndexing in Cassandra
Indexing in Cassandra
Ed Anuff26.6K views
How Do I Cassandra? by Rick Branson
How Do I Cassandra?How Do I Cassandra?
How Do I Cassandra?
Rick Branson37.7K views
Understanding Data Partitioning and Replication in Apache Cassandra by DataStax
Understanding Data Partitioning and Replication in Apache CassandraUnderstanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache Cassandra
DataStax20.1K views
Interactive workflow management using Azkaban by datamantra
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra3.2K views
Introduction to Apache Cassandra by Robert Stupp
Introduction to Apache CassandraIntroduction to Apache Cassandra
Introduction to Apache Cassandra
Robert Stupp13.8K views

Similar to Real-time Analytics with Cassandra, Spark, and Shark

Scaling postgres by
Scaling postgresScaling postgres
Scaling postgresDenish Patel
21.5K views289 slides
Building Antifragile Applications with Apache Cassandra by
Building Antifragile Applications with Apache CassandraBuilding Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache CassandraPatrick McFadin
4K views37 slides
Finding sensitive information in text data by
Finding sensitive information in text dataFinding sensitive information in text data
Finding sensitive information in text dataInfinIT - Innovationsnetværket for it
583 views24 slides
AppEngine Performance Tuning by
AppEngine Performance TuningAppEngine Performance Tuning
AppEngine Performance TuningDavid Chen
6.6K views62 slides
Cloud east shutl_talk by
Cloud east shutl_talkCloud east shutl_talk
Cloud east shutl_talkVolker Pacher
461 views74 slides
Cassandra at scale by
Cassandra at scaleCassandra at scale
Cassandra at scalePatrick McFadin
3K views17 slides

Similar to Real-time Analytics with Cassandra, Spark, and Shark(20)

Scaling postgres by Denish Patel
Scaling postgresScaling postgres
Scaling postgres
Denish Patel21.5K views
Building Antifragile Applications with Apache Cassandra by Patrick McFadin
Building Antifragile Applications with Apache CassandraBuilding Antifragile Applications with Apache Cassandra
Building Antifragile Applications with Apache Cassandra
Patrick McFadin4K views
AppEngine Performance Tuning by David Chen
AppEngine Performance TuningAppEngine Performance Tuning
AppEngine Performance Tuning
David Chen6.6K views
What do you mean, backwards compatibility? by Trisha Gee
What do you mean, backwards compatibility?What do you mean, backwards compatibility?
What do you mean, backwards compatibility?
Trisha Gee1.7K views
Elastic at Procter & Gamble: A Network Story by Elasticsearch
Elastic at Procter & Gamble: A Network StoryElastic at Procter & Gamble: A Network Story
Elastic at Procter & Gamble: A Network Story
Elasticsearch3.8K views
Painful Success - Lessons Learned while Scaling Up by C4Media
Painful Success - Lessons Learned while Scaling UpPainful Success - Lessons Learned while Scaling Up
Painful Success - Lessons Learned while Scaling Up
C4Media703 views
dataviz on d3.js + elasticsearch by Mathieu Elie
dataviz on d3.js + elasticsearchdataviz on d3.js + elasticsearch
dataviz on d3.js + elasticsearch
Mathieu Elie5.9K views
Hotspot Garbage Collection - Tuning Guide by jClarity
Hotspot Garbage Collection - Tuning GuideHotspot Garbage Collection - Tuning Guide
Hotspot Garbage Collection - Tuning Guide
jClarity9.1K views
Ab(Using) the MetaCPAN API for Fun and Profit v2013 by Olaf Alders
Ab(Using) the MetaCPAN API for Fun and Profit v2013Ab(Using) the MetaCPAN API for Fun and Profit v2013
Ab(Using) the MetaCPAN API for Fun and Profit v2013
Olaf Alders5.6K views
Structure Data 2014: INVERTING 80/20: BEYOND BESPOKE BIG DATA, Ari Gesher by Gigaom
Structure Data 2014: INVERTING 80/20: BEYOND BESPOKE BIG DATA, Ari GesherStructure Data 2014: INVERTING 80/20: BEYOND BESPOKE BIG DATA, Ari Gesher
Structure Data 2014: INVERTING 80/20: BEYOND BESPOKE BIG DATA, Ari Gesher
Gigaom964 views
Proud to be polyglot! by NLJUG
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
NLJUG636 views
Intridea ajn-rttos OA NYC Summit by Open Analytics
Intridea ajn-rttos OA NYC SummitIntridea ajn-rttos OA NYC Summit
Intridea ajn-rttos OA NYC Summit
Open Analytics4.4K views
Big Data — Your new best friend by Reuven Lerner
Big Data — Your new best friendBig Data — Your new best friend
Big Data — Your new best friend
Reuven Lerner615 views
Scaling PHP to 40 Million Uniques by Jonathan Klein
Scaling PHP to 40 Million UniquesScaling PHP to 40 Million Uniques
Scaling PHP to 40 Million Uniques
Jonathan Klein7.3K views
RIPE NCC Services & Activities - NaMeX Regional Meeting by RIPE NCC
RIPE NCC Services & Activities - NaMeX Regional MeetingRIPE NCC Services & Activities - NaMeX Regional Meeting
RIPE NCC Services & Activities - NaMeX Regional Meeting
RIPE NCC604 views
Wsrest 2013 by Caelum
Wsrest 2013Wsrest 2013
Wsrest 2013
Caelum2.1K views

More from Evan Chan

Porting a Streaming Pipeline from Scala to Rust by
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
7 views38 slides
Designing Stateful Apps for Cloud and Kubernetes by
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
222 views33 slides
Histograms at scale - Monitorama 2019 by
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
1.5K views36 slides
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale by
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
1.1K views52 slides
Building a High-Performance Database with Scala, Akka, and Spark by
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
2.8K views29 slides
700 Updatable Queries Per Second: Spark as a Real-Time Web Service by
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
993 views41 slides

More from Evan Chan(16)

Porting a Streaming Pipeline from Scala to Rust by Evan Chan
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan7 views
Designing Stateful Apps for Cloud and Kubernetes by Evan Chan
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
Evan Chan222 views
Histograms at scale - Monitorama 2019 by Evan Chan
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
Evan Chan1.5K views
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale by Evan Chan
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan1.1K views
Building a High-Performance Database with Scala, Akka, and Spark by Evan Chan
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan2.8K views
700 Updatable Queries Per Second: Spark as a Real-Time Web Service by Evan Chan
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan993 views
Building Scalable Data Pipelines - 2016 DataPalooza Seattle by Evan Chan
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan5.7K views
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark by Evan Chan
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan5.8K views
Breakthrough OLAP performance with Cassandra and Spark by Evan Chan
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan8.8K views
Productionizing Spark and the Spark Job Server by Evan Chan
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan16.8K views
Akka in Production - ScalaDays 2015 by Evan Chan
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan53.3K views
MIT lecture - Socrata Open Data Architecture by Evan Chan
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
Evan Chan3K views
OLAP with Cassandra and Spark by Evan Chan
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
Evan Chan20K views
Spark Summit 2014: Spark Job Server Talk by Evan Chan
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
Evan Chan4.4K views
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14) by Evan Chan
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Evan Chan11.8K views
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark by Evan Chan
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Evan Chan11.1K views

Recently uploaded

TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensorssugiuralab
19 views15 slides
PRODUCT PRESENTATION.pptx by
PRODUCT PRESENTATION.pptxPRODUCT PRESENTATION.pptx
PRODUCT PRESENTATION.pptxangelicacueva6
14 views1 slide
Scaling Knowledge Graph Architectures with AI by
Scaling Knowledge Graph Architectures with AIScaling Knowledge Graph Architectures with AI
Scaling Knowledge Graph Architectures with AIEnterprise Knowledge
30 views15 slides
Design Driven Network Assurance by
Design Driven Network AssuranceDesign Driven Network Assurance
Design Driven Network AssuranceNetwork Automation Forum
15 views42 slides
Case Study Copenhagen Energy and Business Central.pdf by
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdfAitana
16 views3 slides
Kyo - Functional Scala 2023.pdf by
Kyo - Functional Scala 2023.pdfKyo - Functional Scala 2023.pdf
Kyo - Functional Scala 2023.pdfFlavio W. Brasil
368 views92 slides

Recently uploaded(20)

TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab19 views
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma39 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman33 views

Real-time Analytics with Cassandra, Spark, and Shark

  • 1. Real-time Analytics with Cassandra, Spark and Shark Tuesday, June 18, 13
  • 2. Who is this guy • Staff Engineer, Compute and Data Services, Ooyala • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc. • Scala/Akka guy • Very excited by open source, big data projects - share some today • @evanfchan Tuesday, June 18, 13
  • 4. Agenda • Ooyala and Cassandra Tuesday, June 18, 13
  • 5. Agenda • Ooyala and Cassandra • What problem are we trying to solve? Tuesday, June 18, 13
  • 6. Agenda • Ooyala and Cassandra • What problem are we trying to solve? • Spark and Shark Tuesday, June 18, 13
  • 7. Agenda • Ooyala and Cassandra • What problem are we trying to solve? • Spark and Shark • Our Spark/Cassandra Architecture Tuesday, June 18, 13
  • 8. Agenda • Ooyala and Cassandra • What problem are we trying to solve? • Spark and Shark • Our Spark/Cassandra Architecture • Demo Tuesday, June 18, 13
  • 9. Cassandra at Ooyala Who is Ooyala, and how we use Cassandra Tuesday, June 18, 13
  • 10. CONFIDENTIAL—DO NOT DISTRIBUTE OOYALA Powering personalized video experiences across all screens. 5 Tuesday, June 18, 13
  • 11. CONFIDENTIAL—DO NOT DISTRIBUTE 6CONFIDENTIAL—DO NOT DISTRIBUTE Founded in 2007 Commercially launch in 2009 230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara Global footprint, 200M unique users, 110+ countries, and more than 6,000 websites Over 1 billion videos played per month and 2 billion analytic events per day 25% of U.S. online viewers watch video powered by Ooyala COMPANY OVERVIEW Tuesday, June 18, 13
  • 12. CONFIDENTIAL—DO NOT DISTRIBUTE 7 TRUSTED VIDEO PARTNER STRATEGIC PARTNERS CUSTOMERS CONFIDENTIAL—DO NOT DISTRIBUTE Tuesday, June 18, 13
  • 13. We are a large Cassandra user Tuesday, June 18, 13
  • 14. We are a large Cassandra user • 11 clusters ranging in size from 3 to 36 nodes Tuesday, June 18, 13
  • 15. We are a large Cassandra user • 11 clusters ranging in size from 3 to 36 nodes • Total of 28TB of data managed over ~85 nodes Tuesday, June 18, 13
  • 16. We are a large Cassandra user • 11 clusters ranging in size from 3 to 36 nodes • Total of 28TB of data managed over ~85 nodes • Over 2 billion C* column writes per day Tuesday, June 18, 13
  • 17. We are a large Cassandra user • 11 clusters ranging in size from 3 to 36 nodes • Total of 28TB of data managed over ~85 nodes • Over 2 billion C* column writes per day • Powers all of our analytics infrastructure Tuesday, June 18, 13
  • 18. We are a large Cassandra user • 11 clusters ranging in size from 3 to 36 nodes • Total of 28TB of data managed over ~85 nodes • Over 2 billion C* column writes per day • Powers all of our analytics infrastructure • Much much bigger cluster coming soon Tuesday, June 18, 13
  • 19. What problem are we trying to solve? Lots of data, complex queries, answered really quickly... but how?? Tuesday, June 18, 13
  • 20. From mountains of useless data... Tuesday, June 18, 13
  • 21. To nuggets of truth... Tuesday, June 18, 13
  • 22. To nuggets of truth... • Quickly Tuesday, June 18, 13
  • 23. To nuggets of truth... • Quickly • Painlessly Tuesday, June 18, 13
  • 24. To nuggets of truth... • Quickly • Painlessly • At scale? Tuesday, June 18, 13
  • 26. Today: Precomputed aggregates • Video metrics computed along several high cardinality dimensions Tuesday, June 18, 13
  • 27. Today: Precomputed aggregates • Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change Tuesday, June 18, 13
  • 28. Today: Precomputed aggregates • Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change • Most computed aggregates are never read Tuesday, June 18, 13
  • 29. Today: Precomputed aggregates • Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change • Most computed aggregates are never read • What if we need more dynamic queries? – Top content for mobile users in France – Engagement curves for users who watched recommendations – Data mining, trends, machine learning Tuesday, June 18, 13
  • 30. The static - dynamic continuum • Super fast lookups • Inflexible, wasteful • Best for 80% most common queries • Always compute results from raw data • Flexible but slow 100% Precomputation 100% Dynamic Tuesday, June 18, 13
  • 31. The static - dynamic continuum • Super fast lookups • Inflexible, wasteful • Best for 80% most common queries • Always compute results from raw data • Flexible but slow 100% Precomputation100% Dynamic Tuesday, June 18, 13
  • 32. Where we want to be Partly dynamic • Pre-aggregate most common queries • Flexible, fast dynamic queries • Easily generate many materialized views Tuesday, June 18, 13
  • 34. Industry Trends • Fast execution frameworks – Impala Tuesday, June 18, 13
  • 35. Industry Trends • Fast execution frameworks – Impala • In-memory databases – VoltDB, Druid Tuesday, June 18, 13
  • 36. Industry Trends • Fast execution frameworks – Impala • In-memory databases – VoltDB, Druid • Streaming and real-time Tuesday, June 18, 13
  • 37. Industry Trends • Fast execution frameworks – Impala • In-memory databases – VoltDB, Druid • Streaming and real-time • Higher-level, productive data frameworks – Cascading, Hive, Pig Tuesday, June 18, 13
  • 38. Why Spark and Shark? “Lightning-fast in-memory cluster computing” Tuesday, June 18, 13
  • 40. Introduction to Spark • In-memory distributed computing framework Tuesday, June 18, 13
  • 41. Introduction to Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 Tuesday, June 18, 13
  • 42. Introduction to Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 • Targeted problems that MR is bad at: – Iterative algorithms (machine learning) – Interactive data mining Tuesday, June 18, 13
  • 43. Introduction to Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 • Targeted problems that MR is bad at: – Iterative algorithms (machine learning) – Interactive data mining • More general purpose than Hadoop MR Tuesday, June 18, 13
  • 44. Introduction to Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 • Targeted problems that MR is bad at: – Iterative algorithms (machine learning) – Interactive data mining • More general purpose than Hadoop MR • Active contributions from ~ 15 companies Tuesday, June 18, 13
  • 50. Throughput: Memory is king 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 51. Throughput: Memory is king 0 37500 75000 112500 150000 C*, cold cache C*, warm cache Spark RDD 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 52. Throughput: Memory is king 0 37500 75000 112500 150000 C*, cold cache C*, warm cache Spark RDD 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 53. Throughput: Memory is king 0 37500 75000 112500 150000 C*, cold cache C*, warm cache Spark RDD 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 54. Throughput: Memory is king 0 37500 75000 112500 150000 C*, cold cache C*, warm cache Spark RDD 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 56. Developers love it • “I wrote my first aggregation job in 30 minutes” Tuesday, June 18, 13
  • 57. Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API Tuesday, June 18, 13
  • 58. Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft Tuesday, June 18, 13
  • 59. Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python Tuesday, June 18, 13
  • 60. Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python • Interactive REPL shell Tuesday, June 18, 13
  • 61. Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python • Interactive REPL shell • EASY testing!! Tuesday, June 18, 13
  • 62. Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python • Interactive REPL shell • EASY testing!! • Low latency - quick development cycles Tuesday, June 18, 13
  • 63. Spark word count example 1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 } Tuesday, June 18, 13
  • 64. Spark word count example file = spark.textFile("hdfs://...")   file.flatMap(line => line.split(" "))     .map(word => (word, 1))     .reduceByKey(_ + _) 1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 } Tuesday, June 18, 13
  • 65. The Spark Ecosystem Spark Tachyon - in-memory caching DFS Tuesday, June 18, 13
  • 66. The Spark Ecosystem Bagel - Pregel on Spark Spark Tachyon - in-memory caching DFS Tuesday, June 18, 13
  • 67. The Spark Ecosystem Bagel - Pregel on Spark HIVE on Spark Spark Tachyon - in-memory caching DFS Tuesday, June 18, 13
  • 68. The Spark Ecosystem Bagel - Pregel on Spark HIVE on Spark Spark Streaming - discretized stream processing Spark Tachyon - in-memory caching DFS Tuesday, June 18, 13
  • 69. Shark - HIVE on Spark Tuesday, June 18, 13
  • 70. Shark - HIVE on Spark • 100% HiveQL compatible Tuesday, June 18, 13
  • 71. Shark - HIVE on Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds Tuesday, June 18, 13
  • 72. Shark - HIVE on Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds • Reuse UDFs, SerDe’s, StorageHandlers Tuesday, June 18, 13
  • 73. Shark - HIVE on Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds • Reuse UDFs, SerDe’s, StorageHandlers • Can use DSE / CassandraFS for Metastore Tuesday, June 18, 13
  • 74. Shark - HIVE on Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds • Reuse UDFs, SerDe’s, StorageHandlers • Can use DSE / CassandraFS for Metastore • Easy Scala/Java integration via Spark - easier than writing UDFs Tuesday, June 18, 13
  • 75. Our new analytics architecture How we integrate Cassandra and Spark/Shark Tuesday, June 18, 13
  • 76. From raw events to fast queries Raw Events Raw Events Raw Events Tuesday, June 18, 13
  • 77. From raw events to fast queries Ingestion C* event store Raw Events Raw Events Raw Events Tuesday, June 18, 13
  • 78. From raw events to fast queries Ingestion C* event store Raw Events Raw Events Raw Events Spark Spark Spark View 1 View 2 View 3 Tuesday, June 18, 13
  • 79. From raw events to fast queries Ingestion C* event store Raw Events Raw Events Raw Events Spark Spark Spark View 1 View 2 View 3 Spark Predefined queries Tuesday, June 18, 13
  • 80. From raw events to fast queries Ingestion C* event store Raw Events Raw Events Raw Events Spark Spark Spark View 1 View 2 View 3 Spark Shark Predefined queries Ad-hoc HiveQL Tuesday, June 18, 13
  • 86. Event Store Cassandra schema t0 t1 t2 t3 t4 2013-04-05 T00:00Z#id1 {event0: a0} {event1: a1} {event2: a2} {event3: a3} {event4: a4} Event CF Tuesday, June 18, 13
  • 87. Event Store Cassandra schema t0 t1 t2 t3 t4 2013-04-05 T00:00Z#id1 {event0: a0} {event1: a1} {event2: a2} {event3: a3} {event4: a4} ipaddr:10.20.30.40:t1 videoId:45678:t1 providerId:500:t0 2013-04-05 T00:00Z#id1 Event CF EventAttr CF Tuesday, June 18, 13
  • 88. Unpacking raw events t0 t1 2013-04-05 T00:00Z#id1 {video: 10, type:5} {video: 11, type:1} 2013-04-05 T00:00Z#id2 {video: 20, type:5} {video: 25, type:9} UserID Video Type id1 10 5 Tuesday, June 18, 13
  • 89. Unpacking raw events t0 t1 2013-04-05 T00:00Z#id1 {video: 10, type:5} {video: 11, type:1} 2013-04-05 T00:00Z#id2 {video: 20, type:5} {video: 25, type:9} UserID Video Type id1 10 5 id1 11 1 Tuesday, June 18, 13
  • 90. Unpacking raw events t0 t1 2013-04-05 T00:00Z#id1 {video: 10, type:5} {video: 11, type:1} 2013-04-05 T00:00Z#id2 {video: 20, type:5} {video: 25, type:9} UserID Video Type id1 10 5 id1 11 1 id2 20 5 Tuesday, June 18, 13
  • 91. Unpacking raw events t0 t1 2013-04-05 T00:00Z#id1 {video: 10, type:5} {video: 11, type:1} 2013-04-05 T00:00Z#id2 {video: 20, type:5} {video: 25, type:9} UserID Video Type id1 10 5 id1 11 1 id2 20 5 id2 25 9 Tuesday, June 18, 13
  • 92. Tips for InputFormat Development Tuesday, June 18, 13
  • 93. Tips for InputFormat Development • Know which target platforms you are developing for – Which API to write against? New? Old? Both? Tuesday, June 18, 13
  • 94. Tips for InputFormat Development • Know which target platforms you are developing for – Which API to write against? New? Old? Both? • Be prepared to spend time tuning your split computation – Low latency jobs require fast splits Tuesday, June 18, 13
  • 95. Tips for InputFormat Development • Know which target platforms you are developing for – Which API to write against? New? Old? Both? • Be prepared to spend time tuning your split computation – Low latency jobs require fast splits • Consider sorting row keys by token for data locality Tuesday, June 18, 13
  • 96. Tips for InputFormat Development • Know which target platforms you are developing for – Which API to write against? New? Old? Both? • Be prepared to spend time tuning your split computation – Low latency jobs require fast splits • Consider sorting row keys by token for data locality • Implement predicate pushdown for HIVE SerDe’s – Use your indexes to reduce size of dataset Tuesday, June 18, 13
  • 98. Example: OLAP processing t0 2013-04 -05T00: 00Z#id1 {video: 10, type:5} 2013-04 -05T00: 00Z#id2 {video: 20, type:5} C* events OLAP Aggregates OLAP Aggregates OLAP Aggregates Cached Materialized Views Spark Spark Spark Tuesday, June 18, 13
  • 99. Example: OLAP processing t0 2013-04 -05T00: 00Z#id1 {video: 10, type:5} 2013-04 -05T00: 00Z#id2 {video: 20, type:5} C* events OLAP Aggregates OLAP Aggregates OLAP Aggregates Cached Materialized Views Spark Spark Spark Union Tuesday, June 18, 13
  • 100. Example: OLAP processing t0 2013-04 -05T00: 00Z#id1 {video: 10, type:5} 2013-04 -05T00: 00Z#id2 {video: 20, type:5} C* events OLAP Aggregates OLAP Aggregates OLAP Aggregates Cached Materialized Views Spark Spark Spark Union Query 1: Plays by Provider Tuesday, June 18, 13
  • 101. Example: OLAP processing t0 2013-04 -05T00: 00Z#id1 {video: 10, type:5} 2013-04 -05T00: 00Z#id2 {video: 20, type:5} C* events OLAP Aggregates OLAP Aggregates OLAP Aggregates Cached Materialized Views Spark Spark Spark Union Query 1: Plays by Provider Query 2: Top content for mobile Tuesday, June 18, 13
  • 102. Performance numbers 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 103. Performance numbers Spark: C* -> OLAP aggregates cold cache, 1.4 million events 130 seconds C* -> OLAP aggregates warmed cache 20-30 seconds OLAP aggregate query via Spark (56k records) 60 ms 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 104. Performance numbers Spark: C* -> OLAP aggregates cold cache, 1.4 million events 130 seconds C* -> OLAP aggregates warmed cache 20-30 seconds OLAP aggregate query via Spark (56k records) 60 ms 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 105. Performance numbers Spark: C* -> OLAP aggregates cold cache, 1.4 million events 130 seconds C* -> OLAP aggregates warmed cache 20-30 seconds OLAP aggregate query via Spark (56k records) 60 ms 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0 Tuesday, June 18, 13
  • 106. OLAP WorkFlow Aggregation JobSpark Executors REST Job Server Aggregate Tuesday, June 18, 13
  • 107. OLAP WorkFlow Aggregation JobSpark Executors Cassandra REST Job Server Aggregate Tuesday, June 18, 13
  • 108. OLAP WorkFlow DatasetAggregation JobSpark Executors Cassandra REST Job Server Aggregate Tuesday, June 18, 13
  • 109. OLAP WorkFlow DatasetAggregation Job Query JobSpark Executors Cassandra REST Job Server Aggregate Query Tuesday, June 18, 13
  • 110. OLAP WorkFlow DatasetAggregation Job Query JobSpark Executors Cassandra REST Job Server Aggregate Query Result Tuesday, June 18, 13
  • 111. OLAP WorkFlow DatasetAggregation Job Query JobSpark Executors Cassandra REST Job Server Query Job Aggregate Query Result Query Result Tuesday, June 18, 13
  • 113. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? Tuesday, June 18, 13
  • 114. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! Tuesday, June 18, 13
  • 115. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! • Can also replicate cached dataset to survive single node failures Tuesday, June 18, 13
  • 116. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! • Can also replicate cached dataset to survive single node failures • Persist materialized views back to C*, then load into cache -- now recovery path is much faster Tuesday, June 18, 13
  • 117. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! • Can also replicate cached dataset to survive single node failures • Persist materialized views back to C*, then load into cache -- now recovery path is much faster • Persistence also enables multiple processes to hold cached dataset Tuesday, June 18, 13
  • 119. Shark Demo • Local shark node, 1 core, MBP • How to create a table from C* using our inputformat • Creating a cached Shark table • Running fast queries Tuesday, June 18, 13
  • 120. Creating a Shark Table from InputFormat Tuesday, June 18, 13
  • 121. Creating a cached table Tuesday, June 18, 13
  • 125. THANK YOU • @evanfchan • ev@ooyala.com Tuesday, June 18, 13
  • 126. THANK YOU • @evanfchan • ev@ooyala.com Tuesday, June 18, 13
  • 127. THANK YOU • @evanfchan • ev@ooyala.com • WE ARE HIRING!! Tuesday, June 18, 13
  • 128. Spark: Under the hood Map DatasetReduce Map Driver Map DatasetReduce Map Map DatasetReduce Map One executor process per node Driver Tuesday, June 18, 13