Real-time Analytics with Cassandra, Spark and Shark

December 7, 2013
Who is this guy
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
• Scala/Akka guy
• github.com/velvia
• @evanfchan

Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
• Our Spark/Cassandra Architecture
• Demo

Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra

OOYALA
Powering personalized video experiences across all screens.

COMPANY OVERVIEW
• Founded in 2007, commercially launched in 2009
• 300 employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
• Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
• Over 1 billion videos played per month and 2 billion analytic events per day
• 25% of U.S. online viewers watch video powered by Ooyala

TRUSTED VIDEO PARTNER
[Logo grid: customers and strategic partners]

We are a large Cassandra user
• 12 clusters ranging in size from 3 to 107 nodes
• Total of 28TB of data managed over ~220 nodes
• Over 2 billion C* column writes per day
• Powers all of our analytics infrastructure
• DSE/C* 1.0.x, 1.1.x, 1.2.6
• Large prod cluster is one of the biggest Cassandra installations

What problem are we trying to solve?
Lots of data, complex queries, answered really quickly... but how??

From mountains of raw data...

To nuggets of truth...
• Quickly
• Painlessly
• At scale?
Today: Precomputed aggregates
• Video metrics computed along several high-cardinality dimensions
• Very fast lookups, but inflexible and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
  – Top content for mobile users in France
  – Engagement curves for users who watched recommendations
  – Data mining, trends, machine learning
The static-dynamic continuum
100% Precomputation
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most common queries
100% Dynamic
• Always compute results from raw data
• Flexible but slow
Where we want to be
Partly dynamic
• Pre-aggregate most common queries
• Easily generate many materialized views
• Flexible, fast dynamic queries
Industry Trends
• Fast execution frameworks
  – Impala, Drill, Presto
• In-memory databases
  – VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data frameworks
  – Cascading, Hive, Pig

Why Spark and Shark?
“Lightning-fast in-memory cluster computing”

Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
– Iterative algorithms (machine learning)
– Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~ 15 companies
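A minimal sketch of the "interactive data mining" point in practice, assuming a Spark shell or driver where sc is the SparkContext; the HDFS path and record layout are made up for illustration:

// Read the raw data once and keep the working set in cluster memory
val events = sc.textFile("hdfs://.../events").cache()

// The first action materializes the cache; later passes hit memory only
val total = events.count()

// Follow-up queries reuse the cached data instead of re-reading HDFS
val byCountry = events.map(line => (line.split("\t")(2), 1))   // assumes country is field 3
                      .reduceByKey(_ + _)
val topTen = byCountry.collect().sortBy(-_._2).take(10)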

[Diagram: a Hadoop MapReduce pipeline (Map → Reduce → HDFS → Map → Reduce) contrasted with a Spark flow that composes map(), join(), cache() and other transforms directly over the data sources]
Throughput: Memory is king
[Bar chart: throughput of C* with a cold cache vs. C* with a warm cache vs. a Spark RDD, on a scale of 0 to 150,000; measured on a 6-node C*/DSE 1.1.9 cluster with Spark 0.7.0]
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
• EASY testing!!
• Low latency - quick development cycles

Spark word count example

file = spark.textFile("hdfs://...")
 
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)


For comparison, the equivalent Hadoop MapReduce WordCount:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
The Spark Ecosystem
• Spark (core engine)
• Bagel - Pregel on Spark
• Shark - HIVE on Spark
• Spark Streaming - discretized stream processing
• Tachyon - in-memory caching DFS

Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
• Easy Scala/Java integration via Spark - easier than writing UDFs
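As a hedged sketch of that last point (not code from the deck; the Shark API names here are recalled from the Shark 0.8-era API and should be treated as assumptions): run HiveQL through a SharkContext, then keep working on the result as an ordinary RDD in plain Scala instead of packaging the logic as a Hive UDF.

import shark.{SharkContext, SharkEnv}

val sc: SharkContext = SharkEnv.initWithSharkContext("shark-example")  // assumed initializer

// HiveQL in, ordinary Spark RDD out
val plays = sc.sql2rdd("SELECT userid, videoid FROM events WHERE dt = '2013-04-05'")

// From here it is plain Scala on an RDD -- no UDF required
println("rows: " + plays.count())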

Our new analytics architecture
How we integrate Cassandra and Spark/Shark

From raw events to fast queries
[Architecture diagram: Raw Events → Spark (ingestion) → C* event store → Spark jobs running predefined queries build View 1 / View 2 / View 3, with Shark providing ad-hoc HiveQL on top]
Our Spark/Shark/Cassandra Stack
[Stack diagram: a Spark Master and a Job Server on top; Node1, Node2 and Node3 each run Cassandra with a co-located Spark Worker, Shark, SerDe and InputFormat]

Event Store Cassandra schema

Event CF:
  row key: 2013-04-05T00:00Z#id1
  columns: t0 → {event0: a0}, t1 → {event1: a1}, t2 → {event2: a2}, t3 → {event3: a3}, t4 → {event4: a4}

EventAttr CF:
  row key: 2013-04-05T00:00Z#id1
  columns: ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0
Unpacking raw events
(Built up over several slides: each event column of a wide row becomes one row of the unpacked table.)

Raw C* wide rows:
  2013-04-05T00:00Z#id1   t0 → {video: 10, type: 5}   t1 → {video: 11, type: 1}
  2013-04-05T00:00Z#id2   t0 → {video: 20, type: 5}   t1 → {video: 25, type: 9}

Unpacked table:
  UserID   Video   Type
  id1      10      5
  id1      11      1
  id2      20      5
  id2      25      9
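An illustrative sketch (not the production code) of that unpacking step: flatten each wide C* row into one tuple per event column. The blob shape mirrors the example rows above; rawRows is assumed to be an RDD of (row key, event columns).

case class PlayEvent(userId: String, video: Int, eventType: Int)

// "2013-04-05T00:00Z#id1" -> "id1"; each {video: 10, type: 5} blob becomes one row
def unpack(rowKey: String, columns: Seq[Map[String, Int]]): Seq[PlayEvent] = {
  val userId = rowKey.split('#').last
  columns.map(blob => PlayEvent(userId, blob("video"), blob("type")))
}

// val unpacked = rawRows.flatMap { case (key, cols) => unpack(key, cols) }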
Options for Spark/Cassandra Integration
• Hadoop InputFormat
  – ColumnFamilyInputFormat - reads all rows from 1 CF
  – CqlPagingInputFormat, etc. - CQL3, secondary indexes
  – Roll your own (join multiple CFs, etc.)
• Spark native RDD
  – sc.parallelize(rowkeys).flatMap(readColumns(_))
  – JdbcRdd + Cassandra JDBC driver
• Calliope - http://tuplejump.github.io/calliope/
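Expanding the sc.parallelize(rowkeys).flatMap(readColumns(_)) idea into a slightly fuller sketch; fetchColumns and keysForDay are hypothetical stand-ins for whatever Cassandra client calls you actually use:

// Hypothetical helpers -- replace the bodies with real C* client calls
def fetchColumns(cf: String, rowKey: String): Seq[(String, Array[Byte])] = Seq.empty
def keysForDay(day: String): Seq[String] = Seq.empty

def readColumns(rowKey: String): Seq[(String, Array[Byte])] = fetchColumns("events", rowKey)

// Distribute the keys across 64 partitions, then fan out one C* read per key on the workers
val rowkeys = keysForDay("2013-04-05")
val rows = sc.parallelize(rowkeys, 64).flatMap(key => readColumns(key).map(col => (key, col)))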

Tips for InputFormat Development
• Know which target platforms you are developing for
  – Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
  – Low latency jobs require fast splits
• Consider sorting row keys by token for data locality
• Implement predicate pushdown for HIVE SerDe's
  – Use your indexes to reduce size of dataset
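A conceptual sketch of the split-computation tips, assuming the InputFormat works from an explicit list of row keys; tokenOf and replicaFor are stand-ins for the real partitioner and replica lookups in your Cassandra client:

case class KeySplit(keys: Seq[String], preferredHost: String)

def tokenOf(rowKey: String): BigInt = BigInt(rowKey.hashCode)   // stand-in for the partitioner's token
def replicaFor(rowKey: String): String = "node1"                // stand-in for a replica lookup

// Sort by token so keys stored together land in the same split, and keep splits
// small so low-latency jobs get many fast tasks instead of a few slow ones
def makeSplits(rowKeys: Seq[String], splitSize: Int): Seq[KeySplit] =
  rowKeys.sortBy(tokenOf)
         .grouped(splitSize)
         .map(keys => KeySplit(keys, replicaFor(keys.head)))
         .toSeq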

Example: OLAP processing
[Diagram: Spark jobs read C* event rows (e.g. 2013-04-05T00:00Z#id1 → {video: 10, type: 5}) and reduce them to OLAP aggregates; the union of these cached materialized views serves Query 1 (plays by provider) and Query 2 (top content for mobile)]
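A rough sketch of that flow, again assuming a driver with SparkContext sc and using made-up helper names (aggregatesFor stands in for the Spark job that reduces one slice of C* events to per-provider play counts); each slice is cached, and queries run over the union of the cached views:

// Stand-in for the real aggregation job: (providerId -> play count) for one time slice
def aggregatesFor(day: String) = sc.parallelize(Seq.empty[(Int, Long)])

val view1 = aggregatesFor("2013-04-05").cache()
val view2 = aggregatesFor("2013-04-06").cache()

// Query 1: plays by provider across the cached materialized views, with no re-read of C*
val playsByProvider = view1.union(view2).reduceByKey(_ + _)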
Performance numbers
(6-node C*/DSE 1.1.9 cluster, Spark 0.7.0)

  Spark: C* -> OLAP aggregates, cold cache, 1.4 million events     130 seconds
  C* -> OLAP aggregates, warmed cache                               20-30 seconds
  OLAP aggregate query via Spark (56k records)                      60 ms
OLAP WorkFlow
[Diagram: queries arrive at a REST Job Server; an Aggregation Job on the Spark executors builds a cached Dataset from Cassandra, and Query Jobs run against that Dataset to return results]
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now recovery path is much faster
• Persistence also enables multiple processes to hold cached dataset

Demo time

Shark Demo
• Local Shark node, 1 core, MBP
• How to create a table from C* using our InputFormat
• Creating a cached Shark table
• Running fast queries
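A hedged sketch of what those demo steps look like as HiveQL driven from a SharkContext; the InputFormat class name and table schema are invented here, and the _cached suffix is the Shark convention for tables held in cluster memory:

// 1. External table over C* via a custom InputFormat (class name is hypothetical)
sc.sql("""CREATE EXTERNAL TABLE events (userid STRING, video INT, type INT)
          STORED AS INPUTFORMAT 'com.example.CassandraEventInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'""")

// 2. Cached Shark table: the *_cached name keeps it in memory
sc.sql("CREATE TABLE events_cached AS SELECT * FROM events")

// 3. Fast queries hit the in-memory table
sc.sql("SELECT video, COUNT(*) FROM events_cached GROUP BY video")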

Creating a Shark Table from InputFormat

Creating a cached table

Querying cached table

THANK YOU
• @evanfchan
• ev@ooyala.com
• WE ARE HIRING!!

Spark: Under the hood
[Diagram: the Driver coordinates Map and Reduce tasks over distributed Datasets (RDDs); one executor process per node runs the tasks]
