Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
What You Will Learn At This Meetup:
• Review of Cassandra analytics landscape: Hadoop & HIVE
• Custom input formats to extract data from Cassandra
• How Spark & Shark increase query speed & productivity over standard solutions

Abstract
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and HIVE, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and show how they enable often dramatic improvements in query speed and productivity over the standard solutions available today.

About Evan Chan
Evan Chan is a Software Engineer at Ooyala. In his own words: I love to design, build, and improve bleeding edge distributed data and backend systems using the latest in open source technologies. I am a big believer in GitHub, open source, and meetups, and have given talks at conferences such as the Cassandra Summit 2013.

South Bay Cassandra Meetup URL: http://www.meetup.com/DataStax-Cassandra-South-Bay-Users/events/147443722/

Presentation Transcript

  • Real-time Analytics with Cassandra, Spark and Shark Saturday, December 7, 13
  • Who is this guy • Staff Engineer, Compute and Data Services, Ooyala • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc. • Scala/Akka guy • github.com/velvia • @evanfchan
  • Agenda • Ooyala and Cassandra • What problem are we trying to solve? • Spark and Shark • Our Spark/Cassandra Architecture • Demo
  • Cassandra at Ooyala Who is Ooyala, and how we use Cassandra
  • OOYALA Powering personalized video experiences across all screens. CONFIDENTIAL—DO NOT DISTRIBUTE
  • COMPANY OVERVIEW • Founded in 2007, commercially launched in 2009 • 300 employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara • Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites • Over 1 billion videos played per month and 2 billion analytic events per day • 25% of U.S. online viewers watch video powered by Ooyala
  • TRUSTED VIDEO PARTNER • Customers • Strategic partners
  • We are a large Cassandra user • 12 clusters ranging in size from 3 to 107 nodes • Total of 28TB of data managed over ~220 nodes • Over 2 billion C* column writes per day • Powers all of our analytics infrastructure • DSE/C* 1.0.x, 1.1.x, 1.2.6 • Large prod cluster is one of the biggest Cassandra installations
  • What problem are we trying to solve? Lots of data, complex queries, answered really quickly... but how?
  • From mountains of raw data...
  • To nuggets of truth... • Quickly • Painlessly • At scale?
  • Today: Precomputed aggregates • Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change • Most computed aggregates are never read • What if we need more dynamic queries? – Top content for mobile users in France – Engagement curves for users who watched recommendations – Data mining, trends, machine learning
  • The static - dynamic continuum • 100% Precomputation – Super fast lookups – Inflexible, wasteful – Best for 80% most common queries • 100% Dynamic – Always compute results from raw data – Flexible but slow
  • Where we want to be: Partly dynamic • Pre-aggregate most common queries • Easily generate many materialized views • Flexible, fast dynamic queries
  • Industry Trends • Fast execution frameworks – Impala, Drill, Presto • In-memory databases – VoltDB, Druid • Streaming and real-time • Higher-level, productive data frameworks – Cascading, Hive, Pig
  • Why Spark and Shark? “Lightning-fast in-memory cluster computing”
  • Introduction to Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 • Targeted problems that MR is bad at: – Iterative algorithms (machine learning) – Interactive data mining • More general purpose than Hadoop MR • Active contributions from ~ 15 companies
  • [Diagram: data sources (HDFS and others) feeding map(), join(), and cache() transforms across map and reduce stages]
  • Throughput: Memory is king [Bar chart comparing throughput of C* cold cache, C* warm cache, and Spark RDD; 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0]
  • Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python • Interactive REPL shell • EASY testing!! • Low latency - quick development cycles
  • Spark word count example

    file = spark.textFile("hdfs://...")
    file.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

    versus the equivalent Hadoop MapReduce job:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
      }
    }
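What the Spark chain above computes can be sketched in plain Python with no Spark dependency; this is just an illustration of the flatMap → map → reduceByKey semantics, not Spark's actual execution model:

```python
from collections import defaultdict

def word_count(lines):
    # flatMap(line => line.split(" ")) then map(word => (word, 1))
    pairs = ((word, 1) for line in lines for word in line.split())
    # reduceByKey(_ + _): sum the 1s per distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)
```

For example, `word_count(["to be or", "not to be"])` yields `{'to': 2, 'be': 2, 'or': 1, 'not': 1}`.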
  • The Spark Ecosystem • Bagel – Pregel on Spark • Shark – HIVE on Spark • Spark Streaming – discretized stream processing • Tachyon – in-memory caching DFS
  • Shark - HIVE on Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds • Reuse UDFs, SerDe’s, StorageHandlers • Can use DSE / CassandraFS for Metastore • Easy Scala/Java integration via Spark - easier than writing UDFs
  • Our new analytics architecture How we integrate Cassandra and Spark/Shark
  • From raw events to fast queries [Diagram: raw events → Spark ingestion → C* event store → Spark jobs running predefined queries that produce Views 1–3, plus Shark for ad-hoc HiveQL]
  • Our Spark/Shark/Cassandra Stack [Diagram: a Spark Master and Job Server coordinating three nodes; each node stacks Shark, a Spark Worker, a SerDe, and an InputFormat on top of a Cassandra node]
  • Event Store Cassandra schema • Event CF: row key 2013-04-05T00:00Z#id1, columns t0…t4 holding {event0: a0} … {event4: a4} • EventAttr CF: same row key, columns such as ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0
  • Unpacking raw events (built up over four slides): wide rows such as

    2013-04-05T00:00Z#id1 → {t0: {video: 10, type: 5}, t1: {video: 11, type: 1}}
    2013-04-05T00:00Z#id2 → {t0: {video: 20, type: 5}, t1: {video: 25, type: 9}}

    are flattened one event at a time into a table:

    UserID  Video  Type
    id1     10     5
    id1     11     1
    id2     20     5
    id2     25     9
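A minimal Python sketch of this unpacking step, assuming the hypothetical wide-row shape from the slides (row key = time bucket plus user id, columns t0, t1, … holding event maps):

```python
# Hypothetical in-memory stand-in for the C* event store wide rows.
raw_rows = {
    "2013-04-05T00:00Z#id1": {"t0": {"video": 10, "type": 5},
                              "t1": {"video": 11, "type": 1}},
    "2013-04-05T00:00Z#id2": {"t0": {"video": 20, "type": 5},
                              "t1": {"video": 25, "type": 9}},
}

def unpack(rows):
    """Flatten wide rows into (UserID, Video, Type) records."""
    records = []
    for row_key, columns in sorted(rows.items()):
        user_id = row_key.split("#", 1)[1]   # row key = bucket#userid
        for col_name in sorted(columns):     # t0, t1, ... in time order
            event = columns[col_name]
            records.append((user_id, event["video"], event["type"]))
    return records
```

Running `unpack(raw_rows)` produces one flat record per event, matching the table built up on the slides.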
  • Options for Spark/Cassandra Integration • Hadoop InputFormat – ColumnFamilyInputFormat - reads all rows from 1 CF – CqlPagingInputFormat, etc. - CQL3, 2-dary indexes – Roll your own (join multiple CFs, etc) • Spark native RDD – sc.parallelize(rowkeys).flatMap(readColumns(_)) – JdbcRdd + Cassandra JDBC driver • http://tuplejump.github.io/calliope/
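The Spark-native pattern above, sc.parallelize(rowkeys).flatMap(readColumns(_)), can be sketched sequentially in plain Python; read_columns here is a hypothetical stand-in for a per-row Cassandra read, and the data is made up:

```python
# Hypothetical in-memory stand-in for a Cassandra column family.
CF = {"rk1": [("t0", "a0"), ("t1", "a1")],
      "rk2": [("t0", "b0")]}

def read_columns(rowkey):
    """Stand-in for a per-row Cassandra read (e.g. via a CQL client)."""
    return [(rowkey, col, val) for col, val in CF.get(rowkey, [])]

# Sequential equivalent of sc.parallelize(rowkeys).flatMap(readColumns(_)):
# each row key expands to zero or more (rowkey, column, value) records.
rowkeys = ["rk1", "rk2"]
records = [rec for rk in rowkeys for rec in read_columns(rk)]
```

In real Spark the row keys are partitioned across workers, so each read_columns call runs on whichever node holds that partition.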
  • Tips for InputFormat Development • Know which target platforms you are developing for – Which API to write against? New? Old? Both? • Be prepared to spend time tuning your split computation – Low latency jobs require fast splits • Consider sorting row keys by token for data locality • Implement predicate pushdown for HIVE SerDe’s – Use your indexes to reduce size of dataset
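Predicate pushdown, the last tip above, can be sketched with a toy secondary index; all names and data here are invented for illustration:

```python
# Toy column family plus a secondary index on "provider".
data = {"r1": {"provider": 500, "plays": 3},
        "r2": {"provider": 501, "plays": 7},
        "r3": {"provider": 500, "plays": 2}}
index = {("provider", 500): ["r1", "r3"],
         ("provider", 501): ["r2"]}

def scan_with_pushdown(col, val):
    """Push the predicate into the scan: consult the index and read only
    matching rows, rather than scanning every row and filtering later."""
    return [data[rk] for rk in index.get((col, val), [])]
```

The payoff is that the SerDe hands the query engine only the rows that can match, which is exactly how an index shrinks the dataset a low-latency job must touch.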
  • Example: OLAP processing [Diagram: C* events feed Spark jobs that build OLAP aggregates, which are unioned into cached materialized views serving Query 1 (plays by provider) and Query 2 (top content for mobile)]
  • Performance numbers (6-node C*/DSE 1.1.9 cluster, Spark 0.7.0) – Spark: C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds – C* -> OLAP aggregates, warmed cache: 20-30 seconds – OLAP aggregate query via Spark (56k records): 60 ms
  • OLAP Workflow [Diagram: queries arrive at a REST Job Server, which runs an aggregation job and query jobs on Spark executors over a dataset read from Cassandra, returning results]
  • Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! • Can also replicate cached dataset to survive single node failures • Persist materialized views back to C*, then load into cache -- now recovery path is much faster • Persistence also enables multiple processes to hold cached dataset
  • Demo time
  • Shark Demo • Local shark node, 1 core, MBP • How to create a table from C* using our inputformat • Creating a cached Shark table • Running fast queries
  • Creating a Shark Table from InputFormat
  • Creating a cached table
  • Querying cached table
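The three demo steps above might look roughly like the following HiveQL. All table, column, and keyspace names are hypothetical, and the storage-handler class depends on the DSE/Hive Cassandra integration version in use; the one fixed point is Shark's convention that a table whose name ends in _cached is kept in cluster memory:

```sql
-- Create a table over C* data via a Cassandra storage handler / InputFormat
-- (handler class and properties are illustrative, not exact).
CREATE EXTERNAL TABLE events (userid STRING, video INT, type INT)
  STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
  TBLPROPERTIES ("cassandra.ks.name" = "analytics");

-- The _cached suffix tells Shark to materialize this table in memory.
CREATE TABLE events_cached AS SELECT * FROM events;

-- Subsequent queries run against the in-memory columnar table.
SELECT video, COUNT(*) AS plays FROM events_cached GROUP BY video;
```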
  • THANK YOU • @evanfchan • ev@ooyala.com • WE ARE HIRING!!
  • Spark: Under the hood [Diagram: a driver coordinating map and reduce stages over partitioned datasets; one executor process per node]