C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan Chan
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and HIVE, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over today's standard solutions.


Presentation Transcript

  • Real-time Analytics with Cassandra, Spark and Shark
  • Who is this guy?
    • Staff Engineer, Compute and Data Services, Ooyala
    • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
    • Scala/Akka guy
    • Very excited by open source, big data projects; will share some today
    • @evanfchan
  • Agenda
    • Ooyala and Cassandra
    • What problem are we trying to solve?
    • Spark and Shark
    • Our Spark/Cassandra architecture
    • Demo
  • Cassandra at Ooyala: who is Ooyala, and how we use Cassandra
  • OOYALA: Powering personalized video experiences across all screens.
  • Company overview
    • Founded in 2007; commercially launched in 2009
    • 230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
    • Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
    • Over 1 billion videos played per month and 2 billion analytic events per day
    • 25% of U.S. online viewers watch video powered by Ooyala
  • Trusted video partner: strategic partners and customers (logo slide)
  • We are a large Cassandra user
    • 12 clusters ranging in size from 3 to 115 nodes
    • Total of 28TB of data managed over ~200 nodes
    • Largest cluster: 115 nodes, 1.92PB storage, 15TB RAM
    • Over 2 billion C* column writes per day
    • Powers all of our analytics infrastructure
  • What problem are we trying to solve? Lots of data, complex queries, answered really quickly... but how??
  • From mountains of useless data...
  • To nuggets of truth...
    • Quickly
    • Painlessly
    • At scale?
  • Today: precomputed aggregates
    • Video metrics computed along several high-cardinality dimensions
    • Very fast lookups, but inflexible and hard to change
    • Most computed aggregates are never read
    • What if we need more dynamic queries?
      – Top content for mobile users in France
      – Engagement curves for users who watched recommendations
      – Data mining, trends, machine learning
  • The static-dynamic continuum
    • 100% precomputation: super fast lookups; inflexible and wasteful; best for the 80% most common queries
    • 100% dynamic: always compute results from raw data; flexible but slow
  • Where we want to be: partly dynamic
    • Pre-aggregate the most common queries
    • Flexible, fast dynamic queries
    • Easily generate many materialized views
  • Industry trends
    • Fast execution frameworks
      – Impala
    • In-memory databases
      – VoltDB, Druid
    • Streaming and real-time
    • Higher-level, productive data frameworks
      – Cascading, Hive, Pig
  • Why Spark and Shark? "Lightning-fast in-memory cluster computing"
  • Introduction to Spark
    • In-memory distributed computing framework
    • Created by UC Berkeley AMP Lab in 2010
    • Targeted problems that MapReduce is bad at:
      – Iterative algorithms (machine learning)
      – Interactive data mining
    • More general purpose than Hadoop MR
    • Active contributions from ~15 companies
  • (Diagram: in Hadoop, data flows from HDFS through chained MapReduce jobs; in Spark, transformations such as map(), join(), and cache() chain over in-memory datasets. A sketch of these operations follows.)
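    As a rough illustration of the operations the diagram names, here is a minimal, self-contained Scala sketch; the data and names are made up, and the import follows the later org.apache.spark package layout (Spark 0.7 used the bare spark._ package):

      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._   // pair-RDD operations (join, reduceByKey)

      object RddOpsSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local[2]", "rdd-ops-sketch")

          // Two pair RDDs keyed by userId (made-up data)
          val videos = sc.parallelize(Seq(("id1", 10), ("id2", 20)))
          val types  = sc.parallelize(Seq(("id1", 5), ("id2", 9)))

          // join() combines the datasets by key; cache() pins the result in
          // memory so later queries skip recomputing it from source
          val joined = videos.join(types).cache()

          // map() transforms each record, reusing the cached dataset
          val rows = joined.map { case (user, (video, typ)) => (user, video, typ) }
          rows.collect().foreach(println)

          sc.stop()
        }
      }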
  • Throughput: memory is king
    (Chart, scale 0 to 150,000: C* cold cache vs. C* warm cache vs. Spark RDD. 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0.)
  • Developers love it
    • "I wrote my first aggregation job in 30 minutes"
    • High-level "distributed collections" API
    • No Hadoop cruft
    • Full power of Scala, Java, Python
    • Interactive REPL shell
    • EASY testing!!
    • Low latency: quick development cycles
  • Spark word count example

    Spark (Scala):

      val file = spark.textFile("hdfs://...")
      file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

    The same job in Hadoop MapReduce (Java):

      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();

          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }
  • The Spark ecosystem
    • Spark
    • Bagel: Pregel on Spark
    • HIVE on Spark (Shark)
    • Spark Streaming: discretized stream processing
    • Tachyon: in-memory caching DFS
  • Shark - HIVE on Spark
    • 100% HiveQL compatible
    • 10-100x faster than HIVE; answers in seconds
    • Reuse UDFs, SerDes, StorageHandlers
    • Can use DSE / CassandraFS for the Metastore
    • Easy Scala/Java integration via Spark: easier than writing UDFs (see the sketch below)
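    A minimal sketch of that last point, assuming Shark's sql2rdd call (the method name is from memory of the Shark docs, and the row accessor is hypothetical; treat it as illustrative only):

      // Pull a HiveQL result into Spark as an RDD, then post-process it in
      // plain Scala rather than writing a Hive UDF. sql2rdd and getString
      // are assumptions about the Shark API, shown for illustration only.
      val plays = sharkContext.sql2rdd("SELECT providerId, videoId FROM events")

      val playsByProvider = plays
        .map(row => (row.getString("providerId"), 1L))   // hypothetical accessor
        .reduceByKey(_ + _)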
  • Our new analytics architecture: how we integrate Cassandra and Spark/Shark
  • From raw events to fast queries
    (Diagram: Raw Events -> Ingestion -> C* event store -> Spark jobs materialize View 1 / View 2 / View 3; predefined queries run via Spark, ad-hoc HiveQL via Shark.)
  • Our Spark/Shark/Cassandra stack
    (Diagram: each of Node1/Node2/Node3 runs Cassandra, the Cassandra InputFormat and SerDe, a Spark Worker, and Shark; a Spark Master and a Job Server coordinate the cluster.)
  • Event store Cassandra schema (row-key sketch below)
    Event CF: row key 2013-04-05T00:00Z#id1; columns t0..t4 hold {event0:a0}, {event1:a1}, {event2:a2}, {event3:a3}, {event4:a4}
    EventAttr CF: same row key; one column per attribute:timestamp, e.g. ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0
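    A small sketch of how a composite row key like 2013-04-05T00:00Z#id1 could be built: a day-bucketed UTC timestamp plus an id. The format string and helper are assumptions for illustration, not Ooyala's code:

      import java.text.SimpleDateFormat
      import java.util.{Date, TimeZone}

      // Bucket the event time to the day, then append "#" and the id,
      // yielding keys like "2013-04-05T00:00Z#id1" so one row holds a
      // day's worth of events for that id.
      def rowKey(eventTime: Date, id: String): String = {
        val fmt = new SimpleDateFormat("yyyy-MM-dd'T00:00Z'")
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
        fmt.format(eventTime) + "#" + id
      }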
  • Unpacking raw events (the slide builds the table one row at a time)
    Wide rows such as 2013-04-05T00:00Z#id1 with columns {video: 10, type: 5}, {video: 11, type: 1} and 2013-04-05T00:00Z#id2 with columns {video: 20, type: 5}, {video: 25, type: 9} unpack into one record per event:

      UserID  Video  Type
      id1     10     5
      id1     11     1
      id2     20     5
      id2     25     9
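    A hedged sketch of that unpacking in Spark, assuming each wide row arrives as a (rowKey, columns) pair; the shapes and names are illustrative:

      // Each wide row: (rowKey, one Map per event column)
      val rawRows = sc.parallelize(Seq(
        ("2013-04-05T00:00Z#id1", Seq(Map("video" -> 10, "type" -> 5),
                                      Map("video" -> 11, "type" -> 1))),
        ("2013-04-05T00:00Z#id2", Seq(Map("video" -> 20, "type" -> 5),
                                      Map("video" -> 25, "type" -> 9)))))

      // flatMap emits one (UserID, Video, Type) record per event column
      val events = rawRows.flatMap { case (rowKey, cols) =>
        val userId = rowKey.split('#').last        // "2013-...#id1" -> "id1"
        cols.map(c => (userId, c("video"), c("type")))
      }
      // events.collect() == Array((id1,10,5), (id1,11,1), (id2,20,5), (id2,25,9))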
  • Tips for InputFormat development (skeleton sketch below)
    • Know which target platforms you are developing for
      – Which API to write against? New? Old? Both?
    • Be prepared to spend time tuning your split computation
      – Low-latency jobs require fast splits
    • Consider sorting row keys by token for data locality
    • Implement predicate pushdown for HIVE SerDes
      – Use your indexes to reduce the size of the dataset
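    A skeletal sketch of the two methods those tips revolve around, against the new (org.apache.hadoop.mapreduce) API; the class name and bodies are placeholders, not the actual Ooyala InputFormat:

      import java.util.{List => JList}
      import org.apache.hadoop.mapreduce.{InputFormat, InputSplit, JobContext,
                                          RecordReader, TaskAttemptContext}

      class CassandraEventInputFormat extends InputFormat[String, Array[Byte]] {
        // getSplits runs on every job launch, so it must be fast for
        // low-latency queries (e.g. cache ring/token metadata). One split
        // per token range, sorted by token, keeps each worker's reads
        // mostly local.
        override def getSplits(ctx: JobContext): JList[InputSplit] = ???

        // The reader iterates rows within the split's token range;
        // predicate pushdown (via the HIVE SerDe) narrows what it fetches.
        override def createRecordReader(split: InputSplit,
            ctx: TaskAttemptContext): RecordReader[String, Array[Byte]] = ???
      }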
  • Example: OLAP processing
    (Diagram: C* events such as 2013-04-05T00:00Z#id1 {video:10, type:5} and 2013-04-05T00:00Z#id2 {video:20, type:5} flow through Spark into OLAP aggregates, kept as cached materialized views; Query 1 (plays by provider) and Query 2 (top content for mobile) run over a union of the views. A sketch follows.)
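    A minimal sketch of the union step, with made-up aggregate RDDs standing in for the cached materialized views (assumes the pair-RDD implicits from SparkContext._ are in scope):

      // Two cached per-batch aggregates: (providerId, playCount)
      val batch1 = sc.parallelize(Seq(("500", 120L), ("501", 80L))).cache()
      val batch2 = sc.parallelize(Seq(("500", 60L))).cache()

      // Query 1 (plays by provider) runs over the union of the cached views
      val playsByProvider = (batch1 union batch2).reduceByKey(_ + _)
      playsByProvider.collect()   // Array((500,180), (501,80))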
  • Performance numbers (6-node C*/DSE 1.1.9 cluster, Spark 0.7.0)
    • Spark, C* -> OLAP aggregates, cold cache (1.4 million events): 130 seconds
    • C* -> OLAP aggregates, warmed cache: 20-30 seconds
    • OLAP aggregate query via Spark (56k records): 60 ms
  • Spark: under the hood
    (Diagram: a Driver coordinates one executor process per node; each executor runs map and reduce stages over its in-memory partition of the dataset.)
  • Fault tolerance
    • The cached dataset lives in the Java heap only: what if the process dies?
    • Spark lineage: automatic recomputation from source, but this is expensive!
    • Can also replicate the cached dataset to survive single-node failures (sketch below)
    • Persist materialized views back to C*, then load into cache; now the recovery path is much faster
    • Persistence also enables multiple processes to hold the cached dataset
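    A short sketch of the replication option, using a later Spark StorageLevel name (MEMORY_ONLY_2 keeps two in-memory copies; 0.7-era names may have differed):

      import org.apache.spark.storage.StorageLevel

      // Keep two in-memory copies of the materialized view so a single
      // node failure does not force a lineage recomputation from raw events
      val view = events.map { case (_, video, _) => (video, 1L) }.reduceByKey(_ + _)
      view.persist(StorageLevel.MEMORY_ONLY_2)

      // Writing `view` back to C* (writer omitted) makes recovery a reload
      // from C* instead of a recomputation, and lets other processes load
      // the same cached dataset.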
  • Demo time
  • Shark demo
    • Local Shark node, 1 core, MacBook Pro
    • How to create a table from C* using our InputFormat
    • Creating a cached Shark table
    • Running fast queries
  • Backup slides
    • THE NEXT FEW SLIDES ARE STRICTLY BACKUP IN CASE THE LIVE DEMO DOESN'T WORK
  • Creating a Shark Table from InputFormat
  • Creating a cached table
  • Doing a row count ... don’t try in HIVE!
  • Top k providerId query
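  • For reference, a hedged reconstruction of what those four backup-slide steps might look like, driven from Scala. SharkEnv/SharkContext and sql() are from memory of the Shark docs, and the InputFormat/SerDe class names are placeholders, not Ooyala's code:

      import shark.{SharkContext, SharkEnv}

      val sc: SharkContext = SharkEnv.initWithSharkContext("shark-demo")

      // 1. Create a table over C* via a custom InputFormat + SerDe
      //    (com.example class names are assumptions)
      sc.sql("""CREATE EXTERNAL TABLE events
                ROW FORMAT SERDE 'com.example.CassandraEventSerDe'
                STORED AS
                  INPUTFORMAT 'com.example.CassandraEventInputFormat'
                  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'""")

      // 2. Shark caches tables whose names end in "_cached"
      sc.sql("CREATE TABLE events_cached AS SELECT * FROM events")

      // 3. Row count, fast against the in-memory table (don't try this in HIVE!)
      sc.sql("SELECT COUNT(*) FROM events_cached").foreach(println)

      // 4. Top-k providerId query
      sc.sql("""SELECT providerId, COUNT(*) AS cnt FROM events_cached
                GROUP BY providerId ORDER BY cnt DESC LIMIT 10""").foreach(println)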
  • THANK YOU