Scaling Video Analytics With Apache Cassandra
ILYA MAYKOV | Dec 6th, 2011
Agenda
- Ooyala – quick company overview
- What do we mean by “video analytics”?
- What are the challenges?
- Cassandra at Ooyala – technical details
- Lessons learned
- Q&A
Analytics Overview
1. Aggregate and Visualize Data
2. Give Insights
3. Enable experimentation
4. Optimize automagically
Analytics Overview
Go from this … to this … and this!
System Architecture
State of Analytics Today
- Collect vast amounts of data
- Aggregate, slice in various dimensions
- Report and visualize
- Personalize and recommend
- Scalable, fault tolerant, near real-time using Hadoop + Cassandra
Analytics Challenges
- Scale
- Processing Speed
- Depth
- Accuracy
- Developer speed
Challenge: Scale
- 150M+ unique monthly users
- 15M+ monthly video hours
- Daily inflow: billions of log pings, TBs of uncompressed logs
- 10TB+ of historical analytics data in C*, covering a period of about 4 years
- Exponential data growth in C*: currently 1TB+ per month
Challenge: Processing Speed
- Large “fan-out” to multiple dimensions + per-video-asset analytics = lots of data being written. Parallelizable!
- “Analytics delay” metric = time from a log ping hitting a server to being visible to a publisher in the analytics UI
- Current avg. delay: 10–25 minutes depending on time of day
- Target max analytics delay: <30 minutes (Hadoop system)
- Would like <1 minute (future real-time processing system)
Challenge: Depth
- Per-video-asset analytics means millions of new rows added and/or updated in each CF every day
- 10+ dimensions (CFs) for slicing data in different ways
- Queries range from “everything in my account for all time” to “video X in city Y on date Z”
- We’d like 1-hour granularity, but that’s up to 24x more rows
- Or even 1-minute granularity in real time, but that could be >1000x more rows …
Challenge: Accuracy
- Publishers make business decisions based on analytics data
- Ooyala makes business decisions based on analytics data
- Ooyala bills publishers based on analytics data
- Analytics need to be accurate and verifiable
Challenge: Developer Speed
- We’re still a small company with limited developer resources
- Like to iterate fast and release often, but …
- … we use Hadoop MR for large-scale data processing
- Hadoop is a Java framework
- So, MapReduce jobs have to be written in Java … right?
Word Count Example: Java

Word Count Example: Ruby

Word Count Example: Scala
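The code on the three word-count slides above isn’t reproduced in this transcript. As a stand-in, here is a minimal Scala word count written against Hadoop’s native Java MapReduce API; it is an illustrative sketch (the class names and the modern `Job.getInstance` call are my choices), not the code shown at the talk:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Tokenize each input line and emit (word, 1).
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("""\s+""").filter(_.nonEmpty).foreach { w =>
      word.set(w); context.write(word, one)
    }
}

// Sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[SumReducer])   // summing is associative, so the reducer doubles as a combiner
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

The point of showing all three languages is the tradeoff quantified on the next slide: Scala stays close to Ruby in code size while keeping the JVM runtime speed and native Hadoop API of the Java version.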
Challenge: Developer Speed
Word Count MR – Language Comparison

            Lines   Characters   Development Speed   Runtime Speed   Hadoop API
  Java        69       2395      Low                 High            Native
  Ruby        30        738      High                Low             Streaming
  Scala       35       1284      Medium              High            Native
Why Cassandra?
A bit of history
- 2008 – 2009: Single MySQL DB
- Early 2010:
  - Too much data
  - Want higher granularity and more ways to slice data
  - Need a scalable data store!
Why Cassandra?
- Linear scaling (space, load) – handles the Scale & Depth challenges
- Tunable consistency – QUORUM/QUORUM R/W allows accuracy
- Very fast writes, reasonably fast reads
- Great community support, rapidly evolving and improving codebase – 0.6.13 => 0.8.7 increased our performance by >4x
- Simpler and fewer dependencies than HBase, richer data model than a simple K/V store, more scalable than an RDBMS, …
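A side note on the tunable-consistency point: QUORUM reads combined with QUORUM writes work because any read quorum and any write quorum overlap in at least one replica. A minimal sketch of that arithmetic (the replication factor of 3 is an assumption, not something stated in the slides):

```scala
// Why QUORUM/QUORUM gives accurate reads: a read quorum always intersects
// the most recent write quorum, so at least one replica has the latest value.
object QuorumOverlap {
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  def main(args: Array[String]): Unit = {
    val rf     = 3                          // assumed replication factor
    val (r, w) = (quorum(rf), quorum(rf))   // QUORUM reads, QUORUM writes
    println(s"RF=$rf, R=$r, W=$w, overlapping: ${r + w > rf}")  // RF=3, R=2, W=2, overlapping: true
  }
}
```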
Data Model – Overview
- Row keys specify the entity and time (and some other stuff …)
- Column families specify the dimension
- Column names specify a data point within that dimension
- Column values are maps of key/value pairs that represent a collection of related metrics
- Different groups of related metrics are stored under different row keys
Data Model – Example

CF: Country
  Key                   “CA”                               “US”                                …
  {video: 123, …}       {displays: 50, plays: 40, …}       {displays: 100, plays: 75, …}       …
  {publisher: 456, …}   {displays: 5000, plays: 4100, …}   {displays: 1100, plays: 756, …}     …
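The same layout expressed as plain Scala types, as a reading aid; the type and field names are mine, not Ooyala’s code:

```scala
object DataModelSketch {
  // A column value: a map of related metrics.
  type Metrics = Map[String, Long]             // e.g. Map("displays" -> 50L, "plays" -> 40L)

  // One row of a dimension CF: column name (here a country code) -> metrics.
  type DimensionRow = Map[String, Metrics]

  // Row keys identify the entity; the time and metrics-group parts appear on later slides.
  case class RowKey(entityType: String, entityId: Long)

  // The "Country" CF from the example above.
  val countryCF: Map[RowKey, DimensionRow] = Map(
    RowKey("video", 123L) -> Map(
      "CA" -> Map("displays" -> 50L,   "plays" -> 40L),
      "US" -> Map("displays" -> 100L,  "plays" -> 75L)),
    RowKey("publisher", 456L) -> Map(
      "CA" -> Map("displays" -> 5000L, "plays" -> 4100L),
      "US" -> Map("displays" -> 1100L, "plays" -> 756L))
  )
}
```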
Data Model – Timestamps
- Row keys have a timestamp component
- Row keys have a time granularity component
- Allows for efficient queries over large time ranges (few row keys with big numbers)
- Preserves granularity at smaller time ranges
- Currently Month/Week/Day. Maybe Hour/Minute in the future?
Data Model – Timestamps

  Key                               “CA”                “US”              …
  {video: 123, day: 2011/10/31}     {plays: 1, …}       {plays: 1, …}     …
  {video: 123, day: 2011/11/01}     {plays: 2, …}       {plays: 1, …}     …
  {video: 123, day: 2011/11/02}     {plays: 4, …}       null              …
  {video: 123, day: 2011/11/03}     {plays: 8, …}       {plays: 1, …}     …
  {video: 123, day: 2011/11/04}     {plays: 16, …}      {plays: 1, …}     …
  {video: 123, day: 2011/11/05}     {plays: 32, …}      {plays: 1, …}     …
  {video: 123, day: 2011/11/06}     {plays: 64, …}      {plays: 1, …}     …
  {video: 123, week: 2011/10/31}    {plays: 127, …}     {plays: 6, …}     …
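A sketch of how the timestamp and granularity components might be derived when a ping is processed; the string encoding of the key is invented for illustration, the slides only say the key carries both pieces:

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

object RowKeys {
  // Granularities currently in use: month / week / day.
  sealed trait Granularity { def name: String; def bucket(d: LocalDate): LocalDate }
  case object Day   extends Granularity { val name = "day";   def bucket(d: LocalDate) = d }
  case object Week  extends Granularity { val name = "week";  def bucket(d: LocalDate) = d.`with`(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY)) }
  case object Month extends Granularity { val name = "month"; def bucket(d: LocalDate) = d.withDayOfMonth(1) }

  // A ping on a given date updates one row per granularity, so a query for
  // a whole month can read 1 month row instead of ~30 day rows.
  def rowKeys(entity: String, id: Long, date: LocalDate): Seq[String] =
    Seq(Day, Week, Month).map(g => s"$entity:$id:${g.name}:${g.bucket(date)}")

  // rowKeys("video", 123, LocalDate.of(2011, 11, 2))
  // -> video:123:day:2011-11-02, video:123:week:2011-10-31, video:123:month:2011-11-01
}
```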
Data Model – Metrics
- Performance – plays, displays, unique users, time watched, bytes downloaded, etc.
- Sharing – tweets, Facebook shares, diggs, etc.
- Engagement – how many users watched through certain time buckets of a video
- QoS – bitrates, buffering events
- Ad – ad requests, impressions, clicks, mouse-overs, failures, etc.
Data Model – Metrics

CF: Country
  Key                               “CA”                                “US”                                …
  {video: 123, metrics: video, …}   {displays: 50, plays: 40, …}        {displays: 100, plays: 75, …}       …
  {video: 123, metrics: ad, …}      {clicks: 3, impressions: 40, …}     {clicks: 7, impressions: 61, …}     …
Data Model – Dimensions
- Analytics data is sliced in different dimensions == CFs
- Example: country. Column names are “US”, “CA”, “JP”, etc.
- Column values are aggregates of the metric for the row key in that country
- For example: the video performance metrics for the month of 2011-10-01 in the US for video asset 123
- Example: platform. Column names: “desktop:windows:chrome”, “tablet:ipad”, “mobile:android”, “settop:ps3”.
Data Model – Dimensions

Key: {video: 123, …}
  CF: Country    “CA” => {plays: 20, …}                  “US” => {plays: 30, …}
  CF: DMA        “SF Bay Area” => {plays: 12, …}         “NYC” => {plays: 5, …}
  CF: Platform   “desktop:mac:chrome” => {plays: 7, …}   “settop:ps3” => {plays: 60, …}
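This is where the “fan-out” mentioned under Processing Speed comes from: one processed ping contributes the same metrics delta to one column in each dimension CF, all under the same row key. A tiny illustrative sketch (the names are mine):

```scala
object FanOut {
  // A log ping after its dimension values have been resolved.
  case class Ping(videoId: Long, country: String, dma: String, platform: String)

  // (CF name, column name) pairs that receive this ping's metrics delta.
  def columnUpdates(p: Ping): Seq[(String, String)] = Seq(
    "Country"  -> p.country,    // e.g. "US"
    "DMA"      -> p.dma,        // e.g. "SF Bay Area"
    "Platform" -> p.platform    // e.g. "desktop:mac:chrome"
  )
}
```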
Data Model – Indices
- Need to efficiently answer “Top N” queries over an aggregate of multiple rows, sorted by some field in the metrics object
- But, column sort order is “CA” < “JP” < “US” regardless of field values
- Would like to support multiple fields to sort on, anyway
- Naïve implementation – read entire rows, aggregate, sort in RAM – pretty slow (sketched below)
- Solution: write additional index rows to C*
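For reference, the naïve baseline looks roughly like this in plain Scala: pull whole rows into memory, merge the metric maps column by column, then sort on the field of interest. It is easy to write, which is exactly why its slowness at this data volume motivates the index rows:

```scala
object NaiveTopN {
  type Metrics      = Map[String, Long]
  type DimensionRow = Map[String, Metrics]   // column name -> metrics, e.g. "US" -> Map("plays" -> 75L)

  // Merge several rows (e.g. the days of a month) column by column, summing each metric.
  def aggregate(rows: Seq[DimensionRow]): DimensionRow =
    rows.flatten.groupBy(_._1).map { case (col, kvs) =>
      col -> kvs.map(_._2).reduce { (a, b) =>
        (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
      }
    }

  // Top-N columns by one metric field: sort the whole aggregate in RAM.
  def topN(rows: Seq[DimensionRow], field: String, n: Int): Seq[(String, Long)] =
    aggregate(rows).toSeq
      .map { case (col, m) => col -> m.getOrElse(field, 0L) }
      .sortBy(-_._2)
      .take(n)
}
```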
Data Model – Indices
- Every data row may have 0 or more index rows, depending on the metrics type
- Index rows – empty column values; column names are prepended with the value of the indexed field, encoded as a fixed-width byte array
- Rely on C* to order the columns according to the indexed field
- Index rows are stored in separate CFs which have “i_” prepended to the dimension name.
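One way such an encoding could look; this is a sketch of the general idea (8-byte big-endian longs and a ':' separator are my assumptions, chosen to match the “40:CA”-style names shown on the next slide), not Ooyala’s actual serialization:

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object IndexEncoding {
  // Index column name = fixed-width, big-endian field value, then the data column name.
  // Fixed width matters: as decimal strings, "5000" would sort before "75".
  def indexColumnName(fieldValue: Long, dataColumn: String): Array[Byte] = {
    val col = dataColumn.getBytes(StandardCharsets.UTF_8)
    val buf = ByteBuffer.allocate(java.lang.Long.BYTES + 1 + col.length)
    buf.putLong(fieldValue)      // big-endian, so byte order == numeric order for non-negative values
    buf.put(':'.toByte)
    buf.put(col)
    buf.array()
  }
  // The column *value* stays empty: all the information is in the name,
  // and C*'s comparator keeps the columns sorted by the indexed field.
}
```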
Data Model – Indices

CF: country
  Key                   “CA”                               “US”                                …
  {video: 123, …}       {displays: 50, plays: 40, …}       {displays: 100, plays: 75, …}       …
  {publisher: 456, …}   {displays: 5000, plays: 4100, …}   {displays: 1100, plays: 756, …}     …

CF: i_country
  Key                                 Index columns (name => value)
  {video: 123, index: plays}          “40:CA” => null      “75:US” => null       …
  {publisher: 456, index: displays}   “5000:CA” => null    “1100:US” => null     …
Data Model – Indices
- Trivial to answer a “Top N” query for a single row if the field we sort on has an index: just read the last N columns of the index row (see the sketch below)
- What if the query spans multiple rows?
- Use the 3-pass uniform threshold algorithm. Guaranteed to get the top-N columns in any multi-row aggregate in 3 RPC calls. See: http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf
- Has some drawbacks: can’t do bottom-N, and computing top-N-to-2N is impossible – have to do top-2N and drop half.
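For the single-row case, the read amounts to a reversed slice over the index row. A sketch with an in-memory sequence standing in for the column slice call (in production this would be a reversed slice query against C*):

```scala
object TopNFromIndex {
  // indexColumns: the column names of one index row, already in the order the
  // comparator returns them, i.e. sorted by the fixed-width encoded field value.
  def topNSingleRow(indexColumns: Seq[String], n: Int): Seq[String] =
    indexColumns.takeRight(n).reverse   // the last N columns carry the N largest field values

  // Using the display form of the names from the previous slide:
  // topNSingleRow(Seq("40:CA", "75:US"), 1) -> Seq("75:US")
}
```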
Data Model – Drilldowns
- All cities in the world stored in one row, allowing us to do a global sort. What if we need cities within some region only?
- Solution: use “drilldown” indices.
- Just a special kind of index that includes only a subset of all data in the parent row.
- Example: all cities in the country “US”
- Works like regular index otherwise
- Not free – more than 1/3rd of all our C* disk usage
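A sketch of how a drilldown index row might be keyed next to its regular counterpart. The key layout is a guess for illustration only; the slides just say a drilldown covers a subset of the parent row:

```scala
object DrilldownSketch {
  // An index row over the City dimension of one data row.
  //   drilldown = None                      -> the global index (every city in the world in one row)
  //   drilldown = Some("country" -> "US")   -> a drilldown containing only US cities
  case class CityIndexKey(dataRowKey: String,
                          sortField: String,
                          drilldown: Option[(String, String)] = None)

  val globalTopCities = CityIndexKey("video:123:month:2011-10-01", "plays")
  val usTopCities     = CityIndexKey("video:123:month:2011-10-01", "plays", Some("country" -> "US"))
  // Both rows are written at ingest time; that write and storage amplification is
  // why drilldowns account for more than a third of C* disk usage.
}
```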
The Bad Stuff
- Read-modify-write is slow, because in C* read latency >> write latency
- Having a write-only pipeline would greatly speed up processing, but makes reading data more expensive (aggregate-on-read)
- And/or requires more complicated asynchronous aggregation
- Minimum granularity of 1 day is not that good, would like to do 1-hour or 1-minute
- But, storage requirements go up very fast
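The tradeoff in the first two bullets, sketched side by side (illustrative only; `read` and `write` stand in for whatever client calls the pipeline makes):

```scala
object WritePaths {
  type Metrics = Map[String, Long]

  def merge(a: Metrics, b: Metrics): Metrics =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

  // Aggregate-on-write (current): every update pays for a read, and reads are the slow part.
  def readModifyWrite(read: String => Metrics, write: (String, Metrics) => Unit,
                      column: String, delta: Metrics): Unit =
    write(column, merge(read(column), delta))

  // Write-only: append each delta as its own column; merging moves to read time
  // (aggregate-on-read) or to a separate asynchronous aggregation job.
  def appendOnly(write: (String, Metrics) => Unit,
                 column: String, batchId: Long, delta: Metrics): Unit =
    write(s"$column:$batchId", delta)
}
```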
The Bad Stuff
- Synchronous updates of time rollups and index rows make processing slower and increase delays
- But, asynchronous is harder to get right
- Reprocessing of data is currently difficult because of lack of locking – have to pause the regular pipeline
- Also have to reprocess log files in batches of full days
LESSONS LEARNED

DATA MODEL CHANGES ARE PAINFUL
… so design to make them less so

EVERYTHING WILL BREAK
… so test accordingly

SEPARATE LOGICALLY DIFFERENT DATA
… it will improve performance AND make your life simpler

PERF TEST WITH PRODUCTION LOAD
… if you can afford a second cluster

http://cassandra.apache.org
http://www.datastax.com/dev
http://www.ooyala.com

THANK YOU
