All Your Base

Slides from Tim Moreton's talk on "Apache Cassandra and Why BASE is great for real-time analytics" from All Your Base. Nov 23, 2012.


  1. Apache Cassandra and why BASE is great for real-time analytics
     Tim Moreton
  2. • Cassandra -- What makes it different?
     • Who’s using it, and for what?
     • DIY Real Time Analytics on Cassandra
     • The Easy Option -- Acunu Analytics
  3. BigTable data model, Dynamo distribution
  4. BigTable data model, Dynamo distribution
     Open sourced, 2008
     Incubator, 2009
     Top-Level, 2010
  6. • Multi-master architecture: no SPOF
     • Tunable consistency, multi-DC aware
     • High performance, optimised for writes
     • Atomic counters
  7. Data model
     user345: { chess: { lives: 2, score: 33 ... } ... }
  8. Data model
     user345: { chess: { lives: 2, score: 33 ... } ... }
     Row key: Rows arranged randomly around cluster. Load balanced, but no ordering. Put stuff to access sequentially within a row.
     user345   [chess, lives]: 2   [chess, score]: 44
     user292   [go, lives]: 4      [monop, avatar]: top_hat   [monop, score]: 33
     user188   [monop, score]: 13
  9. Data model
     Column key: Compound columns allow you to create multiple ordered ‘dictionaries’ in a row.
  10. Data model
      Flexible schemas: “Columns” are really just cell identifiers. Rows can be VERY wide.
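
The data model on slides 7 to 10 can be sketched in a few lines of Python. This is only an in-memory illustration, not Cassandra's storage format or API: the `WideRow` class and the `put` and `slice` helpers are hypothetical names. Rows are looked up by key with no ordering between them, while the cells inside one row stay sorted by their compound column key, so one "dictionary" (for example everything under `monop`) can be read as a contiguous slice.

```python
import bisect


class WideRow:
    """One row: cells kept sorted by their compound column key (a tuple)."""

    def __init__(self):
        self._keys = []   # column keys, kept in sorted order
        self._cells = {}  # column key -> value

    def put(self, column_key, value):
        if column_key not in self._cells:
            bisect.insort(self._keys, column_key)
        self._cells[column_key] = value

    def slice(self, prefix):
        """Return the cells whose column key starts with `prefix`, in column order."""
        start = bisect.bisect_left(self._keys, prefix)
        cells = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            cells.append((key, self._cells[key]))
        return cells


# Row key -> row. Rows are only ever fetched by key; nothing relies on row order.
table = {}


def put(row_key, column_key, value):
    table.setdefault(row_key, WideRow()).put(column_key, value)


# The example rows from slide 8.
put("user345", ("chess", "lives"), 2)
put("user345", ("chess", "score"), 44)
put("user292", ("go", "lives"), 4)
put("user292", ("monop", "avatar"), "top_hat")
put("user292", ("monop", "score"), 33)
put("user188", ("monop", "score"), 13)

monop_cells = table["user292"].slice(("monop",))
```

Because the schema is flexible, a new column such as `("go", "score")` can be added to any row without declaring it first.
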
  11. Tunable consistency, per operation
      Write: #Replicas ONE, QUORUM, ALL
      Read: #Replicas ONE, QUORUM, ALL
  12. Tunable consistency, per operation
      Risk of replica failing, multiple values
  13. Tunable consistency, per operation
      More likely to return out-of-date data
  14. Tunable consistency, per operation
      Never going to say “ok” if a replica is down
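
The trade-offs on slides 11 to 14 follow from simple arithmetic on the replication factor. The sketch below is an assumption-laden simplification, not Cassandra code: the function names are hypothetical, and it ignores mechanisms such as hinted handoff and read repair. It uses the usual quorum size of floor(N/2) + 1 and the standard rule that a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts sum to more than N.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for an operation at this level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unknown consistency level: " + level)


def read_overlaps_write(write_level, read_level, replication_factor):
    """True when every read replica set must intersect every write replica set."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor


def can_succeed(level, replication_factor, replicas_down):
    """True when enough replicas are still up to satisfy the level."""
    return replication_factor - replicas_down >= required_acks(level, replication_factor)
```

With three replicas, `read_overlaps_write("QUORUM", "QUORUM", 3)` holds while `read_overlaps_write("ONE", "ONE", 3)` does not, and `can_succeed("ALL", 3, 1)` is false, which is the "never going to say ok if a replica is down" case.
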
  16. Multi data center aware
      DC 1: r1, r2
      DC 2: r1, r2
  17. Multi data center aware
      user345
      DC 1: r1, r2
      DC 2: r1, r2
  18. Session Stores
      • Read dominated
      • Updates to existing items
      • Probably fits in RAM
      • Distribute for availability
      • Want: Atomicity
      Real Time Analytics
      • Write dominated
      • Updates very rare
      • Read “results” mostly
      • Distribute for availability, performance, capacity
      • Want: Rich queries
  19. An analytics app on Cassandra
      Source: Twitter
  20. eg: “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
      Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
      http://blog.twitter.com/2011/03/numbers.html
  21. Cassandra approach: For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
  22. 12:32:15 I like #trafficlights
      12:33:43 Nobody expects...
      12:33:49 I ate a #bee; woe is...
      12:34:04 Man, @acunu rocks!
      [1234, man] +1
      [1234, acunu] +1
      [1234, rock] +1
  23. Row key is ‘big’ time bucket; column key is ‘small’ time bucket
      Key                 00:01   00:02   ...
      [01/05/11, acunu]   3       5       ...
      [02/05/11, acunu]   12      4       ...
      ...                 ...     ...
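
The counter approach on slides 21 to 23 can be sketched as follows. Everything here is illustrative: a nested Python dictionary stands in for Cassandra counter columns, the `ingest` and `mentions_per_day` names are hypothetical, and the tokeniser simply lowercases and splits on non-alphanumerics (the slide's stemming of "rocks" to "rock" is not reproduced). The row key is the pair (day, term) and the column key is the minute, matching the layout on slide 23; each tweet counts at most once per term.

```python
import re
from collections import defaultdict
from datetime import datetime

# (day, term) -> minute bucket -> count; stands in for a counter column family.
counters = defaultdict(lambda: defaultdict(int))


def terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ingest(timestamp, text):
    """Write path: bump one counter per distinct term in the tweet."""
    day = timestamp.strftime("%d/%m/%y")     # 'big' time bucket (row key part)
    minute = timestamp.strftime("%H:%M")     # 'small' time bucket (column key)
    for term in set(terms(text)):
        counters[(day, term)][minute] += 1


def mentions_per_day(term, days):
    """Read path: sum one row per day, with no scan over the original tweets."""
    return {day: sum(counters.get((day, term), {}).values()) for day in days}


ingest(datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
ingest(datetime(2011, 5, 1, 12, 34, 30), "I like @acunu")
ingest(datetime(2011, 5, 2, 9, 0, 0), "acunu again")

per_day = mentions_per_day("acunu", ["01/05/11", "02/05/11", "03/05/11"])
```

The query cost depends on the number of days requested, not on the number of tweets ingested, which is the contrast with the batch approach on slide 20.
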
  24. Solution / Con (the Solution column did not survive extraction)
      • Scalability $$$
      • Not real time
      • Spartan query semantics: complex, DIY solutions
  25. Acunu Analytics
      High Velocity Event Streams: HTTP JSON, MQ, flume
      As events are ingested:
      ■ Update real time views
      ■ Refresh dashboards
      ■ Preserve original event data
      Provide definitions and real time views:
        create table foo (
          x long,
          y string,
          t time(hour, min),
          z path(/)
        );
        create view select sum(x) from foo where y group by z;
        create view select count from foo where x, t group by t;
      Dashboards and API: Via the RESTful HTTP API, command line tools, or the UI query builder deliver pre-computed results:
      ■ Roll-ups
      ■ Drilldowns
      ■ Trends
  32. count grouped by ... day
      count distinct(session)
      count
      avg(duration)
      ... geography
      ... browser
  33. { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... }
      21:00       all→1345    :00→45      :01→62      :02→87      ...
      22:00       all→3221    :00→22      :01→19      :02→104     ...
      ...         ...
      UK          all→228     user01→1    user14→12   user99→7    ...
      US          all→354     user01→4    user04→8    user56→17   ...
      ...
      UK, 22:00   all→1904    ...
      ∅           all→87314   UK→238      US→354      ...
  34. { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... }
      21:00       all→1345    :00→45      :01→62      :02→87      ...
      22:00       all→3222    :00→22      :01→19      :02→105     ...
      ...         ...
      UK          all→229     user01→2    user14→12   user99→7    ...
      US          all→354     user01→4    user04→8    user56→17   ...
      ...
      UK, 22:00   all→1905    ...
      ∅           all→87315   UK→239      US→354      ...
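
The step from slide 33 to slide 34 (one event arrives, several pre-computed cells move by one) can be sketched like this. It is a guess at the mechanism based only on which numbers change between the two slides, not Acunu's implementation: the `views` dictionary, the `ingest` function, and the use of `None` for the ∅ row are all assumptions for illustration.

```python
from collections import defaultdict

# row key -> column key -> count. None stands in for the ∅ (grand total) row.
views = defaultdict(lambda: defaultdict(int))

# Seed the cells visible on slide 33 that the event will touch.
views["22:00"].update({"all": 3221, ":00": 22, ":01": 19, ":02": 104})
views["UK"].update({"all": 228, "user01": 1, "user14": 12, "user99": 7})
views[("UK", "22:00")].update({"all": 1904})
views[None].update({"all": 87314, "UK": 238, "US": 354})


def ingest(event):
    """Increment every pre-computed cell this event contributes to."""
    hour, minute = event["time"].split(":")
    hour_key = hour + ":00"
    geo = event["geography"]
    user = event["cust_id"]
    touched = [
        (hour_key, "all"), (hour_key, ":" + minute),  # per-hour row, per-minute cell
        (geo, "all"), (geo, user),                    # per-geography row, per-user cell
        ((geo, hour_key), "all"),                     # geography and hour combined
        (None, "all"), (None, geo),                   # grand total, split by geography
    ]
    for row_key, column_key in touched:
        views[row_key][column_key] += 1


ingest({"cust_id": "user01", "session_id": 102, "geography": "UK",
        "browser": "IE", "time": "22:02"})
```

After the call, the seeded cells hold the values shown on slide 34, for example `views["22:00"]["all"]` is 3222 and `views[None]["UK"]` is 239.
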
  35. 21:00       all→1345    :00→45      :01→62      :02→87      ...
      22:00       all→3221    :00→22      :01→19      :02→104     ...
      ...         ...
      UK          all→228     user01→1    user14→12   user99→7    ...
      US          all→354     user01→4    user04→8    user56→17   ...
      ...
      UK, 22:00   all→1904    ...
      ∅           all→87314   UK→238      US→354      ...
  36. where time 21:00-22:00 count(*)
      21:00       all→1345    :00→45      :01→62      :02→87      ...
  38. where time 22:00-23:00, group by minute
      22:00       all→3222    :00→22      :01→19      :02→105     ...
  40. where geography=UK, group all by user
      UK          all→229     user01→2    user14→12   user99→7    ...
  42. count all
      ∅           all→87315   UK→239      US→354      ...
  44. group all by geo
      ∅           all→87315   UK→239      US→354      ...
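
The queries on slides 36 to 45 each appear to read a single pre-computed row. A minimal sketch under that reading follows; the `views` dictionary holds only the cells visible on the slides (the elided "..." cells are left out), `None` stands in for the ∅ row, and `count` and `group_by` are hypothetical helper names rather than part of any real API.

```python
# Pre-computed rows as shown on the slides after the event has been ingested.
views = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    ("UK", "22:00"): {"all": 1905},
    None: {"all": 87315, "UK": 239, "US": 354},
}


def count(row_key):
    """Total for a filter: one cell read."""
    return views[row_key]["all"]


def group_by(row_key):
    """Breakdown for a filter: one row read, minus the total cell."""
    return {column: n for column, n in views[row_key].items() if column != "all"}


hourly_total = count("21:00")    # where time 21:00-22:00 count(*)
per_minute = group_by("22:00")   # where time 22:00-23:00, group by minute
uk_by_user = group_by("UK")      # where geography=UK, group all by user
grand_total = count(None)        # count all
by_geo = group_by(None)          # group all by geo
```

No aggregation happens at query time in this sketch; the work was already done when each event was ingested.
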
  46. APPROXIMATE AGGREGATES
      Fast probabilistic data structures for COUNT UNIQUE, TOP n to trade accuracy for performance - predictably
      (chart: Accuracy vs Performance, parameter k)
      DRILLDOWN TO ORIGINAL EVENTS
      Identify the root causes of aggregate results
      TRENDING AND CORRELATION
      Proactively identify deviation from baseline and breaks from trends
      HIERARCHICAL AGGREGATES
      Automatic handling of paths, timestamps and geospatial queries
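
To make "probabilistic data structures for COUNT UNIQUE" concrete, here is a sketch of one well-known option, a k-minimum-values estimator. The slides do not say which structure Acunu Analytics uses, so this is a stand-in chosen for illustration, and the `KMinValues` class is a hypothetical name. It keeps only the k smallest hash values seen; a larger k costs more memory and gives a tighter estimate, which is the accuracy versus performance trade-off named on the slide.

```python
import hashlib
import heapq


class KMinValues:
    """Approximate distinct count from the k smallest hash values seen."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept
        self._kept = set()     # hashes currently in the heap, for de-duplication

    @staticmethod
    def _hash(item):
        """Map an item to a float in [0, 1] using the first 8 bytes of SHA-1."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._kept:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with this smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)          # fewer than k distinct items: exact
        return (self.k - 1) / -self._heap[0]


sketch = KMinValues(k=1024)
for i in range(20000):
    sketch.add("session-%d" % i)
approx_sessions = sketch.estimate()
```

Adding the same session identifiers again leaves the estimate unchanged, so the structure can sit on the ingest path and absorb repeated events.
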
  48. Shameless plug
      REAL-TIME BIG DATA ANALYTICS, POWERED BY NOSQL
      ■ Roll-up and transform cubes in real time
      ■ Leverage NoSQL for write-optimization, schema freedom, and horizontal scalability
      CASSANDRA ENHANCED FOR OPS: HIGHER DENSITY, LOWER TCO
      ■ Enhanced Cassandra for higher density, better scalability, simpler management
      ■ ‘Single pane of glass’ management UI
      STORAGE CRAFTED FOR BIG DATA
      ■ In-kernel storage engine designed and optimised for NoSQL databases
      Stack, top to bottom: DASHBOARDS UI, JSON APIs; ACUNU ANALYTICS; OPS UI; ENHANCED CASSANDRA; CASTLE: STORAGE ENGINE; COMMODITY HW OR CLOUD
  49. http://bit.ly/UBsdej
  50. THANK YOU
      @acunu @timmoreton
      Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.
