All Your Base

Slides from Tim Moreton's talk on "Apache Cassandra and Why BASE is great for real-time analytics" from All Your Base. Nov 23, 2012.


All Your Base Presentation Transcript

  • 1. Apache Cassandra and why BASE is great for real-time analytics. Tim Moreton
  • 2.
      • Cassandra -- What makes it different?
      • Who's using it, and for what?
      • DIY Real Time Analytics on Cassandra
      • The Easy Option -- Acunu Analytics
  • 3. BigTable data model, Dynamo distribution
  • 4-5. BigTable data model, Dynamo distribution. Open sourced, 2008. Incubator, 2009. Top-Level, 2010.
  • 6.
      • Multi-master architecture: no SPOF
      • Tunable consistency, multi-DC aware
      • High performance, optimised for writes
      • Atomic counters
  • 7. Data model
        user345: { chess: { lives: 2, score: 33 ... } ... }
  • 8. Data model: row key. Rows are arranged randomly around the cluster: load balanced, but no ordering. Put stuff to access sequentially within a row.
    Example rows (row key | [compound column key]: value):
        user345 | [chess, lives]: 2 | [chess, score]: 44
        user292 | [go, lives]: 4 | [monop, avatar]: top_hat | [monop, score]: 33
        user188 | [monop, score]: 13
  • 9. Data model: column key. Compound columns allow you to create multiple ordered 'dictionaries' in a row. Same example rows as slide 8.
  • 10. Data model: flexible schemas. "Columns" are really just cell identifiers. Rows can be VERY wide. Same example rows as slide 8.
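The data model on slides 7 to 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `WideRow` and `node_for` are hypothetical names, and the modulo hash below merely stands in for Cassandra's real token-ring placement. It shows the two properties the slides call out: row keys are scattered by hash (no ordering across rows), while compound column keys stay sorted inside a row, so one row can hold several ordered 'dictionaries'.

```python
import bisect
import hashlib


class WideRow:
    """One row whose compound column keys (tuples) are kept in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of tuple column keys
        self._values = {}  # column key -> value

    def put(self, column, value):
        if column not in self._values:
            bisect.insort(self._keys, column)
        self._values[column] = value

    def slice(self, prefix):
        """Return (column, value) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            result.append((key, self._values[key]))
        return result


def node_for(row_key, nodes):
    """Pick a node by hashing the row key (simplified stand-in for a token ring)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]


# Row user292 from the slide's example table.
user292 = WideRow()
user292.put(("monop", "score"), 33)
user292.put(("go", "lives"), 4)
user292.put(("monop", "avatar"), "top_hat")

print(user292.slice(("monop",)))
print(node_for("user292", ["node-a", "node-b", "node-c"]))
```

Slicing on the prefix `("monop",)` reads one ordered 'dictionary' out of the row without touching the `go` columns, which is the access pattern the "put stuff to access sequentially within a row" advice is aiming at.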
  • 11. Tunable consistency, per operation. Write: number of replicas ONE, QUORUM or ALL. Read: number of replicas ONE, QUORUM or ALL.
  • 12. As slide 11, with the callout: "Risk of replica failing, multiple values".
  • 13. As slide 11, with the callout: "More likely to return out-of-date data".
  • 14. As slide 11, with the callout: "Never going to say 'ok' if a replica is down".
  • 15. As slide 11, without callouts.
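The trade-offs on slides 11 to 15 follow from simple replica arithmetic, sketched below. The function names are hypothetical and the availability check is deliberately simplified (it ignores hinted handoff, timeouts and multi-DC levels); the only facts relied on are that QUORUM means a majority of the replicas and that a read is guaranteed to overlap the latest write when the read and write replica counts sum to more than the replication factor.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must acknowledge an operation at a given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_overlaps_write(write_level, read_level, replication_factor):
    """True when every read set must intersect every write set (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


def can_acknowledge(level, replication_factor, replicas_down):
    """Simplified availability check: are enough replicas alive to say 'ok'?"""
    alive = replication_factor - replicas_down
    return alive >= replicas_required(level, replication_factor)


if __name__ == "__main__":
    rf = 3
    for write in ("ONE", "QUORUM", "ALL"):
        for read in ("ONE", "QUORUM", "ALL"):
            print(write, read, read_overlaps_write(write, read, rf))
```

With a replication factor of 3, writing and reading at ONE can return out-of-date data, QUORUM on both sides always overlaps, and ALL cannot acknowledge an operation once a single replica is down.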
  • 16. Multi data center aware. DC 1: replicas r1, r2. DC 2: replicas r1, r2.
  • 17. Multi data center aware. Key user345. DC 1: replicas r1, r2. DC 2: replicas r1, r2.
  • 18. Session stores vs. real time analytics:
        Session Stores              | Real Time Analytics
        Read dominated              | Write dominated
        Updates to existing items   | Updates very rare
        Probably fits in RAM        | Read "results" mostly
        Distribute for availability | Distribute for availability, performance, capacity
        Want: Atomicity             | Want: Rich queries
  • 19. An analytics app on Cassandra. Source: Twitter
  • 20. eg: "show me the number of mentions of 'Acunu' per day, between May and November 2011, on Twitter". Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data. http://blog.twitter.com/2011/03/numbers.html
  • 21. As slide 20, plus: Cassandra approach: for each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.
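The ~30 billion tweets and ~4.2 TB figures can be reproduced with back-of-the-envelope arithmetic. Both inputs below are assumptions rather than numbers stated on the slide: roughly 140 million tweets per day (the order of magnitude Twitter reported in early 2011) and 140 bytes per tweet, counting only the text.

```python
from datetime import date

TWEETS_PER_DAY = 140_000_000  # assumption: average daily volume in 2011
BYTES_PER_TWEET = 140         # assumption: tweet text only, no metadata

# May to November 2011 inclusive, i.e. 1 May up to (not including) 1 December.
days = (date(2011, 12, 1) - date(2011, 5, 1)).days
total_tweets = days * TWEETS_PER_DAY
total_bytes = total_tweets * BYTES_PER_TWEET

print(f"{days} days, {total_tweets / 1e9:.1f} billion tweets, {total_bytes / 1e12:.1f} TB")
```

Under these assumptions a batch job has to scan every one of those tweets to answer a single query, which is the cost the counter-based approach avoids.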
  • 22. Incoming tweets:
        12:32:15 I like #trafficlights
        12:33:43 Nobody expects...
        12:33:49 I ate a #bee; woe is...
        12:34:04 Man, @acunu rocks!
    Counter increments: [1234, man] +1, [1234, acunu] +1, [1234, rock] +1
  • 23. As slide 22, plus the counter layout. Row key is the 'big' time bucket; column key is the 'small' time bucket.
        Key               | 00:01 | 00:02 | ...
        [01/05/11, acunu] | 3     | 5     | ...
        [02/05/11, acunu] | 12    | 4     | ...
        ...               | ...   | ...   | ...
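A minimal in-memory sketch of the counter scheme on slides 21 to 23 follows. It uses nested dictionaries in place of Cassandra counter columns, and the helper names (`record_tweet`, `mentions_per_day`) are hypothetical. The tokenizer is an assumption too: it lowercases and splits on non-alphanumerics without stemming, so "rocks" is counted as `rocks` rather than the slide's `rock`, and each term is counted once per tweet.

```python
import re
from collections import defaultdict
from datetime import datetime


def new_counters():
    """Row key (day, term) -> column key minute bucket -> count."""
    return defaultdict(lambda: defaultdict(int))


def record_tweet(counters, when, text):
    """Increment one counter per distinct term: 'big' bucket in the row key, 'small' in the column."""
    day = when.strftime("%d/%m/%y")
    minute = when.strftime("%H:%M")
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        counters[(day, term)][minute] += 1


def mentions_per_day(counters, term, days):
    """Answer the query by reading counters only; no tweet is rescanned."""
    return {day: sum(counters.get((day, term), {}).values()) for day in days}


if __name__ == "__main__":
    counters = new_counters()
    record_tweet(counters, datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
    record_tweet(counters, datetime(2011, 5, 1, 12, 34, 30), "Trying out Acunu today")
    record_tweet(counters, datetime(2011, 5, 2, 9, 15, 0), "acunu again")
    print(mentions_per_day(counters, "acunu", ["01/05/11", "02/05/11"]))
```

The write path does a little extra work per tweet so that the read path touches one row per day, which suits a store that is optimised for writes.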
  • 24. Solution / Con. Cons listed: scalability $$$; not real time; spartan query semantics (complex, DIY solutions).
  • 25. Acunu Analytics
      • High-velocity event streams: HTTP JSON, MQ, flume
      • As events are ingested: update real time views; refresh dashboards; preserve original event data
      • Dashboards and API deliver pre-computed results: roll-ups, drilldowns, trends
      • Provide definitions and real time views via the RESTful HTTP API, command line tools, or the UI query builder:
            create table foo (
                x long,
                y string,
                t time(hour, min),
                z path(/)
            );
            create view select sum(x) from foo where y group by z;
            create view select count from foo where x, t group by t;
  • 27. count grouped by ... day
  • 28. count grouped by ... day; count distinct(session)
  • 29. count grouped by ... day; count distinct(session); count
  • 30. count grouped by ... day; count distinct(session); count; avg(duration)
  • 31. count grouped by ... day; count distinct(session); count; ... geography; avg(duration)
  • 32. count grouped by ... day; count distinct(session); count; ... geography; avg(duration); ... browser
  • 33. Incoming event: { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... }
    Counter rows before the event is applied (row key | columns):
        21:00     | all→1345  | :00→45   | :01→62    | :02→87    | ...
        22:00     | all→3221  | :00→22   | :01→19    | :02→104   | ...
        ...
        UK        | all→228   | user01→1 | user14→12 | user99→7  | ...
        US        | all→354   | user01→4 | user04→8  | user56→17 | ...
        ...
        UK, 22:00 | all→1904  | ...
        ∅         | all→87314 | UK→238   | US→354    | ...
  • 34. The same event, with the counter rows after it is applied:
        21:00     | all→1345  | :00→45   | :01→62    | :02→87    | ...
        22:00     | all→3222  | :00→22   | :01→19    | :02→105   | ...
        ...
        UK        | all→229   | user01→2 | user14→12 | user99→7  | ...
        US        | all→354   | user01→4 | user04→8  | user56→17 | ...
        ...
        UK, 22:00 | all→1905  | ...
        ∅         | all→87315 | UK→239   | US→354    | ...
  • 35. The counter rows as on slide 33, shown without the event.
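The update from slide 33 to slide 34 can be reproduced with a small fan-out function: one event increments a counter in every row it contributes to. This is a sketch of the idea only, not Acunu's implementation; `ingest` is a hypothetical name, row keys are modelled as tuples (with the empty tuple standing in for ∅), and only the rows visible on the slide are maintained.

```python
from collections import defaultdict


def new_counters(seed=None):
    """Row key (tuple) -> column key -> count, optionally seeded with existing values."""
    counters = defaultdict(lambda: defaultdict(int))
    for row, columns in (seed or {}).items():
        counters[row].update(columns)
    return counters


def ingest(counters, event):
    """Fan one event out to every pre-aggregated row shown on the slide."""
    hour = event["time"][:2] + ":00"   # e.g. "22:02" -> "22:00"
    minute = ":" + event["time"][3:]   # e.g. "22:02" -> ":02"
    geo = event["geography"]
    user = event["cust_id"]
    updates = [
        ((hour,), "all"), ((hour,), minute),  # per-hour total and per-minute split
        ((geo,), "all"), ((geo,), user),      # per-geography total and per-user split
        ((geo, hour), "all"),                 # geography and hour combined
        ((), "all"), ((), geo),               # grand total and per-geography split
    ]
    for row, column in updates:
        counters[row][column] += 1


if __name__ == "__main__":
    counters = new_counters({
        ("22:00",): {"all": 3221, ":02": 104},
        ("UK",): {"all": 228, "user01": 1},
        ("UK", "22:00"): {"all": 1904},
        (): {"all": 87314, "UK": 238},
    })
    ingest(counters, {"cust_id": "user01", "session_id": 102,
                      "geography": "UK", "browser": "IE", "time": "22:02"})
    print(dict(counters[("22:00",)]), dict(counters[()]))
```

Starting from the slide 33 values, a single call to `ingest` yields the slide 34 values: seven counters move by one and nothing else changes.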
  • 36-37. Query "where time 21:00-22:00 count(*)", shown against the counter rows (row key | columns):
        21:00     | all→1345  | :00→45   | :01→62    | :02→87    | ...
        22:00     | all→3222  | :00→22   | :01→19    | :02→105   | ...
        ...
        UK        | all→229   | user01→2 | user14→12 | user99→7  | ...
        US        | all→354   | user01→4 | user04→8  | user56→17 | ...
        ...
        UK, 22:00 | all→1905  | ...
        ∅         | all→87315 | UK→239   | US→354    | ...
  • 38-39. Same counter rows; adds the query "where time 22:00-23:00, group by minute".
  • 40-41. Same counter rows; adds the query "where geography=UK group all by user,".
  • 42-43. Same counter rows; adds the query "count all".
  • 44-45. Same counter rows; adds the query "group all by geo".
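Slides 36 to 45 make the point that each query becomes a read of one pre-aggregated row. A hedged sketch follows, using the counter values shown on the slides (columns elided with "..." on the slides are simply absent here). The mapping of each query to a particular row is my reading of the slide layout, and `total` and `breakdown` are hypothetical helper names.

```python
# Counter rows as shown on slides 36-45; the empty tuple stands in for the ∅ row.
COUNTERS = {
    ("21:00",): {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    ("22:00",): {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    ("UK",): {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    ("US",): {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    ("UK", "22:00"): {"all": 1905},
    (): {"all": 87315, "UK": 239, "US": 354},
}


def total(counters, *row_key):
    """count(*) for the filter named by row_key: read the row's 'all' column."""
    return counters.get(tuple(row_key), {}).get("all", 0)


def breakdown(counters, *row_key):
    """group by: read every column of the row except the 'all' total."""
    row = counters.get(tuple(row_key), {})
    return {column: count for column, count in row.items() if column != "all"}


if __name__ == "__main__":
    print(total(COUNTERS, "21:00"))      # where time 21:00-22:00 count(*)
    print(breakdown(COUNTERS, "22:00"))  # where time 22:00-23:00, group by minute
    print(breakdown(COUNTERS, "UK"))     # where geography=UK group all by user
    print(total(COUNTERS))               # count all
    print(breakdown(COUNTERS))           # group all by geo
```

Every query is a single-row lookup, so read latency does not depend on how many events were ingested.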
  • 46.
      • APPROXIMATE AGGREGATES: Fast probabilistic data structures for COUNT UNIQUE, TOP n to trade accuracy for performance - predictably. (Figure: accuracy vs. performance, parameter k.)
      • DRILLDOWN TO ORIGINAL EVENTS: Identify the root causes of aggregate results
      • TRENDING AND CORRELATION: Proactively identify deviation from baseline and breaks from trends
      • HIERARCHICAL AGGREGATES: Automatic handling of paths, timestamps and geospatial queries
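To make the "trade accuracy for performance" idea for COUNT UNIQUE concrete, here is a sketch of one well-known probabilistic structure, a k-minimum-values (KMV) estimator. The slide does not say which structure Acunu Analytics uses, so this is a swapped-in illustration and not a description of the product; the class name and the default `k` are assumptions. It keeps only the k smallest hash values seen, so memory is bounded by k, and a larger k gives a tighter estimate.

```python
import hashlib
import heapq


class KMinValues:
    """Approximate distinct counter that retains the k smallest hash values."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # max-heap (negated values) of the k smallest hashes
        self._retained = set()  # hashes currently held, to ignore repeats

    @staticmethod
    def _hash(item):
        """Map an item to a pseudo-uniform float in [0, 1)."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._retained:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._retained.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._retained.discard(evicted)
            self._retained.add(h)

    def estimate(self):
        """Exact while fewer than k distinct hashes were seen, else (k - 1) / kth smallest."""
        if len(self._heap) < self.k:
            return float(len(self._heap))
        return (self.k - 1) / (-self._heap[0])


if __name__ == "__main__":
    sketch = KMinValues(k=1024)
    for i in range(20000):
        sketch.add(f"session-{i}")
        sketch.add(f"session-{i}")  # repeats do not change the estimate
    print(round(sketch.estimate()))
```

Two sketches built with the same k can also be merged by keeping the k smallest hashes of their union, which is what makes structures like this convenient for roll-ups.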
  • 48. Shameless plug
      • Stack, top to bottom: Dashboards UI, JSON APIs; Acunu Analytics; Enhanced Cassandra and Ops UI; Castle: storage engine; commodity HW or cloud
      • REAL-TIME BIG DATA ANALYTICS, POWERED BY NOSQL: Roll-up and transform cubes in real time. Leverage NoSQL for write-optimization, schema freedom, and horizontal scalability.
      • CASSANDRA ENHANCED FOR OPS; HIGHER DENSITY, LOWER TCO: Enhanced Cassandra for higher density, better scalability, simpler management. 'Single pane of glass' management UI.
      • STORAGE CRAFTED FOR BIG DATA: In-kernel storage engine designed and optimised for NoSQL databases.
  • 49. http://bit.ly/UBsdej
  • 50. THANK YOU. @acunu @timmoreton. Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.