All Your Base

Slides from Tim Moreton's talk on "Apache Cassandra and Why BASE is great for real-time analytics" from All Your Base. Nov 23, 2012.

Published in Technology
Transcript

  • 1. Apache Cassandra and why BASE is great for real-time analytics. Tim Moreton
  • 2.
      • Cassandra: what makes it different?
      • Who’s using it, and for what?
      • DIY Real Time Analytics on Cassandra
      • The Easy Option: Acunu Analytics
  • 3. BigTable data model, Dynamo distribution
  • 4. BigTable data model, Dynamo distribution. Open sourced, 2008. Incubator, 2009. Top-Level, 2010.
  • 6.
      • Multi-master architecture: no SPOF
      • Tunable consistency, multi-DC aware
      • High performance, optimised for writes
      • Atomic counters
  • 7. Data model
      user345: {
        chess: {
          lives: 2,
          score: 33
          ...
        }
        ...
      }
  • 8. Data model
      Row key: Rows arranged randomly around cluster. Load balanced, but no ordering. Put stuff to access sequentially within a row.
      user345: { chess: { lives: 2, score: 33 ... } ... }
      user345 | [chess, lives]: 2 | [chess, score]: 44
      user292 | [go, lives]: 4 | [monop, avatar]: top_hat | [monop, score]: 33
      user188 | [monop, score]: 13
  • 9. Data model
      Column key: Compound columns allow you to create multiple ordered ‘dictionaries’ in a row.
      user345 | [chess, lives]: 2 | [chess, score]: 44
      user292 | [go, lives]: 4 | [monop, avatar]: top_hat | [monop, score]: 33
      user188 | [monop, score]: 13
  • 10. Data model
      Flexible schemas: “Columns” are really just cell identifiers. Rows can be VERY wide.
      user345 | [chess, lives]: 2 | [chess, score]: 44
      user292 | [go, lives]: 4 | [monop, avatar]: top_hat | [monop, score]: 33
      user188 | [monop, score]: 13
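
The data model on the slides above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses in-memory dictionaries rather than a Cassandra driver, and the names `Row`, `put` and `slice` are hypothetical. It shows the one property the slides stress: rows have no useful ordering relative to each other, but cells inside a row are kept sorted by their compound column key, so one ‘dictionary’ (for example everything under `chess`) can be read as a contiguous range.

```python
from bisect import bisect_left, insort


class Row:
    """One wide row: cells kept sorted by compound column key (a tuple)."""

    def __init__(self):
        self._keys = []    # sorted list of column-key tuples
        self._cells = {}   # column-key tuple -> value

    def put(self, column, value):
        if column not in self._cells:
            insort(self._keys, column)   # keep column keys in sorted order
        self._cells[column] = value

    def slice(self, prefix):
        """Return, in order, every cell whose column key starts with `prefix`."""
        start = bisect_left(self._keys, prefix)
        cells = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break                    # left the contiguous range
            cells.append((key, self._cells[key]))
        return cells


# Row key -> Row. A plain dict stands in for the cluster: no ordering across rows.
table = {}


def put(row_key, column, value):
    table.setdefault(row_key, Row()).put(column, value)


put("user345", ("chess", "score"), 33)
put("user345", ("chess", "lives"), 2)
put("user292", ("monop", "avatar"), "top_hat")
put("user292", ("go", "lives"), 4)

chess_cells = table["user345"].slice(("chess",))
```

Because a row is just a sorted run of cells, adding a new "column" needs no schema change, which is what the "flexible schemas" slide refers to.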
  • 11. Tunable consistency (per operation)
      Write: #Replicas ONE | QUORUM | ALL
      Read: #Replicas ONE | QUORUM | ALL
  • 12. Tunable consistency (per operation)
      Write: #Replicas ONE | QUORUM | ALL
        Risk of replica failing, multiple values
      Read: #Replicas ONE | QUORUM | ALL
  • 13. Tunable consistency (per operation)
      Write: #Replicas ONE | QUORUM | ALL
        More likely to return out-of-date data
      Read: #Replicas ONE | QUORUM | ALL
  • 14. Tunable consistency (per operation)
      Write: #Replicas ONE | QUORUM | ALL
        Never going to say “ok” if a replica is down
      Read: #Replicas ONE | QUORUM | ALL
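
The trade-off on the consistency slides can be made concrete with the usual quorum-overlap rule, which the slides do not state explicitly: with N replicas, a read that waits for R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The sketch below is a hedged illustration, not Cassandra's implementation; the function names are hypothetical, and only the three levels shown on the slides are modelled.

```python
def replicas_required(level, replication_factor):
    """How many replicas must respond for the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1   # a strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_overlaps_write(write_level, read_level, replication_factor):
    """True when every read set must intersect every write set (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


def succeeds_with_one_replica_down(level, replication_factor):
    """True when the operation can still be acknowledged with one replica down."""
    return replicas_required(level, replication_factor) <= replication_factor - 1


# With three replicas: QUORUM/QUORUM overlaps, ONE/ONE may return stale data,
# and ALL cannot say "ok" while any replica is down.
quorum_pair = read_overlaps_write("QUORUM", "QUORUM", 3)
one_pair = read_overlaps_write("ONE", "ONE", 3)
all_survives = succeeds_with_one_replica_down("ALL", 3)
```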
  • 16. Multi data center aware
      DC 1: r1, r2
      DC 2: r1, r2
  • 17. Multi data center aware
      user345 → DC 1: r1, r2; DC 2: r1, r2
  • 18.
      Session Stores:
        • Read dominated
        • Updates to existing items
        • Probably fits in RAM
        • Distribute for availability
        • Want: Atomicity
      Real Time Analytics:
        • Write dominated
        • Updates very rare
        • Read “results” mostly
        • Distribute for availability, performance, capacity
        • Want: Rich queries
  • 19. An analytics app on Cassandra. Source: Twitter
  • 20. e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
      Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
      http://blog.twitter.com/2011/03/numbers.html
  • 21. e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
      Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
      http://blog.twitter.com/2011/03/numbers.html
      Cassandra approach: for each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters
  • 22.
      12:32:15 I like #trafficlights
      12:33:43 Nobody expects...
      12:33:49 I ate a #bee; woe is...
      12:34:04 Man, @acunu rocks!
      [1234, man] +1
      [1234, acunu] +1
      [1234, rock] +1
  • 23.
      12:32:15 I like #trafficlights
      12:33:43 Nobody expects...
      12:33:49 I ate a #bee; woe is...
      12:34:04 Man, @acunu rocks!
      [1234, man] +1
      [1234, acunu] +1
      [1234, rock] +1
      Key               | 00:01 | 00:02 | ...
      [01/05/11, acunu] | 3     | 5     | ...
      [02/05/11, acunu] | 12    | 4     | ...
      ...               | ...   | ...   |
      Row key is ‘big’ time bucket. Column key is ‘small’ time bucket.
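
The counter scheme on the two slides above can be sketched as follows. This is a minimal in-memory illustration, assuming a nested dictionary in place of Cassandra counter columns; `record_tweet` and `mentions_per_day` are hypothetical names, the tokeniser is a plain regular expression, and no stemming is done (the slide reduces "rocks" to "rock", this sketch does not). The row key is the ‘big’ bucket (day, term) and the column key is the ‘small’ bucket (minute).

```python
import re
from collections import defaultdict
from datetime import datetime

# (day, term) -> minute -> count. Stands in for a counter column family.
counters = defaultdict(lambda: defaultdict(int))


def record_tweet(timestamp, text):
    """For each distinct term in the tweet, bump one counter cell."""
    day = timestamp.strftime("%d/%m/%y")      # 'big' time bucket: row key
    minute = timestamp.strftime("%H:%M")      # 'small' time bucket: column key
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        counters[(day, term)][minute] += 1


def mentions_per_day(term, days):
    """Answer the query by reading counters only; no tweets are rescanned."""
    return {day: sum(counters.get((day, term), {}).values()) for day in days}


record_tweet(datetime(2011, 5, 1, 12, 34, 4), "Man, @acunu rocks!")
record_tweet(datetime(2011, 5, 1, 12, 35, 10), "acunu, acunu, acunu")
record_tweet(datetime(2011, 5, 2, 9, 0, 0), "Trying out Acunu today")

per_day = mentions_per_day("acunu", ["01/05/11", "02/05/11", "03/05/11"])
```

Counting each term once per tweet (the `set`) is a design choice of this sketch; counting every occurrence would only require dropping it.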
  • 24. Solution | Con
      • Scalability $$$
      • Not real time
      • Spartan query semantics: complex, DIY solutions
  • 25. Acunu Analytics
      High Velocity Event Streams: HTTP JSON, MQ, flume
      As events are ingested:
        ■ Update real time views
        ■ Refresh dashboards
        ■ Preserve original event data
      Provide definitions and real time views:
        create table foo (
          x long,
          y string,
          t time(hour, min),
          z path(/)
        );
        create view select sum(x) from foo where y group by z;
        create view select count from foo where x, t group by t;
      Dashboards and API: via the RESTful HTTP API, command line tools, or the UI query builder, deliver pre-computed results:
        ■ Roll-ups
        ■ Drilldowns
        ■ Trends
  • 27. count grouped by ... day
  • 28. count grouped by ... day; count distinct(session)
  • 29. count grouped by ... day; count distinct(session); count
  • 30. count grouped by ... day; count distinct(session); count; avg(duration)
  • 31. count grouped by ... day; count distinct(session); count; avg(duration); ... geography
  • 32. count grouped by ... day; count distinct(session); count; avg(duration); ... geography; ... browser
  • 33. Incoming event: { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... }
      21:00     | all→1345  :00→45  :01→62  :02→87  ...
      22:00     | all→3221  :00→22  :01→19  :02→104  ...
      ...
      UK        | all→228  user01→1  user14→12  user99→7  ...
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1904  ...
      ∅         | all→87314  UK→238  US→354  ...
  • 34. Incoming event: { cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02, ... }
      21:00     | all→1345  :00→45  :01→62  :02→87  ...
      22:00     | all→3222  :00→22  :01→19  :02→105  ...
      ...
      UK        | all→229  user01→2  user14→12  user99→7  ...
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1905  ...
      ∅         | all→87315  UK→239  US→354  ...
  • 35.
      21:00     | all→1345  :00→45  :01→62  :02→87  ...
      22:00     | all→3221  :00→22  :01→19  :02→104  ...
      ...
      UK        | all→228  user01→1  user14→12  user99→7  ...
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1904  ...
      ∅         | all→87314  UK→238  US→354  ...
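
The before-and-after counter tables on the slides above suggest that ingesting one event increments one cell in each of several roll-up rows. The sketch below illustrates that idea under stated assumptions: a nested dictionary stands in for Cassandra counters, `ingest` and `ALL_EVENTS` are hypothetical names, and the set of roll-ups is simply the one visible on the slide (hour, geography, geography with hour, and the ∅ row).

```python
from collections import defaultdict

# row key -> column key -> count
rollups = defaultdict(lambda: defaultdict(int))

ALL_EVENTS = ()   # hypothetical stand-in for the ∅ row on the slide


def ingest(event):
    """Increment every roll-up cell that this event contributes to."""
    hour, minute = event["time"].split(":")
    hour_key = f"{hour}:00"
    geo = event["geography"]
    user = event["cust_id"]
    cells = [
        (hour_key, "all"), (hour_key, f":{minute}"),   # per hour, per minute
        (geo, "all"), (geo, user),                     # per geography, per user
        ((geo, hour_key), "all"),                      # geography and hour
        (ALL_EVENTS, "all"), (ALL_EVENTS, geo),        # grand total, per geography
    ]
    for row, column in cells:
        rollups[row][column] += 1


event = {"cust_id": "user01", "session_id": 102, "geography": "UK",
         "browser": "IE", "time": "22:02"}
ingest(event)
```

Every write is a blind increment, which fits a store that is optimised for writes; the cost is that each new kind of query needs its roll-up defined before the events arrive.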
  • 36.
      21:00     | all→1345  :00→45  :01→62  :02→87  ...   ← where time 21:00-22:00, count(*)
      22:00     | all→3222  :00→22  :01→19  :02→105  ...
      ...
      UK        | all→229  user01→2  user14→12  user99→7  ...
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1905  ...
      ∅         | all→87315  UK→239  US→354  ...
  • 38.
      21:00     | all→1345  :00→45  :01→62  :02→87  ...   ← where time 21:00-22:00, count(*)
      22:00     | all→3222  :00→22  :01→19  :02→105  ...   ← where time 22:00-23:00, group by minute
      ...
      UK        | all→229  user01→2  user14→12  user99→7  ...
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1905  ...
      ∅         | all→87315  UK→239  US→354  ...
  • 40.
      21:00     | all→1345  :00→45  :01→62  :02→87  ...   ← where time 21:00-22:00, count(*)
      22:00     | all→3222  :00→22  :01→19  :02→105  ...   ← where time 22:00-23:00, group by minute
      ...
      UK        | all→229  user01→2  user14→12  user99→7  ...   ← where geography=UK, group all by user
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1905  ...
      ∅         | all→87315  UK→239  US→354  ...
  • 42.
      21:00     | all→1345  :00→45  :01→62  :02→87  ...   ← where time 21:00-22:00, count(*)
      22:00     | all→3222  :00→22  :01→19  :02→105  ...   ← where time 22:00-23:00, group by minute
      ...
      UK        | all→229  user01→2  user14→12  user99→7  ...   ← where geography=UK, group all by user
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1905  ...
      ∅         | all→87315  UK→239  US→354  ...   ← count all
  • 44.
      21:00     | all→1345  :00→45  :01→62  :02→87  ...   ← where time 21:00-22:00, count(*)
      22:00     | all→3222  :00→22  :01→19  :02→105  ...   ← where time 22:00-23:00, group by minute
      ...
      UK        | all→229  user01→2  user14→12  user99→7  ...   ← where geography=UK, group all by user
      US        | all→354  user01→4  user04→8  user56→17  ...
      ...
      UK, 22:00 | all→1905  ...
      ∅         | all→87315  UK→239  US→354  ...   ← count all; group all by geo
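
The query slides above show each query being answered from a single pre-computed row. A hedged sketch of the read side, using only the counter values visible on the slides (the elided "..." cells are left out), with hypothetical helper names `count` and `group_by` and `()` standing in for the ∅ row:

```python
# Roll-up rows copied from the slide; cells hidden behind "..." are omitted.
rollups = {
    "21:00": {"all": 1345, ":00": 45, ":01": 62, ":02": 87},
    "22:00": {"all": 3222, ":00": 22, ":01": 19, ":02": 105},
    "UK": {"all": 229, "user01": 2, "user14": 12, "user99": 7},
    "US": {"all": 354, "user01": 4, "user04": 8, "user56": 17},
    ("UK", "22:00"): {"all": 1905},
    (): {"all": 87315, "UK": 239, "US": 354},
}


def count(row_key):
    """count(*) for a filter is one cell read: the 'all' column of its row."""
    return rollups[row_key]["all"]


def group_by(row_key):
    """A group-by is one row read: every column except the 'all' total."""
    return {col: n for col, n in rollups[row_key].items() if col != "all"}


events_21h = count("21:00")        # where time 21:00-22:00, count(*)
per_minute = group_by("22:00")     # where time 22:00-23:00, group by minute
per_user_uk = group_by("UK")       # where geography=UK, group all by user
total = count(())                  # count all
per_geo = group_by(())             # group all by geo
```

None of these reads touches raw events, so query latency depends on the width of one row rather than on the volume of ingested data.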
  • 46.
      APPROXIMATE AGGREGATES: Fast probabilistic data structures for COUNT UNIQUE, TOP n to trade accuracy for performance, predictably (k: Accuracy vs. Performance)
      DRILLDOWN TO ORIGINAL EVENTS: Identify the root causes of aggregate results
      TRENDING AND CORRELATION: Proactively identify deviation from baseline and breaks from trends
      HIERARCHICAL AGGREGATES: Automatic handling of paths, timestamps and geospatial queries
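
The slide does not say which probabilistic structure Acunu Analytics uses for COUNT UNIQUE. As an illustration of the general idea only, here is a k-minimum-values (KMV) estimator, a swapped-in textbook technique and not the product's method. The class name `KMVCounter` and the choice of SHA-1 as the hash are assumptions. The parameter `k` plays the role of the accuracy/performance dial: memory is bounded by `k` hashes, and the relative error shrinks roughly as 1/sqrt(k).

```python
import hashlib
import heapq


class KMVCounter:
    """Approximate distinct count that keeps only the k smallest hash values."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # negated hashes: a max-heap over the k smallest
        self._members = set()   # hashes currently kept, to skip duplicates

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)            # fewer than k distinct: exact
        kth_smallest = -self._heap[0]
        return int((self.k - 1) / kth_smallest)


sessions = KMVCounter(k=1024)
for i in range(50000):
    sessions.add(f"session-{i % 10000}")      # 10,000 distinct values, repeated
approx_distinct = sessions.estimate()
```

The estimate is approximate by design; a larger `k` buys accuracy at the cost of memory and update time.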
  • 48. Shameless plug
      Stack: DASHBOARDS UI, JSON APIs | ACUNU ANALYTICS | OPS UI | ENHANCED CASSANDRA | CASTLE: STORAGE ENGINE | COMMODITY HW OR CLOUD
      REAL-TIME BIG DATA ANALYTICS, POWERED BY NOSQL
        ■ Roll-up and transform cubes in real time
        ■ Leverage NoSQL for write-optimization, schema freedom, and horizontal scalability
      CASSANDRA ENHANCED FOR OPS: HIGHER DENSITY, LOWER TCO
        ■ Enhanced Cassandra for higher density, better scalability, simpler management
        ■ ‘Single pane of glass’ management UI
      STORAGE CRAFTED FOR BIG DATA
        ■ In-kernel storage engine designed and optimised for NoSQL databases
  • 49. http://bit.ly/UBsdej
  • 50. THANK YOU. @acunu @timmoreton
      Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.