Real-time Cassandra


Talk given at Denormalised London, 2012-09-20. Discussion of what a real-time system needs to do and why Cassandra is a good fit.


  1. 1. Real-time Cassandra
     Richard @richardalow
  2. 2. Outline
     • What is real-time?
     • How do databases implement real-time queries?
     • Why is Cassandra ideal for real-time applications?
     • Writing real-time applications with Cassandra
  3. 3. What is real-time?
  4. 4. “Of or relating to a system in which input data is processed within milliseconds”
     “Occurring immediately” webopedia
     “...the most important requirement of a real-time system is predictability and not performance” wikipedia
     “...a time frame that is very brief, appearing to be immediate.”
     “Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds” wikipedia
  5. 5. Real-time queries
     • ‘Give me X’
     • ‘How many Y?’
     • ‘What is the top K?’
     • ‘How many distinct Z from P?’
  6. 6. Real-time definition
     • Definition: a query is processed in real-time if the time to get the answer is at most a constant times the sum of the transfer time and the round-trip time:
       $t_{\text{response}} \le C\,(t_{\text{transfer}} + t_{\text{ping}})$
  7. 7. Real-time definition
     • The more you ask for, the longer it takes
     • For small queries, the request is dominated by round-trip time
     • No query can take less time than the time to receive it
  8. 8. Real-time definition
     • Users on faster networks expect a faster response
     • What we mean by real-time is getting faster
  9. 9. Implications
     • What does this mean for the database?
     • Use Google Analytics as an example
     • Simple query: ‘How many page views have there been from France in the last 24 hours?’
  10. 10. Requirement
      • Response is one number
      • With overhead, say ~1 KB
      • Ping time 1 ms
      • 10 Mbit/s connection => 1 KB in ~1 ms
      • 2 ms total
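The arithmetic on this slide can be checked with a short sketch. The function names and the choice of C = 1 are illustrative assumptions, not part of the talk; 1 KB is taken as 1000 bytes.

```python
def transfer_time_s(payload_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Time to push the payload over the link, ignoring protocol overhead."""
    return payload_bytes * 8 / bandwidth_bits_per_s


def response_budget_s(payload_bytes: int, bandwidth_bits_per_s: float,
                      ping_s: float, c: float = 1.0) -> float:
    """Upper bound C * (t_transfer + t_ping) from the real-time definition."""
    return c * (transfer_time_s(payload_bytes, bandwidth_bits_per_s) + ping_s)


# Slide figures: ~1 KB response, 10 Mbit/s link, 1 ms ping.
budget = response_budget_s(payload_bytes=1000,
                           bandwidth_bits_per_s=10e6,
                           ping_s=0.001)
print(f"{budget * 1000:.1f} ms")  # 0.8 ms transfer + 1 ms ping, i.e. roughly 2 ms
```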
  11. 11. Solution 1
      • grep '\.fr' /var/log/apache2/*.log
      • Suppose you have 1M hits an hour => 7 GB of logs a day
      • A single disk would take ~70 s to scan them (at ~100 MB/s)
      • Need a beefy server to do this
      • The server needs to grow as your audience grows
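A quick back-of-the-envelope check of these numbers, as a sketch. The 100 MB/s sequential read rate is an assumption consistent with the 70 s figure; the 2 ms budget comes from the Requirement slide.

```python
HITS_PER_HOUR = 1_000_000
LOG_BYTES_PER_DAY = 7e9        # 7 GB of logs a day (from the slide)
DISK_BYTES_PER_S = 100e6       # assumption: ~100 MB/s sequential read
BUDGET_S = 0.002               # 2 ms real-time budget from the Requirement slide

hits_per_day = HITS_PER_HOUR * 24
bytes_per_hit = LOG_BYTES_PER_DAY / hits_per_day    # implied log line size
scan_seconds = LOG_BYTES_PER_DAY / DISK_BYTES_PER_S
slowdown = scan_seconds / BUDGET_S                  # how far over budget the scan is

print(f"{bytes_per_hit:.0f} bytes per hit")
print(f"{scan_seconds:.0f} s to scan one day of logs")
print(f"{slowdown:.0f}x over the real-time budget")
```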
  12. 12. Solution 2
      • Maintain a counter for each country
      • Increment the counter on each hit
      • On query, just read the counter
      • The counter may be on disk: ~5 ms seek
      • No need to scale speed with traffic
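A minimal in-memory sketch of the counter approach. The class and method names are hypothetical, and the sketch ignores persistence and the 24-hour window; it only shows that the query cost no longer depends on the number of hits.

```python
from collections import Counter


class CountryHitCounter:
    """Keeps one running total per country (illustrative only)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record_hit(self, country: str) -> None:
        # Work is done at ingest time: one increment per page view.
        self._counts[country] += 1

    def page_views(self, country: str) -> int:
        # The query is a single lookup, however many hits were recorded.
        return self._counts[country]


hits = CountryHitCounter()
for country in ["FR", "GB", "FR", "DE", "FR"]:
    hits.record_hit(country)
print(hits.page_views("FR"))  # 3
```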
  13. 13. Implications
      • Real-time queries can only read about as much data as they send to the requester
      • Need to precompute answers
      • Store data in a query-centric rather than a data-centric view
  14. 14. Age of data
      • A real-time query will often need to query new data
      • But not necessarily
        • Could run a batch process to precompute answers
  15. 15. Solutions
  16. 16. Solutions
      • How do you make sure you don’t read any more than you have to?
        • Denormalisation
        • Organisation of data
        • Counters
  17. 17. Denormalisation
      • Hard drive performance constraints:
        • Sequential I/O at 100s of MB/s
        • Seeks at ~100 I/O operations per second
      • Avoid random I/O
      • Effective block size ~1 MB (what a sequential read transfers in the time of one seek)
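The "effective block size" follows from the two disk figures above. A small sketch of the arithmetic, assuming 100 MB/s sequential throughput (the low end of the slide's range) and a hypothetical 4 KB item size for the random-read comparison:

```python
SEQUENTIAL_BYTES_PER_S = 100e6   # assumption: ~100 MB/s sequential I/O
SEEKS_PER_S = 100                # ~100 random I/O operations per second
ITEM_BYTES = 4096                # hypothetical small item read per seek

# Bytes a sequential read could have transferred in the time of one seek.
effective_block_bytes = SEQUENTIAL_BYTES_PER_S / SEEKS_PER_S

# Throughput when every small item costs one seek.
random_bytes_per_s = SEEKS_PER_S * ITEM_BYTES

print(f"effective block size: {effective_block_bytes / 1e6:.0f} MB")
print(f"random 4 KB reads: {random_bytes_per_s / 1e3:.1f} KB/s")
print(f"sequential is {SEQUENTIAL_BYTES_PER_S / random_bytes_per_s:.0f}x faster")
```

Reading much less than about 1 MB per seek leaves most of the disk's bandwidth unused, which is why items read together should be stored together.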
  18. 18. Denormalisation
      • Store items accessed at similar times near to each other
      • Involves copying
      • Copying isn’t bad
        • Storage costs <$100 per TB
  19. 19. Organisation of data
      • If you read 100 items off disk, ensure they are next to each other
      • Saves reading extra data around them and saves index lookups
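  20. 20. Fast range queries
      • Get me all keys in the range E to I
      [Figure: sorted keys A, F, H, I, M, X with the range [E, I] marked]
  21. 21. Fast range queries
      • What happens when you insert?
      [Figure: keys G and Q inserted into A, F, H, I, M, X; two resulting layouts compared, with the range [E, I] marked]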
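The range query on these two slides can be sketched with a sorted in-memory list and binary search. This is only an illustration of why keeping keys ordered makes range reads contiguous; the function name is hypothetical, and a real store keeps the ordered keys on disk.

```python
import bisect


def range_query(sorted_keys: list, low: str, high: str) -> list:
    """Return all keys in the closed range [low, high] from a sorted list."""
    start = bisect.bisect_left(sorted_keys, low)    # first key >= low
    end = bisect.bisect_right(sorted_keys, high)    # one past the last key <= high
    return sorted_keys[start:end]                   # one contiguous slice


keys = ["A", "F", "H", "I", "M", "X"]
print(range_query(keys, "E", "I"))   # ['F', 'H', 'I']

# Inserting while preserving order keeps later range reads contiguous.
for new_key in ["G", "Q"]:
    bisect.insort(keys, new_key)
print(range_query(keys, "E", "I"))   # ['F', 'G', 'H', 'I']
```

Note that `insort` shifts elements in memory; doing the equivalent in place on disk would mean random writes, which is the problem the merging slides address.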
  22. 22. Counters
      • For queries that simply count, increment the counter
      • Implement inc, dec, get
      • Store multiple counts, e.g. week, day, hour
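One way to picture "multiple counts" is a counter that updates an hour, a day and a week bucket on every increment. The sketch below is an in-memory illustration; the class name, key layout and bucket formats are assumptions, not something the talk specifies.

```python
from collections import defaultdict
from datetime import datetime


def bucket_keys(ts: datetime) -> dict:
    """Map a timestamp to its hour, day and ISO-week bucket labels."""
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "hour": ts.strftime("%Y-%m-%dT%H"),
        "day": ts.strftime("%Y-%m-%d"),
        "week": f"{iso_year}-W{iso_week:02d}",
    }


class BucketedCounter:
    """inc/dec/get over several time granularities at once."""

    def __init__(self) -> None:
        self._counts = defaultdict(int)

    def inc(self, name: str, ts: datetime, amount: int = 1) -> None:
        # One update per granularity, so every query is a single read.
        for granularity, bucket in bucket_keys(ts).items():
            self._counts[(name, granularity, bucket)] += amount

    def dec(self, name: str, ts: datetime, amount: int = 1) -> None:
        self.inc(name, ts, -amount)

    def get(self, name: str, granularity: str, bucket: str) -> int:
        return self._counts.get((name, granularity, bucket), 0)


views = BucketedCounter()
views.inc("FR", datetime(2012, 9, 20, 14, 30))
views.inc("FR", datetime(2012, 9, 20, 18, 5))
print(views.get("FR", "hour", "2012-09-20T14"))  # 1
print(views.get("FR", "day", "2012-09-20"))      # 2
```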
  23. 23. Cassandra and real-time
      • Write optimised
      • Fast merging
      • Distributed counters
  24. 24. Write optimised
      • All writes are sequential on disk
      • Each write is rewritten multiple times during compactions
  25. 25. Fast merging
      • How do we get from this: to this?
      [Figure: new keys G, Q plus existing sorted keys A, F, H, I, M, X, giving the sorted keys A, F, G, H, I, M, Q, X]
  26. 26. Fast merging
      • Write out a new ordered SSTable
      • When big enough, merge with existing SSTables
  27. 27. [Figure: small sorted SSTables merged step by step into larger sorted SSTables]
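The merge step can be sketched as a streaming merge of already-sorted runs. This is a simplified illustration using Python lists in place of SSTables: it only drops duplicate keys, whereas a real compaction also reconciles values by timestamp and handles deletions.

```python
import heapq


def merge_sorted_runs(*runs: list) -> list:
    """Merge sorted runs into one sorted run, reading each run sequentially."""
    merged = []
    for key in heapq.merge(*runs):
        # Equal keys arrive adjacent to each other; keep only one copy.
        if not merged or merged[-1] != key:
            merged.append(key)
    return merged


existing = ["A", "F", "H", "I", "M", "X"]   # large ordered run already on disk
new_run = sorted(["G", "Q"])                # recent writes, flushed in key order

print(merge_sorted_runs(new_run, existing))
# ['A', 'F', 'G', 'H', 'I', 'M', 'Q', 'X']
```

Because every input is consumed front to back and the output is written front to back, the merge needs only sequential I/O.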
  28. 28. How fast?
  29. 29. Distributed counters
      • Distributed, fault-tolerant, replicated counters
      • No need for distributed locks
      • Super fast
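To illustrate how a replicated counter can avoid distributed locks, here is a sketch of a PN-counter, a well-known conflict-free replicated data type. It is a swapped-in illustration of the general idea, not a description of how Cassandra implements its counters; all names are hypothetical.

```python
class PNCounter:
    """Each replica only ever updates its own slots, so no locking is needed."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.increments = {}   # replica id -> total incremented there
        self.decrements = {}   # replica id -> total decremented there

    def inc(self, amount: int = 1) -> None:
        self.increments[self.replica_id] = self.increments.get(self.replica_id, 0) + amount

    def dec(self, amount: int = 1) -> None:
        self.decrements[self.replica_id] = self.decrements.get(self.replica_id, 0) + amount

    def merge(self, other: "PNCounter") -> None:
        # Per-replica totals only grow, so taking the max is safe to repeat
        # and gives the same result in any order.
        for rid, total in other.increments.items():
            self.increments[rid] = max(self.increments.get(rid, 0), total)
        for rid, total in other.decrements.items():
            self.decrements[rid] = max(self.decrements.get(rid, 0), total)

    def value(self) -> int:
        return sum(self.increments.values()) - sum(self.decrements.values())


a, b = PNCounter("node-a"), PNCounter("node-b")
a.inc(3)          # updates accepted independently on each replica
b.inc(2)
b.dec(1)
a.merge(b)        # replicas exchange state in either order
b.merge(a)
print(a.value(), b.value())  # 4 4
```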
  30. 30. Other requirements
  31. 31. What else do we need?
      • Real-time analytics
      • High value in getting a quick response
      • High cost if the service is down
      • Need high availability
  32. 32. What else do we need?
      • Real-time analytics
      • High value in getting a quick response
      • Need low latency
      • Need data geographically close
  33. 33. Cassandra and HA
      • No SPOF (single point of failure)
      • Choose a point on the consistency and availability curve
        • Tuneable consistency
        • Replication
      • Multi data-centre support
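The trade-off behind tuneable consistency is often summarised by the rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor. A small sketch of that rule, with hypothetical helper names:

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor: int, write_replicas: int,
                         read_replicas: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                        # 2
print(reads_overlap_writes(rf, quorum(rf), quorum(rf)))  # True: quorum writes and reads
print(reads_overlap_writes(rf, 1, 1))                    # False: faster, but reads may be stale
```

Lower levels need fewer replicas to respond, which improves availability and latency at the cost of possibly stale reads.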
  34. 34. Cassandra and low latency
      • Can configure caches
      • Can parallelise reads
      • Multi-DC support enables world-wide replication
      • Can choose lower consistency to avoid round-trips to other DCs
  35. 35. Writing real-time apps with Cassandra
  36. 36. Real-time apps
      • Need to write code using a client library
      • Need to design a data model
      • If queries change, code changes
  37. 37. Acunu Analytics
      • Provides a simple RESTful interface to Cassandra counters
      • Pushes processing into the ingest phase
      [Figure: event → AA → counter updates → Cassandra]
  38. 38. Acunu Analytics
      • Event template, e.g.,
          select : ["COUNT", "AVG(loadTime)"],
          type : {
            time : [TIME(HOUR; MIN; SEC), ?, 0],
            page : PATH(/),
            loadTime : [LONG, 0, 0]
          }
      • Specifies “blow-up” strategy according to supported queries
      • Need to know the basics of the query in advance, but not the whole thing
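To make the "blow-up" idea concrete, the sketch below expands one event into every (time bucket, path prefix) pair and updates a count and a running sum for each, so that COUNT and AVG(loadTime) can later be answered with a single lookup. This is a guess at the general technique only: the function names, key layout and bucket formats are hypothetical and do not reproduce the Acunu Analytics API.

```python
from collections import defaultdict
from datetime import datetime
from itertools import product


def time_buckets(ts: datetime) -> list:
    """Hour, minute and second buckets for one timestamp."""
    return [ts.strftime("%Y-%m-%dT%H"),
            ts.strftime("%Y-%m-%dT%H:%M"),
            ts.strftime("%Y-%m-%dT%H:%M:%S")]


def path_prefixes(path: str) -> list:
    """'/blog/cassandra' -> ['/', '/blog', '/blog/cassandra']."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def blow_up(event: dict) -> list:
    """All counter keys touched by one event."""
    return list(product(time_buckets(event["time"]), path_prefixes(event["page"])))


counts = defaultdict(int)
load_time_sums = defaultdict(int)


def ingest(event: dict) -> None:
    # All the work happens at ingest: one count and one sum update per key.
    for key in blow_up(event):
        counts[key] += 1
        load_time_sums[key] += event["loadTime"]


def avg_load_time(key: tuple) -> float:
    return load_time_sums[key] / counts[key]


ingest({"time": datetime(2012, 9, 20, 14, 30, 5), "page": "/blog/cassandra", "loadTime": 120})
ingest({"time": datetime(2012, 9, 20, 14, 45, 0), "page": "/blog/hadoop", "loadTime": 80})
print(counts[("2012-09-20T14", "/blog")])          # 2
print(avg_load_time(("2012-09-20T14", "/blog")))   # 100.0
```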
  39. 39. Features
      • Simple, real-time, incremental analytics
      • Work done on ingest
      • sum, count, distinct, avg, stddev, min-max, etc.
      • Time + hierarchy bucketing
      • Efficient ‘group’ semantics
      • Works with Apache Cassandra
  40. 40. Summary
      • Formalised what real-time means
      • Deduced how data must be stored
      • Explored how Cassandra has these properties
      • Discussed how Acunu Analytics helps when writing real-time apps