Cassandra & the Acunu Data Platform

Published in: Technology

  1. 1. Cassandra & the Acunu Data Platform. Tom Wilkie, Founder & VP Engineering, @tom_wilkie
  2. 2. Before the Flood (1990): small databases; BTree indexes; BTree file systems; RAID; old hardware
  3. 3. Two Revolutions (2010): distributed, shared-nothing databases, spanning several identical stacks (...) of: write-optimised indexes; BTree file systems; RAID; new hardware
  4. 4. Bridging the Gap (2011): distributed, shared-nothing databases, spanning several identical stacks (...) of: Castle; new hardware
  5. 5. Why?
  6. 6. Snapshots* (* And clones!)
  7. 7. Small random inserts: inserting 3 billion rows. [Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]
  8. 8. Insert latency while inserting 3 billion rows. [Chart: Acunu powered Cassandra (x) vs ‘standard’ Cassandra (+)]
  9. 9. Small random range queries, performed immediately after inserts. [Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]
  10. 10. Performance summary
                                    Standard   Acunu     Benefits
      inserts         rate          ~32k/s     ~45k/s    >1.4x
                      95% latency   ~32s       ~0.3s     >100x
      gets            rate          ~100/s     ~350/s    >3.5x
                      95% latency   ~2s        ~0.5s     >4x
      range queries                 ~0.4/s     ~40/s     >100x
                      95% latency   ~15s       ~2s       >7.5x
  11. 11. How?
  12. 12. [Architecture diagram: Userspace above the Acunu Kernel, in layers from top to bottom]
      • userspace interface: shared memory interface (keys, values; async, shared memory ring; shared buffers); in-kernel workloads
      • kernelspace interface: streaming interface (range queries; key insert, buffered value insert; key get, buffered value get)
      • doubling array mapping layer: Doubling Arrays (insert queues; Bloom filters; key get; range queries; key insert; arrays management; merges)
      • modlist btree mapping layer: Arrays and Version tree (key insert, key get and range queries on btrees; value arrays)
      • block mapping & caching layer: Cache and "Extent" layer (prefetcher; flusher; page cache; extent cache; extent allocator & mapper; freespace manager)
      • Linux's block & MM layers: Linux Kernel (Block layer; Memory manager)
  13. 13. Castle [shown over the architecture diagram]
      • Open source (GPLv2, MIT for user libraries)
      • http://bitbucket.org/acunu
      • Loadable Kernel Module, targeting CentOS’s 2.6.18
      • http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
  14. 14. The Interface. [Architecture diagram excerpt: userspace and kernelspace interface layers, and the top of the doubling array layer] castle_{back,objects}.c
  15. 15. The Interface. [Diagram of versions v0 to v6]
  16. 16. Doubling Array. [Architecture diagram excerpt: interface, doubling array and modlist btree layers] castle_{da,bloom}.c
  17. 17. B-Tree [diagram labels: B, logB N] • If a node is full, split it and insert the new node into the parent (recurse) • For random inserts, nodes are placed randomly on disk
  18. 18.                 Update                        Range Query (Size Z)
      B-Tree              O(logB N) random IOs          O(Z/B) random IOs
      B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
  19. 19. Doubling Array Inserts. [Diagram: keys 2 and 9 are inserted, giving the sorted array 2 9] Buffer arrays in memory until we have > B of them
  20. 20. Doubling Array Inserts. [Diagram: keys 8 and 11 are inserted, giving the array 8 11, which merges with 2 9 into 2 8 9 11] etc... Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
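The inserts on slides 19 and 20 can be sketched in a few lines of Python. This is a toy, in-memory illustration, not Castle's implementation: the class name `DoublingArray` and its `levels` list are hypothetical. It skips the in-memory buffering the slide mentions and does not deduplicate keys. Level i holds either nothing or one sorted array of 2**i keys, and two arrays of equal size are merged sequentially into the next level.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array (hypothetical names, not Castle's code)."""

    def __init__(self):
        # levels[i] is None or a sorted list holding exactly 2**i keys
        self.levels = []

    def insert(self, key):
        carry = [key]  # every insert starts as a sorted array of size 1
        for i, level in enumerate(self.levels):
            if level is None:
                # Empty slot: the carried array settles here.
                self.levels[i] = carry
                return
            # Two arrays of equal size: merge them in one sequential pass
            # and carry the result up to the next level.
            carry = list(merge(level, carry))
            self.levels[i] = None
        self.levels.append(carry)


da = DoublingArray()
for k in (2, 9, 8, 11):  # the keys shown on the slides
    da.insert(k)
print(da.levels)  # → [None, None, [2, 8, 9, 11]]
```

Because every merge reads and writes whole sorted arrays, the work is sequential, which is where the slides' sequential-IO bounds come from.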
  21. 21. Demo: https://acunu-videos.s3.amazonaws.com/dajs.html
  22. 22.                 Update                        Range Query (Size Z)
      B-Tree              O(logB N) random IOs          O(Z/B) random IOs
      Doubling Array      O((log N)/B) sequential IOs
      B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
  23. 23. Doubling Array Queries: query(k) • Add an index to each array to do lookups • query(k) searches each array independently
  24. 24. Doubling Array Queries: query(k) • Bloom Filters can help exclude arrays from search • ... but don’t help with range queries
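A minimal sketch of query(k) as described on slides 23 and 24, under stated assumptions: the per-array "index" is stood in for by a binary search over an in-memory sorted list, and the `BloomFilter` class (its size and hash construction) is a hypothetical illustration rather than Castle's filter. Each array is searched independently, and the filter lets a lookup skip arrays that cannot contain the key.

```python
import hashlib
from bisect import bisect_left


class BloomFilter:
    """Small Bloom filter; parameters are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build(keys):
    """Return a (sorted array, Bloom filter) pair for one array."""
    arr = sorted(keys)
    bloom = BloomFilter()
    for k in arr:
        bloom.add(k)
    return arr, bloom


def query(arrays, key):
    """Search each array in turn; return (found, arrays actually searched)."""
    searched = 0
    for arr, bloom in arrays:
        if not bloom.might_contain(key):
            continue  # the filter excludes this array from the search
        searched += 1
        i = bisect_left(arr, key)
        if i < len(arr) and arr[i] == key:
            return True, searched
    return False, searched


arrays = [build([8, 11]), build([2, 9, 20, 31])]
print(query(arrays, 9))
print(query(arrays, 5))
```

A range query has to visit every array whose key range overlaps the query, and a membership filter on single keys cannot rule any of them out, which is the limitation the slide notes.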
  25. 25.                 Update                        Range Query (Size Z)
      B-Tree              O(logB N) random IOs          O(Z/B) random IOs
      Doubling Array      O((log N)/B) sequential IOs   O(Z/B) sequential IOs
      B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
      • B-Tree: ~ log (2^30)/log 100 = 5 IOs/update; 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s; 100 / 5 = 20 updates/s
      • Doubling Array: ~ log (2^30)/100 = 0.2 IOs/update; 8KB @ 100MB/s = 13k IOs/s; 13k / 0.2 = 65k updates/s
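The back-of-envelope figures on slide 25 can be re-derived with a short script. Two assumptions are mine, not the slide's: "log" is taken as the natural logarithm (which reproduces the slide's ~0.2 IOs/update), and 100MB/s is read as binary megabytes. The variable names are hypothetical.

```python
import math

N = 2**30                      # number of keys, as on the slide
B = 100                        # entries per 8KB block at ~100 bytes/entry
BLOCK_BYTES = 8 * 1024
BANDWIDTH = 100 * 1024 * 1024  # 100MB/s, assuming binary megabytes
SEEK_S = 0.008                 # 8ms seek per random IO

# Assumption: natural log throughout.
btree_ios_per_update = math.log(N) / math.log(B)  # about 4.5; slide rounds to 5
da_ios_per_update = math.log(N) / B               # about 0.21; slide says 0.2

# A random IO pays a seek plus the transfer; a sequential IO pays only the transfer.
random_ios_per_s = 1 / (SEEK_S + BLOCK_BYTES / BANDWIDTH)  # about 124; slide says 100
sequential_ios_per_s = BANDWIDTH / BLOCK_BYTES             # 12800; slide says 13k

btree_updates_per_s = random_ios_per_s / btree_ios_per_update
da_updates_per_s = sequential_ios_per_s / da_ios_per_update

print(f"B-Tree:         {btree_updates_per_s:.0f} updates/s (slide: 100 / 5 = 20)")
print(f"Doubling Array: {da_updates_per_s:.0f} updates/s (slide: 13k / 0.2 = 65k)")
```

Without the slide's rounding, the script lands at roughly 27 and 62k updates/s, the same orders of magnitude as the slide's 20 and 65k.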
  26. 26. Doubling Array. [Architecture diagram excerpt, as on slide 16] castle_{da,bloom}.c
  27. 27. “Mod-list” B-Tree. [Architecture diagram excerpt: doubling array, modlist btree, and block mapping & caching layers] castle_{btree,versions}.c
  28. 28. Copy-on-Write BTree. Idea: • Apply path-copying [DSST] to the B-tree. Problems: • Space blowup: each update may rewrite an entire path • Slow updates: as above. A log file system makes updates sequential, but relies on random access and garbage collection (Achilles heel!)
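Path-copying is easiest to see on a small tree. The sketch below uses a plain binary search tree instead of a B-tree, purely to keep it short; the `Node` type and helper names are hypothetical. An update returns a new root and copies only the nodes on the search path, so the old version stays readable and untouched subtrees are shared.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return the root of a new version; only the search path is copied."""
    if root is None:
        return Node(key)
    if key < root.key:
        return root._replace(left=insert(root.left, key))
    if key > root.key:
        return root._replace(right=insert(root.right, key))
    return root  # key already present: nothing to copy


def keys(root):
    """In-order traversal, i.e. the keys visible at this version."""
    return [] if root is None else keys(root.left) + [root.key] + keys(root.right)


v0 = None
for k in (5, 2, 8):
    v0 = insert(v0, k)
v1 = insert(v0, 9)  # a new version; v0 is still intact

print(keys(v0), keys(v1))
print(v1.left is v0.left)  # the untouched left subtree is shared, not copied
```

Every update rewrites a whole root-to-leaf path, which is the source of the space blowup and slow updates listed on the slide.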
  29. 29.                 Update                   Range Query           Space
      CoW B-Tree          O(logB Nv) random IOs    O(Z/B) random IOs     O(N B logB Nv)
      Nv = #keys live (accessible) at version v
  30. 30. “BigTable” snapshots • Inserts produce arrays. [Diagram: version v1 with arrays a and b, each labelled 1]
  31. 31. “BigTable” snapshots • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays. [Diagram: versions v1 and v2 with arrays a, b and c, labelled with counts 1 and 2]
  32. 32. “BigTable” snapshots • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays. [Diagram: versions v1 and v2; arrays a and b each labelled 1, plus a merged array a b c labelled 1]
  33. 33. “BigTable” snapshots • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays • Space blowup. [Diagram: versions v1 and v2; arrays a and b each labelled 1, plus a merged array a b c labelled 1]
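The reference-counting scheme on slides 30 to 33 can be sketched as follows. This is an illustration under my own assumptions about the bookkeeping: the `SnapshotStore` class and its method names are hypothetical and are not taken from BigTable or Castle. It replays the slides' sequence: insert a and b at v1, snapshot to v2, insert c, then merge at v2.

```python
class SnapshotStore:
    """Toy store of immutable, reference-counted arrays (hypothetical)."""

    def __init__(self):
        self.arrays = {}    # array id -> sorted list of keys
        self.refs = {}      # array id -> reference count
        self.versions = {}  # version name -> list of array ids
        self._next_id = 0

    def _add_array(self, keys):
        aid = self._next_id
        self._next_id += 1
        self.arrays[aid] = sorted(keys)
        self.refs[aid] = 1
        return aid

    def insert(self, version, keys):
        # Inserts produce arrays.
        self.versions.setdefault(version, []).append(self._add_array(keys))

    def snapshot(self, src, dst):
        # Snapshots increment ref counts on the arrays they share.
        self.versions[dst] = list(self.versions[src])
        for aid in self.versions[dst]:
            self.refs[aid] += 1

    def merge(self, version):
        # Merges produce a new array and decrement ref counts on old ones.
        old = self.versions[version]
        merged = sorted({k for aid in old for k in self.arrays[aid]})
        self.versions[version] = [self._add_array(merged)]
        for aid in old:
            self.refs[aid] -= 1
            if self.refs[aid] == 0:
                del self.arrays[aid], self.refs[aid]

    def stored_keys(self):
        return sum(len(a) for a in self.arrays.values())


store = SnapshotStore()
store.insert("v1", ["a"])
store.insert("v1", ["b"])
store.snapshot("v1", "v2")
store.insert("v2", ["c"])
store.merge("v2")
print(store.stored_keys())  # → 5
```

Only three distinct keys exist, yet five are stored after the merge, because v1 still pins the old arrays a and b. That duplication is the space blowup the slide points out.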
  34. 34.                     Update                        Range Query             Space
      CoW B-Tree              O(logB Nv) random IOs         O(Z/B) random IOs       O(N B logB Nv)
      “BigTable” style DA     O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)
      Nv = #keys live (accessible) at version v
  35. 35. “Mod-list” BTree. Idea: • Apply fat-nodes [DSST] to the B-tree • i.e. insert (key, version, value) tuples, with special operations. Problems: • Similar performance to a BTree. If you limit the #versions, it can be constructed sequentially, and embedded into a DA
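A rough sketch of the (key, version, value) idea. The slide does not spell out its "special operations", so the lookup rule here is my assumption: a read at version v returns the value written at v or at its nearest ancestor in the version tree. The `ModList` class and the example version tree are hypothetical.

```python
class ModList:
    """Toy fat-node store: each key keeps one value per writing version."""

    def __init__(self, parent):
        self.parent = parent  # version -> parent version (None for the root)
        self.mods = {}        # key -> {version: value}

    def put(self, key, version, value):
        self.mods.setdefault(key, {})[version] = value

    def get(self, key, version):
        # Assumed rule: walk up the version tree to the nearest writer.
        by_version = self.mods.get(key, {})
        while version is not None:
            if version in by_version:
                return by_version[version]
            version = self.parent[version]
        return None

    def tuples(self):
        """All (key, version, value) tuples, sorted by key then version."""
        return sorted(
            (k, v, val) for k, vs in self.mods.items() for v, val in vs.items()
        )


# Hypothetical version tree: v1 and v2 are both children of v0.
store = ModList({"v0": None, "v1": "v0", "v2": "v0"})
store.put("k1", "v0", "a")
store.put("k1", "v1", "b")

print(store.get("k1", "v1"))
print(store.get("k1", "v2"))
print(store.tuples())
```

Each write is stored once regardless of how many versions can see it, in contrast to copying a path or a whole array per version.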
  36. 36.                               Update                        Range Query             Space
      CoW B-Tree                        O(logB Nv) random IOs         O(Z/B) random IOs       O(N B logB Nv)
      “BigTable” style DA (LevelDB)     O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)
      “Mod-list” in a DA (CASTLE)       O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(N)
      Nv = #keys live (accessible) at version v
  37. 37. Stratified BTree. Problem: • Embedded “Mod-list” #versions limit. Solution: • Version-split arrays during merges. [Diagram: a newer and an older array over keys k1 to k5, with entries at versions v0, v1 and v2, are merged (duplicates removed), then v-split into a {v2} array and a {v1,v0} array; a note marks some entries as duplicates]
  38. 38. “Mod-list” B-Tree. [Architecture diagram excerpt, as on slide 27] castle_{btree,versions}.c
  39. 39. Disk Layout: RDA. [Architecture diagram excerpt: modlist btree, block mapping & caching, and Linux block & MM layers] castle_{cache,extent,freespace,rebuild}.c
  40. 40. Disk Layout: RDA (random duplicate allocation). [Diagram: blocks numbered 1 to 16, each appearing twice]
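A minimal sketch of random duplicate allocation, assuming four disks (the slide does not state the number) and assuming reads are steered to the less loaded of a block's two copies, which is the load-balancing motivation given in the title of the [RDA] reference. The function names are hypothetical, and this does not describe Castle's actual allocator.

```python
import random


def allocate(num_blocks, num_disks, seed=0):
    """Place each block on two distinct, randomly chosen disks."""
    rng = random.Random(seed)
    return {
        block: tuple(rng.sample(range(num_disks), 2))
        for block in range(1, num_blocks + 1)
    }


def schedule_reads(blocks, placement, num_disks):
    """Send each read to whichever replica's disk has fewer reads so far."""
    load = [0] * num_disks
    for block in blocks:
        disk = min(placement[block], key=lambda d: load[d])
        load[disk] += 1
    return load


placement = allocate(num_blocks=16, num_disks=4)  # 16 blocks, as on the slide
load = schedule_reads(range(1, 17), placement, num_disks=4)
print(placement)
print(load)
```

Keeping two copies also means a block survives the loss of one disk, at the cost of twice the space.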
  41. 41. Future
  42. 42. Memcache + Cassandra: 100k random inserts/sec! [Diagram: a Cass client (get/insert) and memcached (get/put) in front of several identical stacks (...) of Cassandra and memcache on Castle on H/W]
  43. 43. [Diagram: the versions v1, v12, v13, v15, v16 and v24, drawn four times]
  44. 44. • Castle: like BDB, but for Big Data • 2 orders of magnitude better performance and predictability • Part of the Acunu Data Platform
  45. 45. Questions? Tom Wilkie, @tom_wilkie, tom@acunu.com • http://bitbucket.org/acunu • http://github.com/acunu • http://www.acunu.com/download • http://www.acunu.com/insights
  46. 46. References
      • [LSM] The Log-Structured Merge-Tree (LSM-Tree), Patrick O’Neil, Edward Cheng, Dieter Gawlick, Elizabeth O’Neil. http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
      • [COLA] Cache-Oblivious Streaming B-trees, Michael A. Bender et al. http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf
      • [DSST] Making Data Structures Persistent, J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989. http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf
      • Stratified B-trees and versioned dictionaries, Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, HotStorage’11. http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf
      • [RDA] Random duplicate storage strategies for load balancing in multimedia servers, 2000, Joep Aerts and Jan Korst and Sebastian Egner. http://www.win.tue.nl/~joep/IPL.ps
      Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.
