Cassandra on Castle

An intro to Acunu's Castle project and what it can do for Cassandra. Talk given at Cassandra NYC meetup, Sep 22 2011.

Published in: Technology
Comments
  • We have a backup tool available to customers that we'll be including in the next point release very shortly. (Cassandra tolerates node and network failure well, but I agree that's not the same as disaster recovery.)
  • From the release notes:
    [921] No backup support

    There is no systematic way to backup or restore data stored in the Acunu Storage Core.

    Refreshingly honest, but how would someone using an Acunu-based system work around this?

  1. Cassandra on Castle. Tim Moreton, @timmoreton. Saturday, 24 September 2011
  2. Outline
     • Why Castle?
     • A [quick] tour of Castle
     • Cassandra on Castle
     • An aside into Memcache
     • Cross-cluster snapshots and clones
  3. Before the Flood (1990). Stack, top to bottom: small databases; BTree indexes; BTree file systems; RAID; old hardware.
  4. Two Revolutions (2010). Distributed, shared-nothing databases on top of several nodes, each with its own stack: write-optimised indexes; BTree file systems; RAID; new hardware.
  5. Bridging the Gap (2011). Distributed, shared-nothing databases on top of several nodes, each running Castle directly on new hardware.
  6. [Architecture diagram: Acunu Kernel, between userspace workloads and the Linux kernel. Layers, top to bottom:]
     • Userspace interface: shared memory interface (keys, values), in-kernel workloads, async shared memory ring, shared buffers, streaming interface
     • Kernelspace interface: key insert, key get, buffered value insert, buffered value get, range queries
     • Doubling array mapping layer: doubling arrays, Bloom filters, insert queues, array management, merges (key insert, key get, range queries)
     • Modlist btree mapping layer: version tree, btree (key insert, key get, range queries), value arrays
     • Block mapping & cacheing layer: cache, "extent" layer, extent allocator & mapper, freespace manager, extent block cache, prefetcher, flusher, page cache
     • Linux's block & MM layers: block layer, memory manager
  7. Castle
     • Like ZFS+BDB for Big Data
     • Open source (GPLv2, MIT for user libraries)
     • http://bitbucket.org/acunu
     • Loadable Kernel Module, targeting CentOS's 2.6.18
     • http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
     [Architecture diagram as on slide 6]
  8. The Interface [Architecture diagram as on slide 6, cropped to the userspace and kernelspace interface layers] castle_{back,objects}.c
  9. The Interface
     • Create, snapshot, clone
     • Attach/detach
     • Keys: any dimensional
     • Values: any size
     • Simple get, put, delete
     • Iterator, slice interfaces
     • Streaming interface
     [Diagram labelled "Tree of versions" and "Attachment": version nodes v0, v1, v3, v12, v13, v15, v16, v24]
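To make the "tree of versions" idea concrete, here is a toy in-memory sketch. It is not the libcastle API: the class `VersionedStore` and its methods are hypothetical, and it assumes that a version becomes read-only once it has children and that each version stores only its own modifications, with reads falling back to its ancestors.

```python
class VersionedStore:
    """Toy tree of versions (hypothetical; not Castle's actual interface)."""

    _TOMBSTONE = object()  # marks a key deleted in a given version

    def __init__(self):
        self.parent = {0: None}   # version id -> parent version id
        self.children = {0: 0}    # version id -> number of child versions
        self.mods = {0: {}}       # version id -> keys modified in that version
        self._next_id = 1

    def clone(self, version):
        """Create a new writable child of `version` and return its id."""
        child = self._next_id
        self._next_id += 1
        self.parent[child] = version
        self.children[child] = 0
        self.children[version] += 1
        self.mods[child] = {}
        return child

    def _check_writable(self, version):
        # Assumption: a version with children is frozen, so children never
        # observe later changes made to their parent.
        if self.children[version]:
            raise PermissionError(f"version {version} has children and is read-only")

    def put(self, version, key, value):
        self._check_writable(version)
        self.mods[version][key] = value

    def delete(self, version, key):
        self._check_writable(version)
        self.mods[version][key] = self._TOMBSTONE

    def get(self, version, key):
        """Look in this version first, then walk up through its ancestors."""
        v = version
        while v is not None:
            if key in self.mods[v]:
                value = self.mods[v][key]
                return None if value is self._TOMBSTONE else value
            v = self.parent[v]
        return None
```

Cloning here copies no data, which is the property that makes snapshots and clones cheap; the cost moves to reads, which may have to consult several ancestors.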
  10. The Interface [Same diagram as slide 8] castle_{back,objects}.c
  11. Doubling Array [Architecture diagram as on slide 6, cropped to the interface, doubling array mapping and modlist btree mapping layers] castle_{da,bloom}.c
  12. Doubling Array Inserts [Diagram: keys 2 and 9 being inserted] Buffer arrays in memory until we have > B of them
  13. Doubling Array Inserts [Diagram: keys 11 and 8 are inserted and merged with 2 and 9 into the sorted array 2 8 9 11] etc...
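The insert behaviour on slides 12 and 13 can be sketched in a few lines. This is a minimal in-memory illustration, assuming the textbook doubling-array scheme in which level i holds either nothing or one sorted array of 2^i keys, and a full level is merged into the next one. The class and method names are hypothetical; this is not Castle's implementation, which also buffers arrays in memory and uses Bloom filters to skip levels on lookup.

```python
import heapq
from bisect import bisect_left


class DoublingArray:
    """Toy doubling array: levels[i] is None or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]  # a sorted array of size 1
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # Level occupied: merge the two sorted arrays in one sequential
            # pass and carry the result up to the next level.
            carry = list(heapq.merge(self.levels[i], carry))
            self.levels[i] = None
            i += 1

    def contains(self, key):
        # A lookup has to binary-search every non-empty level.
        for level in self.levels:
            if level is None:
                continue
            pos = bisect_left(level, key)
            if pos < len(level) and level[pos] == key:
                return True
        return False


da = DoublingArray()
for k in (2, 9, 11, 8):
    da.insert(k)
print(da.levels)  # [None, None, [2, 8, 9, 11]]
```

Every write in this scheme is a sequential merge of sorted arrays, which is why the update cost on the next slide is counted in sequential rather than random IOs. Duplicate keys are not handled in this sketch.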
  14. B = "block size", say 8KB at 100 bytes/entry ~= 100 entries

                       Update                        Range Query (Size Z)
      B-Tree           O(log_B N) random IOs         O(Z/B) random IOs
      Doubling Array   O((log N)/B) sequential IOs   O(Z/B) sequential IOs

      B-Tree: ~ log(2^30)/log 100 = 5 IOs/update; 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s; 100 / 5 = 20 updates/s
      Doubling Array: ~ log(2^30)/100 = 0.2 IOs/update; 8KB @ 100MB/s = 13k IOs/s; 13k / 0.2 = 65k updates/s
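The back-of-envelope figures on slide 14 can be reproduced as follows. The variable names are illustrative, and two assumptions are made that the slide does not state: the logarithm is taken as the natural log (it reproduces the slide's 0.2 IOs/update), and MB and KB are treated as binary units. The slide rounds its intermediate values, so the results agree only to within rounding.

```python
import math

N = 2 ** 30                     # number of keys, as on the slide
ENTRIES_PER_BLOCK = 100         # B: an 8KB block at ~100 bytes/entry
BLOCK_BYTES = 8 * 1024
BANDWIDTH = 100 * 1024 * 1024   # 100MB/s
SEEK_SECONDS = 0.008            # 8ms seek

# B-tree: each update costs about log_B(N) random IOs.
btree_ios_per_update = math.log(N) / math.log(ENTRIES_PER_BLOCK)
random_ios_per_s = 1 / (SEEK_SECONDS + BLOCK_BYTES / BANDWIDTH)
btree_updates_per_s = random_ios_per_s / btree_ios_per_update

# Doubling array: each update costs about (log N)/B sequential IOs.
da_ios_per_update = math.log(N) / ENTRIES_PER_BLOCK
sequential_ios_per_s = BANDWIDTH / BLOCK_BYTES
da_updates_per_s = sequential_ios_per_s / da_ios_per_update

print(f"B-tree: {btree_ios_per_update:.1f} IOs/update, "
      f"{random_ios_per_s:.0f} IOs/s, {btree_updates_per_s:.0f} updates/s")
print(f"Doubling array: {da_ios_per_update:.2f} IOs/update, "
      f"{sequential_ios_per_s:.0f} IOs/s, {da_updates_per_s:.0f} updates/s")
```

Whatever the exact rounding, the gap between the two structures comes out at roughly three orders of magnitude, which is the point the slide is making.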
  15. Doubling Array [Same diagram as slide 11] castle_{da,bloom}.c
  16. "Mod-list" B-Tree [Architecture diagram as on slide 6, cropped to the doubling array mapping, modlist btree mapping and block mapping & cacheing layers] So how to do snapshots and clones? castle_{btree,versions}.c
  17. Copy-on-Write BTree
      Idea:
      • Apply path-copying [DSST] to the B-tree
      Problems:
      • Space blowup: each update may rewrite an entire path
      • Slow updates: as above
      A log file system makes updates sequential, but relies on random access and garbage collection (Achilles heel!)
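The space blowup from path copying is easy to see in a small example. The sketch below applies path copying to a plain binary search tree rather than a B-tree, purely to keep it short; the names are illustrative. Each insert returns a new root and leaves the old version intact, at the price of rewriting every node on the path from the root to the change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return (new_root, nodes_written); the old root remains a valid version."""
    if root is None:
        return Node(key), 1
    if key == root.key:
        return root, 0  # already present: nothing is rewritten
    if key < root.key:
        child, written = insert(root.left, key)
        if written == 0:
            return root, 0
        # Copy this node so it points at the new subtree; share the other side.
        return Node(root.key, child, root.right), written + 1
    child, written = insert(root.right, key)
    if written == 0:
        return root, 0
    return Node(root.key, root.left, child), written + 1


def keys(root):
    """In-order traversal of one version."""
    if root is None:
        return []
    return keys(root.left) + [root.key] + keys(root.right)


v1 = None
for k in (5, 3, 8):
    v1, _ = insert(v1, k)
v2, written = insert(v1, 4)
print(keys(v1), keys(v2), written)  # [3, 5, 8] [3, 4, 5, 8] 3
```

Adding one key wrote three nodes (copies of 5 and 3, plus the new 4), while the untouched subtree rooted at 8 is shared between both versions. In a B-tree the same effect means one rewritten block per level for every update.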
  18. Nv = #keys live (accessible) at version v

                                       Update                        Range Query             Space
      CoW B-Tree                       O(log_B Nv) random IOs        O(Z/B) random IOs       O(N B log_B Nv)
      "BigTable"-style DA (LevelDB)    O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(VN)
      "Mod-list" in a DA (Castle)      O((log N)/B) sequential IOs   O(Z/B) sequential IOs   O(N)
  19. Stratified B-Trees
      • Retires Copy-On-Write B-Trees, the bedrock of modern storage (Sun ZFS, NetApp WAFL, ...)
      • Patent-pending, next-generation data structure
      • Theoretically optimal, yet highly practical
      http://goo.gl/INTb1
      http://goo.gl/gzihe

      [Paper excerpts shown on the slide]
      Copy-on-write B-tree finally beaten. Andy Twigg∗, Andrew Byde∗, Grzegorz Miłoś∗, Tim Moreton∗, John Wilkes†∗ and Tom Wilkie∗ (∗Acunu, †Google), firstname@acunu.com

      Abstract: A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree – it underlies many of today's file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn't inherit the B-tree's optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the 'stratified B-tree', which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores.

      This paper presents some recent results on new constructions for B-trees that go beyond copy-on-write, that we call 'stratified B-trees'. They solve two open problems: Firstly, they offer a fully-versioned B-tree with optimal space and the same lookup time as the CoW B-tree. Secondly, they are the first to offer other points on the Pareto optimal query/update tradeoff curve, and in particular, our structures offer fully-versioned updates in o(1) IOs, while using linear space. Experimental results indicate 100,000s updates/s on a large SATA disk, two orders of magnitude faster than a CoW B-tree. Since stratified B-trees subsume CoW B-trees (and indeed all other known versioned external-memory dictionaries), we believe there is no longer a good reason to use them for versioned data stores. Acunu is developing a commercial in-kernel implementation of stratified B-trees, which we hope to release soon.

      1 Introduction: The B-tree was presented in 1972 [1], and it survives ...
  20. "Mod-list" B-Tree [Same diagram as slide 16] castle_{btree,versions}.c
  21. Disk Layout: RDA [Architecture diagram as on slide 6, cropped to the modlist btree mapping layer, the block mapping & cacheing layer and Linux's block & MM layers] castle_{cache,extent,freespace,rebuild}.c
  22. Disk Layout: RDA (random duplicate allocation) [Diagram: blocks numbered 1 to 16, each appearing twice, in shuffled positions]
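The slide does not spell out the allocation policy, so the following is only a sketch of what the name and the diagram suggest: each block is written twice, to two distinct disks chosen at random. The function name and its parameters are hypothetical and are not taken from the Castle source.

```python
import random


def random_duplicate_allocation(num_blocks, num_disks, seed=None):
    """Map each block number (1..num_blocks) to two distinct random disks.

    Assumption: "duplicate" means exactly two copies, never on the same disk.
    """
    if num_disks < 2:
        raise ValueError("need at least two disks to hold two copies")
    rng = random.Random(seed)
    return {
        block: tuple(rng.sample(range(num_disks), 2))
        for block in range(1, num_blocks + 1)
    }


placement = random_duplicate_allocation(num_blocks=16, num_disks=4, seed=1)
for block, disks in placement.items():
    print(f"block {block:2d} -> disks {disks}")
```

Under this assumption any single disk can fail without losing a block, and the surviving copies of its blocks are scattered over all the other disks, so the read load of a rebuild is spread rather than concentrated on one mirror.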
  23. SSD tiering [taster]
      • Why? Key to >cache random reads
      • v1: SSD for metadata structures
      • Redundancy provided by disk
      • SSD for selected collection data (CFs)
      • 10x the write rate on SSDs compared with regular FSs
  24. [Architecture diagram, same as slide 6]
  25. Cassandra on Castle
      • Eliminate all 'storage heavy lifting'
      • Extend ColumnFamilyStore
      • Efficient JNI bindings to libcastle C library
      • row, col, value, t: (row, col) -> (t,value)
      • row, a|b|c|d, value, t: (row, a, b, c, d, col) -> (t,value)
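The key mapping in the last two bullets can be illustrated with a toy model. This is a sketch only: a Python dict stands in for Castle, the function names are hypothetical, and a composite column name written as `a|b|c|d` is assumed to split on `|` into one key dimension per component. How the trailing `col` element on the slide relates to the composite is not spelled out, so it is not modelled here. Keeping the value with the newest timestamp follows Cassandra's usual last-write-wins rule.

```python
def castle_key(row, column):
    """Build a multi-dimensional key from a row key and a column name."""
    return (row, *column.split("|"))


def put(store, row, column, value, timestamp):
    """Store (timestamp, value) under the key, keeping the newest write."""
    key = castle_key(row, column)
    current = store.get(key)
    if current is None or timestamp >= current[0]:
        store[key] = (timestamp, value)


def row_slice(store, row):
    """Return all entries of one row in key order, like a range query."""
    return sorted((k, v) for k, v in store.items() if k[0] == row)


store = {}
put(store, "user42", "name", "Tim", timestamp=10)
put(store, "user42", "2011|09|24|talk", "Cassandra on Castle", timestamp=11)
put(store, "user42", "name", "stale write", timestamp=5)  # older, ignored
print(row_slice(store, "user42"))
```

Because the row key is the first dimension, all columns of a row are adjacent in key order, so a row slice becomes a single range query.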
  26. Small random inserts: inserting 3 billion rows [Chart comparing Acunu-powered Cassandra with 'standard' Cassandra]
  27. Insert latency while inserting 3 billion rows [Chart comparing Acunu-powered Cassandra (x) with 'standard' Cassandra (+)]
  28. Small random range queries, performed immediately after inserts [Chart comparing Acunu-powered Cassandra with 'standard' Cassandra]
  29. Memcache + Cassandra [Diagram: a Cassandra client (get/insert) and a memcached client (get/put) talk to the same nodes; on each node, Cassandra and memcache, each with replication logic, sit on Castle, which runs on the hardware] Same data! 100k random inserts/sec!
  30. v2: Cross-cluster versions
      • Eventually consistent
      • Spans data centers
      • Tolerates node failure, network partition
      • High performance, no space overhead
      • Dev/Test/Staging on Prod clusters
  31. So...
      • Castle = ZFS + BDB for Big Data
      • Cassandra on Castle runs apps unmodified
      • Up to 100x throughput under load
      • No GC pauses: very predictable latencies
      • v2: Cross-cluster snapshot and clone
      • SSD optimisation
  33. Questions? Tim Moreton // @timmoreton
      http://goo.gl/INTb1
      http://goo.gl/gzihe
      Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.
