Castle: ReinventingStorage for Big Data      Tom WilkieFounder & VP Engineering     @tom_wilkie
Before the Flood         1990     Small databases      BTree indexes    BTree File systems          RAID      Old hardware
Two Revolutions                          2010     Distributed, shared-nothing databasesWrite-optimised indexes          Wr...
Bridging the Gap                 2011  Distributed, shared-nothing databases   Castle                      Castle         ...
SH OT S*SN AP         * And clones!
What’s in the Castle?
Shared memory interface                                                            keys                                   ...
Shared memory interface                                                                                                   ...
The Interface               Shared memory interface                                                   keys                ...
The Interface              v0      v1           v3v12   v13    v15      v16   v24
interfaceuserspac                                             values                                                      ...
B-TreeB                •   If node is full, split and                    insert new node into                    parent (r...
Range Query                      Update                                          (Size Z)                       O(logB N) ...
Doubling Array                      Inserts   2        2   9   9 Buffer arrays in memoryuntil we have > B of them
Doubling Array                       Inserts11          2   9       2   8   9   11 8          8   11                      ...
Demohttps://acunu-videos.s3.amazonaws.com/dajs.html
Range Query                         Update                                             (Size Z)                          O...
Doubling Array                 Queries              query(k)• Add an index to each array to do lookups• query(k) searches ...
Doubling Array                   Queries                query(k)• Bloom Filters can help exclude arrays from  search• ... ...
8KB @ 100MB/s, w/ 8ms seek      100 / 5                        = 100 IOs/s          = 20 updates/s~ log (2^30)/log 100= 5 ...
interfaceuserspac                                             values                                                      ...
Doubling Arraysdoubling arraymapping layer                  “Mod-list” B-Tree                                insert       ...
Copy-on-Write BTree                             Idea:                          • Apply path-copying [DSST] to             ...
Range             Update                          Space                            QueryCoW B-        O(logB Nv)      O(Z/...
“BigTable” snapshotsv1                   • Inserts produce arrays1    a   1   b
“BigTable” snapshotsv1       v2                          • Inserts produce arrays21    a   2         1    b   1   c   • Sn...
“BigTable” snapshotsv1       v2                         • Inserts produce arrays1    a       1       b   • Snapshots incre...
“BigTable” snapshotsv1       v2                         • Inserts produce arrays1    a       1       b   • Snapshots incre...
Range               Update                             Space                                Query CoW B-         O(logB Nv...
“Mod-list” BTree                                Idea:                             • Apply fat-nodes [DSST] to the         ...
Range               Update                             Space                                Query CoW B-         O(logB Nv...
Stratified BTree        v1   v2        v2    v1        v2    v1                        v0    v1        v0    v1   v0   v1  ...
Doubling Arraysdoubling arraymapping layer                  “Mod-list” B-Tree                                insert       ...
Arraysmapping layermodlist btree                                key                                               Version ...
Disk Layout: RDA          random duplicate allocation 4    2      1    4    5    2    5    3    1    3 7    10     7    6 ...
PerformanceComparison
Small random inserts  Inserting 3 billion rows                 Acunu powered Cassandra -                      ‘standard’ C...
Insert latencyWhile inserting 3 billion rows                     Acunu powered Cassandra x                          ‘stand...
Small random range queriesPerformed immediately after inserts                       Acunu powered Cassandra -             ...
Performance summary                Standard   Acunu    Benefitsinserts rate     ~32k/s    ~45k/s   >1.4x95% latency       ~...
Future
Memcache + Cassandraget/insert                            get/put                  Cass client                     memcach...
v1                 v1                      v1                           v1v12         v13                 v15  v12        ...
• Castle: like BDB, but for Big Data• DA: transforms random IO into  sequential IO• Snapshots & Clones: addressing  real p...
Questions?        Tom Wilkie         @tom_wilkie       tom@acunu.com   http://bitbucket.org/acunu    http://github.com/acu...
References[LSM] The Log-Structured Merge-Tree (LSM-Tree)Patrick ONeil, Edward Cheng, Dieter Gawlick,Elizabeth ONeil       ...
In the brain of Tom Wilkie
Upcoming SlideShare
Loading in …5
×

In the brain of Tom Wilkie

4,730 views

Published on

A repeat of my oscon talk, with some more details. Check out the video at http://skillsmatter.com/podcast/nosql/castle-big-data

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
4,730
On SlideShare
0
From Embeds
0
Number of Embeds
67
Actions
Shares
0
Downloads
41
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

In the brain of Tom Wilkie

  1. 1. Castle: ReinventingStorage for Big Data Tom WilkieFounder & VP Engineering @tom_wilkie
  2. 2. Before the Flood 1990 Small databases BTree indexes BTree File systems RAID Old hardware
  3. 3. Two Revolutions 2010 Distributed, shared-nothing databasesWrite-optimised indexes Write-optimised indexesBTree file systems BTree file systems RAID ... RAID New hardware New hardware
  4. 4. Bridging the Gap 2011 Distributed, shared-nothing databases Castle Castle ...New hardware New hardware
  5. 5. SH OT S*SN AP * And clones!
  6. 6. What’s in the Castle?
  7. 7. Shared memory interface keys Userspace Acunu Kernel values In-kernel async, shared memory ring workloads interface shared buffersuserspace Streaming interface range key buffered key buffered queries insert value insert get value get interfacekernelspace Doubling Arrays insert Bloom filters queues key get x arrays range arrays queries managementmapping layer keydoubling array insert merges Arrays key Version tree insert btree key get btree rangemodlist btreemapping layer queries value arrays Cache "Extent" layer extent block extent cache freespace allocator prefetcher manager & mapper cacheing layer flusherblock mapping & page cache Linux Kernel Block layer Memory manager MM layerslinuxs block &
  8. 8. Shared memory interface Castle keys Userspace Acunu Kerneluserspace interface values In-kernel async, shared memory ring workloads shared bufferskernelspace Streaming interface interface range key buffered key buffered queries insert value insert get value get Doubling Arrays • Opensource (GPLv2, MITdoubling arraymapping layer for user libraries) insert Bloom filters queues key get arrays x range arrays queries management key • insert merges http://bitbucket.org/acunu Arraysmapping layer •modlist btree key Version tree Loadable Kernel Module, insert btree key get btree targeting CentOS’s 2.6.18 range queries value arrays • Cacheblock mapping & http://www.acunu.com/ cacheing layer "Extent" layer prefetcher extent block extent cache blogs/andy-twigg/why- freespace allocator manager flusher & mapper page cache acunu-kernel/linuxs block & Linux Kernel MM layers Block layer Memory manager
  9. 9. The Interface Shared memory interface keys Userspace Acunu Kerneluserspace interface values In-kernel async, shared memory ring workloads shared bufferskernelspace Streaming interface interface range key buffered key buffered queries insert value insert get value get Doubling Arraysubling arrayapping layer insert Bloom filters queues key get arrays x range queries castle_{back,objects}.c arrays management
  10. 10. The Interface v0 v1 v3v12 v13 v15 v16 v24
  11. 11. interfaceuserspac values In-kernel async, shared memory ring workloads shared bufferskernelspace interface Doubling Array Streaming interface range key buffered key buffered queries insert value insert get value get Doubling Arraysdoubling arraymapping layer insert Bloom filters queues key get arrays x range arrays queries management key insert merges Arraysmapping layermodlist btree key Version tree insert btree key get btree range queries value arrays castle_{da,bloom}.c
  12. 12. B-TreeB • If node is full, split and insert new node into parent (recurse) logB N • For random inserts, nodes placed randomly on disk
  13. 13. Range Query Update (Size Z) O(logB N) O(Z/B)B-Tree random IOs random IOsB = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
  14. 14. Doubling Array Inserts 2 2 9 9 Buffer arrays in memoryuntil we have > B of them
  15. 15. Doubling Array Inserts11 2 9 2 8 9 11 8 8 11 etc...Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
  16. 16. Demohttps://acunu-videos.s3.amazonaws.com/dajs.html
  17. 17. Range Query Update (Size Z) O(logB N) O(Z/B) B-Tree random IOs random IOs O((log N)/B)Doubling Array sequential IOs B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
  18. 18. Doubling Array Queries query(k)• Add an index to each array to do lookups• query(k) searches each array independently
  19. 19. Doubling Array Queries query(k)• Bloom Filters can help exclude arrays from search• ... but don’t help with range queries
  20. 20. 8KB @ 100MB/s, w/ 8ms seek 100 / 5 = 100 IOs/s = 20 updates/s~ log (2^30)/log 100= 5 IOs/update Range Query Update (Size Z) O(logB N) O(Z/B) B-Tree random IOs random IOs O((log N)/B) O(Z/B) Doubling Array sequential IOs sequential IOs ~ log (2^30)/100 8KB @ 100MB/s 13k / 0.2= 0.2 IOs/update = 13k IOs/s = 65k updates/s B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
  21. 21. interfaceuserspac values In-kernel async, shared memory ring workloads shared bufferskernelspace interface Doubling Array Streaming interface range key buffered key buffered queries insert value insert get value get Doubling Arraysdoubling arraymapping layer insert Bloom filters queues key get arrays x range arrays queries management key insert merges Arraysmapping layermodlist btree key Version tree insert btree key get btree range queries value arrays castle_{da,bloom}.c
  22. 22. Doubling Arraysdoubling arraymapping layer “Mod-list” B-Tree insert Bloom filters queues key get arrays x range arrays queries management key insert merges Arraysmapping layermodlist btree key Version tree insert btree key get btree range queries value arrays Cacheblock mapping & cacheing layer "Extent" layer prefetcher extent block extent cache freespace allocator manager flusher & mapper page cache castle_{btree,versions}.c&
  23. 23. Copy-on-Write BTree Idea: • Apply path-copying [DSST] to the B-tree Problems: • Space blowup: Each update may rewrite an entire path • Slow updates: as aboveA log file system makes updates sequential, but relies onrandom access and garbage collection (achilles heel!)
  24. 24. Range Update Space QueryCoW B- O(logB Nv) O(Z/B) O(N B logB Nv) Tree random IOs random IOs Nv = #keys live (accessible) at version v
  25. 25. “BigTable” snapshotsv1 • Inserts produce arrays1 a 1 b
  26. 26. “BigTable” snapshotsv1 v2 • Inserts produce arrays21 a 2 1 b 1 c • Snapshots increment ref counts on arrays • Merges product more arrays, decrement ref count on old arrays
  27. 27. “BigTable” snapshotsv1 v2 • Inserts produce arrays1 a 1 b • Snapshots increment ref counts on arrays • Merges product more1 a b c arrays, decrement ref count on old arrays
  28. 28. “BigTable” snapshotsv1 v2 • Inserts produce arrays1 a 1 b • Snapshots increment ref counts on arrays • Merges product more1 a b c arrays, decrement ref count on old arrays • Space blowup
  29. 29. Range Update Space Query CoW B- O(logB Nv) O(Z/B) O(N B logB Nv) Tree random IOs random IOs“BigTable” O((log N)/B) O(Z/B) O(VN) style DA sequential IOs sequential IOs Nv = #keys live (accessible) at version v
  30. 30. “Mod-list” BTree Idea: • Apply fat-nodes [DSST] to the B-tree • ie insert (key, version, value) tuples, with special operations Problems: • Similar performance to a BTreeIf you limit the #versions, can be constructedsequentially, and embedded into a DA
  31. 31. Range Update Space Query CoW B- O(logB Nv) O(Z/B) O(N B logB Nv) Tree random IOs random IOs“BigTable” O((log N)/B) O(Z/B) O(VN) LevelDB style DA sequential IOs sequential IOs“Mod-list” O((log N)/B) O(Z/B)CASTLE in a DA sequential IOs sequential IOs O(N) Nv = #keys live (accessible) at version v
  32. 32. Stratified BTree v1 v2 v2 v1 v2 v1 v0 v1 v0 v1 v0 v1 Problem: newer older Embedded “Mod- merge (duplicates removed) list” #versions limit Solution: k1 k2 k3 k4 k5 v1 v0 v2 v1 v0 v2 v1 v0 v2 v1 Version-split arrays v-split during merges k1 k2 k3 k4 k5 v0 entries here are {v2} v0 v2 v2 v0 v2 v0 duplicates k1 k2 k4 k5 v1 v2{v1,v0} v1 v0 v1 v0 v1 v0 v1
  33. 33. Doubling Arraysdoubling arraymapping layer “Mod-list” B-Tree insert Bloom filters queues key get arrays x range arrays queries management key insert merges Arraysmapping layermodlist btree key Version tree insert btree key get btree range queries value arrays Cacheblock mapping & cacheing layer "Extent" layer prefetcher extent block extent cache freespace allocator manager flusher & mapper page cache castle_{btree,versions}.c&
  34. 34. Arraysmapping layermodlist btree key Version tree insert btree Disk Layout: RDA key get btree range queries value arrays Cacheblock mapping & cacheing layer "Extent" layer prefetcher extent block extent cache freespace allocator manager flusher & mapper page cachelinuxs block & Linux Kernel MM layers Block layer Memory manager castle_{cache,extent,freespace,rebuild}.c
  35. 35. Disk Layout: RDA random duplicate allocation 4 2 1 4 5 2 5 3 1 3 7 10 7 6 8 9 9 10 6 8 15 12 14 11 13 14 11 12 13 15 16 16
  36. 36. PerformanceComparison
  37. 37. Small random inserts Inserting 3 billion rows Acunu powered Cassandra - ‘standard’ Cassandra -
  38. 38. Insert latencyWhile inserting 3 billion rows Acunu powered Cassandra x ‘standard’ Cassandra +
  39. 39. Small random range queriesPerformed immediately after inserts Acunu powered Cassandra - ‘standard’ Cassandra -
  40. 40. Performance summary Standard Acunu Benefitsinserts rate ~32k/s ~45k/s >1.4x95% latency ~32s ~0.3s >100x gets rate ~100/s ~350/s >3.5x95% latency ~2s ~0.5s >4xrange queries ~0.4/s ~40/s >100x 95% latency ~15s ~2s >7.5x
  41. 41. Future
  42. 42. Memcache + Cassandraget/insert get/put Cass client memcached 100k random inserts/sec! Cassandra memcache Cassandra memcache Castle Castle ... H/W H/W
  43. 43. v1 v1 v1 v1v12 v13 v15 v12 v13 v15 v12 v13 v15 v12 v13 v15 v16 v24 v16 v24 v16 v24 v16 v24
  44. 44. • Castle: like BDB, but for Big Data• DA: transforms random IO into sequential IO• Snapshots & Clones: addressing real problems with new workloads• 2 orders of magnitude better performance and predictability
  45. 45. Questions? Tom Wilkie @tom_wilkie tom@acunu.com http://bitbucket.org/acunu http://github.com/acunuhttp://www.acunu.com/download http://www.acunu.com/insights
  46. 46. References[LSM] The Log-Structured Merge-Tree (LSM-Tree)Patrick ONeil, Edward Cheng, Dieter Gawlick,Elizabeth ONeil Stratified B-trees and versioned dictionaries, - Andy http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, %20Log-Structured%20Merge-Tree%20%28LSM- John Wilkes, Tom Wilkie, HotStorage’11 Tree%29.pdf http://www.usenix.org/event/hotstorage11/tech/ final_files/Twigg.pdf[COLA] Cache-Oblivious Streaming B-trees,Michael A. Bender et al [RDA] Random duplicate storage strategies for http://www.cs.sunysb.edu/~bender/newpub/ load balancing in multimedia servers, 2000, Joep BenderFaFi07.pdf Aerts and Jan Korst and Sebastian Egner http://www.win.tue.nl/~joep/IPL.ps[DSST] Making Data Structures Persistent - J. R.Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Making Apache, Apache Cassandra, Cassandra, Hadoop, andData Structures Persistent, Journal of Computer the eye and elephant logos are trademarks of theand System Sciences,Vol. 38, No. 1, 1989 Apache Software Foundation. http://www.cs.cmu.edu/~sleator/papers/making- data-structures-persistent.pdf

×