In the brain of Tom Wilkie
A repeat of my oscon talk, with some more details. Check out the video at http://skillsmatter.com/podcast/nosql/castle-big-data


Usage Rights

© All Rights Reserved

In the brain of Tom Wilkie Presentation Transcript

  • 1. Castle: Reinventing Storage for Big Data. Tom Wilkie, Founder & VP Engineering, @tom_wilkie
  • 2. Before the Flood (1990): small databases; B-tree indexes; B-tree file systems; RAID; old hardware
  • 3. Two Revolutions (2010): distributed, shared-nothing databases; write-optimised indexes; B-tree file systems; RAID; new hardware
  • 4. Bridging the Gap (2011): distributed, shared-nothing databases; Castle; new hardware
  • 5. *SNAPSHOTS* And clones!
  • 6. What’s in the Castle?
  • 7. [Architecture diagram: userspace shared-memory and streaming interfaces (keys/values over an async shared-memory ring with buffered inserts and gets) on top of the in-kernel layers: Doubling Arrays (Bloom filters, insert queues, array management, merges), Arrays (mod-list B-tree, version tree), the Cache/"Extent" layer (extent allocator, freespace manager, block mapper, page cache, flusher, prefetcher), and the Linux kernel block and memory-management layers]
  • 8. Castle [same architecture diagram]: Open source (GPLv2; MIT for user libraries), http://bitbucket.org/acunu. Loadable kernel module, targeting CentOS’s 2.6.18: http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
  • 9. The Interface [architecture diagram with the userspace shared-memory and streaming interface layers highlighted]: castle_{back,objects}.c
  • 10. The Interface [version tree diagram: v0, v1, v3, v12, v13, v15, v16, v24]
  • 11. Doubling Array [architecture diagram with the Doubling Arrays layer highlighted]: castle_{da,bloom}.c
  • 12. B-Tree: if a node is full, split it and insert the new node into the parent (recurse); height is logB N. For random inserts, nodes are placed randomly on disk.
  • 13. B-Tree: update costs O(logB N) random IOs; a range query of size Z costs O(Z/B) random IOs. B = “block size”, say 8KB; at 100 bytes/entry, ~= 100 entries.
  • 14. Doubling Array inserts: buffer arrays in memory until we have > B of them.
  • 15. Doubling Array inserts: equal-sized sorted arrays are merged into arrays of twice the size (e.g. [2, 9] and [8, 11] merge into [2, 8, 9, 11]), etc. Similar to log-structured merge trees (LSM), the cache-oblivious lookahead array (COLA), ...
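The merge cascade from slides 14–15 can be sketched in a few lines of Python (an illustrative toy, not Castle's kernel implementation): equal-sized sorted arrays merge upward, like carries in binary addition.

```python
# Toy doubling-array insert: arrays[i] holds None or a sorted array of
# 2**i entries; inserting a key merges equal-sized arrays upward.
import heapq

def da_insert(arrays, key):
    carry = [key]  # a sorted array of size 1
    level = 0
    while level < len(arrays) and arrays[level] is not None:
        # Merge two equal-sized sorted arrays into one of twice the size.
        carry = list(heapq.merge(arrays[level], carry))
        arrays[level] = None
        level += 1
    if level == len(arrays):
        arrays.append(None)
    arrays[level] = carry

arrays = []
for k in [2, 9, 8, 11]:
    da_insert(arrays, k)
# After four inserts, all entries live in one sorted array of size 4.
```

Every entry is only ever written in bulk, sequential merges, which is where the IO savings over a B-tree come from.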
  • 16. Demo: https://acunu-videos.s3.amazonaws.com/dajs.html
  • 17. B-Tree: update O(logB N) random IOs; range query (size Z) O(Z/B) random IOs. Doubling Array: update O((log N)/B) sequential IOs. B = “block size”, say 8KB; at 100 bytes/entry, ~= 100 entries.
  • 18. Doubling Array queries: add an index to each array to do lookups; query(k) searches each array independently.
  • 19. Doubling Array queries: Bloom filters can help exclude arrays from the search, but don’t help with range queries.
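The lookup flow on slides 18–19 might be sketched like this (a toy model with invented names: the "Bloom filter" here is just a set of hashes to show the control flow; real Bloom filters are compact bit arrays with several hash functions).

```python
# Toy query(k) over a stack of sorted (key, value) arrays, newest first.
import bisect

def make_filter(array):
    # Toy stand-in for a Bloom filter: the set of hashed keys.
    return {hash(k) % 1024 for k, _ in array}

def query(arrays_with_filters, key):
    for array, bloom in arrays_with_filters:   # newest array first
        if hash(key) % 1024 not in bloom:
            continue                           # definitely absent: skip the lookup
        i = bisect.bisect_left(array, (key,))
        if i < len(array) and array[i][0] == key:
            return array[i][1]                 # newest occurrence wins
    return None

newer = [("a", 2)]
older = [("a", 1), ("b", 7)]
stack = [(newer, make_filter(newer)), (older, make_filter(older))]
```

Range queries get no such shortcut: a range can intersect any array, so every array must be scanned regardless of its filter.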
  • 20. B-Tree: ~ log(2^30)/log 100 = 5 IOs/update; 8KB @ 100MB/s w/ 8ms seek = 100 IOs/s, so 100/5 = 20 updates/s. Doubling Array: ~ log(2^30)/100 = 0.2 IOs/update; 8KB @ 100MB/s sequential = 13k IOs/s, so 13k/0.2 = 65k updates/s. (Update: B-Tree O(logB N) random IOs vs Doubling Array O((log N)/B) sequential IOs; range query of size Z: O(Z/B) random vs sequential IOs. B = “block size”, say 8KB; at 100 bytes/entry, ~= 100 entries.)
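The slide's back-of-envelope arithmetic can be checked directly with its own round figures (the exact quotients come out at 12,800 IOs/s and 64,000 updates/s, which the slide rounds to 13k and 65k):

```python
# B-tree: random IOs, so the disk is seek-bound.
random_ios_per_sec = 100                    # 8ms seek-dominated, per the slide
btree_ios_per_update = 5                    # ~ log(2^30) / log(100)
btree_updates_per_sec = random_ios_per_sec // btree_ios_per_update   # 20

# Doubling array: sequential IOs, so the disk is throughput-bound.
seq_ios_per_sec = (100 * 1024 * 1024) // 8192    # 8KB IOs at 100MB/s -> 12800
da_ios_per_update = 0.2                          # ~ log(2^30) / 100
da_updates_per_sec = round(seq_ios_per_sec / da_ios_per_update)      # 64000
```

The two orders of magnitude in the summary slide fall out of this one change: paying throughput instead of seeks for every update.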
  • 21. Doubling Array [architecture diagram with the Doubling Arrays layer highlighted]: castle_{da,bloom}.c
  • 22. “Mod-list” B-Tree [architecture diagram with the Arrays layer (mod-list B-tree, version tree) highlighted]: castle_{btree,versions}.c
  • 23. Copy-on-Write B-Tree. Idea: apply path copying [DSST] to the B-tree. Problems: space blowup (each update may rewrite an entire path); slow updates (as above). A log file system makes updates sequential, but relies on random access and garbage collection (its Achilles heel!).
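Path copying [DSST] is easiest to see on a plain binary search tree (a B-tree behaves the same way, just with wider nodes); this is an illustrative sketch, not the CoW B-tree's actual on-disk layout. An update copies only the nodes on the root-to-leaf path and shares every untouched subtree with the old version.

```python
# Persistent BST via path copying: insert returns a new root; old roots
# remain valid, immutable versions.
class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    if root is None:
        return Node(key, value)
    if key < root.key:   # copy this node, recurse left, share right subtree
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:   # copy this node, share left subtree, recurse right
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)

v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
v2 = insert(v1, 1, "d")   # new version; v1 is untouched
# v2 copied the path 5 -> 2; the subtree rooted at 8 is shared with v1.
```

The slide's complaint is visible here: every update rewrites a whole path of nodes, and on disk those copies land in random places.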
  • 24. CoW B-Tree: update O(logB Nv) random IOs; range query (size Z) O(Z/B) random IOs; space O(N B logB Nv). Nv = #keys live (accessible) at version v.
  • 25. “BigTable” snapshots: inserts produce arrays (v1: arrays a and b, ref count 1 each).
  • 26. “BigTable” snapshots: a snapshot (v2) increments the ref counts on the arrays; merges produce more arrays and decrement the ref counts on the old arrays.
  • 27. “BigTable” snapshots: after a merge, the new merged array (a+b+c) coexists with the old arrays still referenced by other versions.
  • 28. “BigTable” snapshots: the old and merged arrays coexist, so this approach suffers a space blowup.
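The ref-counting scheme on slides 25–28 can be sketched with invented names (an illustrative toy, not Castle's or BigTable's code): versions share immutable arrays, a snapshot bumps each array's count, and a merge swaps a version's arrays for one merged array.

```python
# Shared, immutable arrays tracked by reference counts.
refcount = {}

def snapshot(version_arrays):
    # A new version shares the parent's arrays: bump each count.
    for a in version_arrays:
        refcount[a] = refcount.get(a, 0) + 1
    return list(version_arrays)

def merge(version_arrays, merged_name):
    # Replace this version's arrays with one merged array.
    for a in version_arrays:
        refcount[a] -= 1
        if refcount[a] == 0:
            del refcount[a]          # no version references it: free it
    refcount[merged_name] = refcount.get(merged_name, 0) + 1
    return [merged_name]

v1 = snapshot(["a", "b"])    # refcount: a=1, b=1
v2 = snapshot(v1)            # refcount: a=2, b=2 (arrays shared, not copied)
v2 = merge(v2, "c")          # refcount: a=1, b=1, c=1 -- a and b live on via v1
```

The space blowup is visible in the last line: v1 still pins the pre-merge arrays, so merged and unmerged copies of the same data coexist.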
  • 29. CoW B-Tree: update O(logB Nv) random IOs; range query O(Z/B) random IOs; space O(N B logB Nv). “BigTable”-style DA: update O((log N)/B) sequential IOs; range query O(Z/B) sequential IOs; space O(VN). Nv = #keys live (accessible) at version v.
  • 30. “Mod-list” B-Tree. Idea: apply fat nodes [DSST] to the B-tree, i.e. insert (key, version, value) tuples, with special operations. Problems: similar performance to a B-Tree. If you limit the #versions, it can be constructed sequentially and embedded into a DA.
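The fat-node idea can be sketched with flat (key, version, value) tuples; versions are simplified to linear integers here, which sidesteps Castle's real version tree, and the function name is invented. A lookup at version v returns the value written by the latest version <= v.

```python
# Versioned lookup over (key, version, value) tuples sorted by (key, version).
import bisect

def modlist_get(entries, key, v):
    # First entry strictly after (key, v) is at or past (key, v + 1);
    # the entry just before it is the newest write with version <= v.
    i = bisect.bisect_left(entries, (key, v + 1))
    if i > 0 and entries[i - 1][0] == key:
        return entries[i - 1][2]
    return None

entries = [("k1", 1, "x"), ("k1", 3, "y"), ("k2", 2, "z")]
```

Because the tuples are just sorted entries, a bounded-#versions mod-list tree can be built in one sequential pass, which is what lets it be embedded into a DA merge.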
  • 31. CoW B-Tree: update O(logB Nv) random IOs; range query O(Z/B) random IOs; space O(N B logB Nv). “BigTable”-style DA (LevelDB): update O((log N)/B) sequential IOs; range query O(Z/B) sequential IOs; space O(VN). “Mod-list” in a DA (CASTLE): update O((log N)/B) sequential IOs; range query O(Z/B) sequential IOs; space O(N). Nv = #keys live (accessible) at version v.
  • 32. Stratified B-Tree. Problem: the embedded “mod-list” puts a limit on the #versions, and merges (which remove duplicates) must respect it. Solution: version-split arrays during merges, e.g. an array holding versions {v1, v0} is v-split into an array for {v2} (in which the v0 entries are duplicates) and an array for {v1, v0}. [diagram: keys k1–k5 across versions v0, v1, v2, newer arrays above older ones]
  • 33. “Mod-list” B-Tree [architecture diagram with the Arrays layer (mod-list B-tree, version tree) highlighted]: castle_{btree,versions}.c
  • 34. Disk Layout: RDA [architecture diagram with the Cache/"Extent" layer and the Linux kernel block and memory-management layers highlighted]: castle_{cache,extent,freespace,rebuild}.c
  • 35. Disk Layout: RDA (random duplicate allocation) [diagram: blocks 1–16, each written twice to two different randomly chosen disks]
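Random duplicate allocation [RDA] can be sketched as follows (an illustrative placement function with invented names, not Castle's allocator): each logical block goes to a fixed number of copies on distinct, randomly chosen disks, so reads can pick the less-loaded replica and a single disk failure leaves every block with a surviving copy.

```python
# Place each block on `copies` distinct disks chosen uniformly at random.
import random

def rda_place(num_blocks, disks, copies=2):
    # random.sample guarantees the chosen disks are distinct.
    return {block: random.sample(disks, copies) for block in range(num_blocks)}

layout = rda_place(16, disks=[0, 1, 2, 3], copies=2)
```

With two random copies per block, read scheduling becomes a load-balancing choice between replicas, which is the property the [RDA] paper analyses.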
  • 36. Performance Comparison
  • 37. Small random inserts, inserting 3 billion rows [graph: Acunu-powered Cassandra vs ‘standard’ Cassandra]
  • 38. Insert latency while inserting 3 billion rows [graph: Acunu-powered Cassandra (x) vs ‘standard’ Cassandra (+)]
  • 39. Small random range queries, performed immediately after the inserts [graph: Acunu-powered Cassandra vs ‘standard’ Cassandra]
  • 40. Performance summary (standard vs Acunu): insert rate ~32k/s vs ~45k/s (>1.4x); insert 95% latency ~32s vs ~0.3s (>100x); get rate ~100/s vs ~350/s (>3.5x); get 95% latency ~2s vs ~0.5s (>4x); range query rate ~0.4/s vs ~40/s (>100x); range query 95% latency ~15s vs ~2s (>7.5x)
  • 41. Future
  • 42. Memcache + Cassandra [diagram: clients issue get/insert to Cassandra and get/put to memcached, both running on Castle on shared hardware]: 100k random inserts/sec!
  • 43. [version tree diagram, repeated: v1 with descendants v12, v13, v15, v16, v24]
  • 44. Castle: like BDB, but for Big Data. DA: transforms random IO into sequential IO. Snapshots & clones: addressing real problems with new workloads. Two orders of magnitude better performance and predictability.
  • 45. Questions? Tom Wilkie, @tom_wilkie, tom@acunu.com. http://bitbucket.org/acunu http://github.com/acunu http://www.acunu.com/download http://www.acunu.com/insights
  • 46. References
    [LSM] Patrick O’Neil, Edward Cheng, Dieter Gawlick, Elizabeth O’Neil. The Log-Structured Merge-Tree (LSM-Tree). http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
    [COLA] Michael A. Bender et al. Cache-Oblivious Streaming B-trees. http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf
    [DSST] J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan. Making Data Structures Persistent. Journal of Computer and System Sciences, Vol. 38, No. 1, 1989. http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf
    [RDA] Joep Aerts, Jan Korst, Sebastian Egner. Random duplicate storage strategies for load balancing in multimedia servers, 2000. http://www.win.tue.nl/~joep/IPL.ps
    Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie. Stratified B-trees and versioned dictionaries. HotStorage ’11. http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf
    Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.