Castle: Reinventing
Storage for Big Data

      Tom Wilkie
Founder & VP Engineering
Before the Flood
         1990

     Small databases
      BTree indexes
    BTree File systems
          RAID
      Old hardware
Two Revolutions
2010
Distributed, shared-nothing databases
[Per-node stack, repeated across nodes]
Write-optimised indexes
BTree file systems
RAID
New hardware
Bridging the Gap
2011
Distributed, shared-nothing databases
[Per-node stack, repeated across nodes]
Castle
New hardware
With Big Data, how do I...
Snapshots
What’s in the Castle?
[Architecture diagram, layers from top to bottom]
userspace interface: Shared memory interface (keys, values; async, shared memory ring; shared buffers); In-kernel workloads
kernelspace interface: Streaming interface (range queries, key insert, buffered value insert, key get, buffered value get)
doubling array mapping layer: Doubling Arrays (insert queues, key insert, key get, arrays range queries, arrays management, merges, Bloom filters)
modlist btree mapping layer: Arrays (key insert, key get, btree, btree range queries, value arrays, Version tree)
block mapping & caching layer: "Extent" layer (freespace manager, extent allocator & mapper); Cache (extent block cache, page cache, prefetcher, flusher)
linux's block & MM layers: Block layer, Memory manager
Region labels: Userspace, Acunu Kernel, Linux Kernel
Castle
[Same architecture diagram as the previous slide]
• Open source (GPLv2, MIT for user libraries)
• http://bitbucket.org/acunu
• Loadable Kernel Module, targeting CentOS’s 2.6.18
• http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
The Interface
[Architecture diagram: userspace and kernelspace interface layers. Shared memory interface (keys, values; async, shared memory ring; shared buffers), in-kernel workloads, streaming interface (range queries, key insert, buffered value insert, key get, buffered value get)]
castle_{back,objects}.c
Doubling Array
[Architecture diagram: doubling array mapping layer and modlist btree mapping layer]
castle_{da,bloom}.c
                 Update          Range Query (Size Z)
B-Tree           O(log_B N)      O(Z/B)
                 random IOs      random IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
Doubling Array
Inserts
[Diagram: 2 and 9 are inserted as single-entry arrays, then merged into the array 2 9]
Buffer arrays in memory until we have > B of them
Doubling Array
Inserts
[Diagram: 11 and 8 are inserted and merged into the array 8 11, which is then merged with 2 9 into the array 2 8 9 11, etc...]
Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
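
The insert sequence on these slides (2, 9, 11, 8) can be replayed with a toy in-memory version. This is only a sketch, assuming that level i holds either nothing or one sorted array of 2**i keys; the class name and layout are illustrative and not taken from castle_da.c, and buffering, disk IO and duplicate keys are ignored.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: level i is None or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        carry = [key]
        i = 0
        # Works like incrementing a binary counter: two arrays of equal
        # size are merged into one array of twice the size, one level up.
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            carry = list(merge(self.levels[i], carry))  # sequential merge
            self.levels[i] = None
            i += 1


da = DoublingArray()
for k in (2, 9, 11, 8):
    da.insert(k)
print(da.levels)  # [None, None, [2, 8, 9, 11]]
```

Every merge reads and writes whole sorted arrays, which is why the IO pattern is sequential rather than random.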
Demo
https://acunu-videos.s3.amazonaws.com/dajs.html
                 Update            Range Query (Size Z)
B-Tree           O(log_B N)        O(Z/B)
                 random IOs        random IOs
Doubling Array   O((log N)/B)
                 sequential IOs

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
Doubling Array
                 Queries




              query(k)

• Add an index to each array to do lookups
• query(k) searches each array independently
Doubling Array
                   Queries




                query(k)

• Bloom Filters can help exclude arrays from
  search
• ... but don’t help with range queries
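
A minimal sketch of the two query slides, assuming each array is a sorted Python list (so binary search stands in for the per-array index) and arrays are ordered newest first. The `Bloom` class, its sizes and the function names are illustrative assumptions, not Castle's implementation.

```python
import bisect
import hashlib


class Bloom:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, keys, nbits=1024, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0
        for k in keys:
            for pos in self._positions(k):
                self.bits |= 1 << pos

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def query(arrays, blooms, key):
    """Point lookup: returns (found, number of arrays actually searched)."""
    searched = 0
    for arr, bloom in zip(arrays, blooms):
        if not bloom.might_contain(key):
            continue  # definitely not in this array, skip the lookup
        searched += 1
        i = bisect.bisect_left(arr, key)
        if i < len(arr) and arr[i] == key:
            return True, searched
    return False, searched


def range_query(arrays, lo, hi):
    """Bloom filters cannot help here: every array must be consulted."""
    out = []
    for arr in arrays:
        out.extend(arr[bisect.bisect_left(arr, lo):bisect.bisect_right(arr, hi)])
    return sorted(out)


arrays = [[8, 11], [2, 9]]
blooms = [Bloom(a) for a in arrays]
print(query(arrays, blooms, 9))
print(range_query(arrays, 3, 10))
```

A Bloom filter answers "is this exact key possibly here?", so it can prune point lookups but says nothing about which keys fall inside a range.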
                 Update            Range Query (Size Z)
B-Tree           O(log_B N)        O(Z/B)
                 random IOs        random IOs
Doubling Array   O((log N)/B)      O(Z/B)
                 sequential IOs    sequential IOs

Worked example for updates:
B-Tree:          ~ log (2^30)/log 100 = 5 IOs/update
                 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s
                 100 / 5 = 20 updates/s
Doubling Array:  ~ log (2^30)/100 = 0.2 IOs/update
                 8KB @ 100MB/s = 13k IOs/s
                 13k / 0.2 = 65k updates/s

B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries
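
The back-of-envelope numbers on this slide can be recomputed as below. This is a sketch under stated assumptions: natural logarithms, binary units for KB and MB, and a simple seek-plus-transfer model for random IO. The unrounded results differ a little from the slide's rounded figures, but the gap between the two structures stays at more than three orders of magnitude.

```python
import math

N = 2 ** 30                     # number of keys, from the slide's log(2^30)
B = 100                         # entries per block, the slide's figure
BLOCK_BYTES = 8 * 1024          # 8KB block
BANDWIDTH = 100 * 1024 * 1024   # 100MB/s (binary units are an assumption)
SEEK_SECONDS = 0.008            # 8ms seek

# IOs per update: O(log_B N) for the B-tree, O((log N)/B) for the doubling array.
btree_ios_per_update = math.log(N) / math.log(B)   # ~4.5, slide rounds to 5
da_ios_per_update = math.log(N) / B                # ~0.21, slide rounds to 0.2

# IOs per second: random IO pays a seek per block, sequential IO does not.
random_iops = 1 / (SEEK_SECONDS + BLOCK_BYTES / BANDWIDTH)  # slide uses ~100
sequential_iops = BANDWIDTH / BLOCK_BYTES                   # slide uses ~13k

btree_updates_per_s = random_iops / btree_ios_per_update
da_updates_per_s = sequential_iops / da_ios_per_update

print(f"B-tree:         {btree_updates_per_s:,.0f} updates/s")
print(f"Doubling array: {da_updates_per_s:,.0f} updates/s")
```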
“Mod-list” B-Tree
[Architecture diagram: doubling array mapping layer, modlist btree mapping layer, and block mapping & cacheing layer]
castle_{btree,versions}.c
Copy-on-Write BTree
Idea:
• Apply path-copying [DSST] to the B-tree
Problems:
• Space blowup: each update may rewrite an entire path
• Slow updates: as above
A log file system makes updates sequential, but relies on random access and garbage collection (Achilles heel!)
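
Path-copying can be sketched on a plain binary search tree, used here instead of a B-tree purely for brevity; `Node`, `insert` and `get` are illustrative names. An update copies every node on the path from the root to the changed key and shares the rest, which is where both the space blowup and the extra write cost come from.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return a new root; only nodes on the root-to-key path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
v2 = insert(v1, 8, "C")  # new version; v1 stays readable

print(get(v1, 8), get(v2, 8))  # c C
print(v2.left is v1.left)      # True: the untouched subtree is shared
```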
              Update          Range Query     Space
CoW B-Tree    O(log_B Nv)     O(Z/B)          O(N B log_B Nv)
              random IOs      random IOs

Nv = #keys live (accessible) at version v
“BigTable” snapshots
[Diagram, built up over four slides: versions v1 and v2 pointing at arrays a, b and c, each array labelled with its reference count]
• Inserts produce arrays
• Snapshots increment ref counts on arrays
• Merges produce more arrays, decrement ref count on old arrays
• Space blowup
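
The reference-counting scheme in these bullets can be sketched as follows. The `Store` class and its methods are hypothetical names for illustration; arrays are immutable sorted lists, and an array is freed only when no version references it.

```python
class Store:
    """Toy snapshot store: versions share immutable, ref-counted arrays."""

    def __init__(self):
        self.arrays = {}     # array id -> sorted list of keys
        self.refcount = {}   # array id -> number of versions referencing it
        self.versions = {}   # version name -> list of array ids
        self._next_id = 0

    def _new_array(self, keys):
        aid = self._next_id
        self._next_id += 1
        self.arrays[aid] = sorted(keys)
        self.refcount[aid] = 1
        return aid

    def _unref(self, aid):
        self.refcount[aid] -= 1
        if self.refcount[aid] == 0:   # no version needs it any more
            del self.refcount[aid]
            del self.arrays[aid]

    def insert(self, version, keys):
        # Inserts produce arrays.
        self.versions.setdefault(version, []).append(self._new_array(keys))

    def snapshot(self, src, dst):
        # Snapshots increment ref counts on arrays.
        self.versions[dst] = list(self.versions[src])
        for aid in self.versions[dst]:
            self.refcount[aid] += 1

    def merge(self, version):
        # Merges produce more arrays and decrement ref counts on old ones.
        old = self.versions[version]
        keys = {k for aid in old for k in self.arrays[aid]}
        self.versions[version] = [self._new_array(keys)]
        for aid in old:
            self._unref(aid)

    def stored_keys(self):
        return sum(len(a) for a in self.arrays.values())


s = Store()
s.insert("v1", ["a"])
s.insert("v1", ["b"])
s.snapshot("v1", "v2")
s.insert("v2", ["c"])
s.merge("v2")
print(s.stored_keys())  # 5 entries stored for 3 distinct keys
```

After the merge, v1 still pins the old arrays for a and b while v2 holds a merged copy of a, b and c, so the same keys are stored twice. That duplication is the space blowup.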
              Update            Range Query       Space
CoW B-Tree    O(log_B Nv)       O(Z/B)            O(N B log_B Nv)
              random IOs        random IOs
“BigTable”    O((log N)/B)      O(Z/B)            O(VN)
style DA      sequential IOs    sequential IOs

Nv = #keys live (accessible) at version v
“Mod-list” BTree
Idea:
• Apply fat-nodes [DSST] to the B-tree
• i.e. insert (key, version, value) tuples, with special operations
Problems:
• Similar performance to a BTree
If you limit the #versions, it can be constructed sequentially and embedded into a DA
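
A minimal sketch of a versioned lookup over (key, version, value) tuples. The version tree, the sample entries and the `get` function are hypothetical; a sorted list stands in for the B-tree, and a read at version v returns the value written by the nearest ancestor of v that wrote the key.

```python
import bisect

# Hypothetical version tree: v1 and v2 are both children of v0.
parent = {"v0": None, "v1": "v0", "v2": "v0"}

# (key, version, value) tuples kept in sorted order.
entries = sorted([
    ("k1", "v0", "a"),
    ("k1", "v1", "b"),
    ("k2", "v0", "c"),
    ("k2", "v2", "d"),
])


def ancestors(version):
    """Yield version, then its parent, and so on up to the root."""
    while version is not None:
        yield version
        version = parent[version]


def get(key, version):
    """Value of key as seen from version, or None if never written."""
    start = bisect.bisect_left(entries, (key,))
    written = {}
    for k, ver, val in entries[start:]:
        if k != key:
            break
        written[ver] = val
    for v in ancestors(version):
        if v in written:
            return written[v]
    return None


print(get("k1", "v1"))  # b: written directly in v1
print(get("k1", "v2"))  # a: inherited from v0
print(get("k2", "v1"))  # c: v2's write is not visible from v1
```

Each write adds one tuple instead of copying a path, so space stays proportional to the number of writes.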
                       Update            Range Query       Space
CoW B-Tree             O(log_B Nv)       O(Z/B)            O(N B log_B Nv)
                       random IOs        random IOs
“BigTable” style DA    O((log N)/B)      O(Z/B)            O(VN)
(LevelDB)              sequential IOs    sequential IOs
“Mod-list” in a DA     O((log N)/B)      O(Z/B)            O(N)
(CASTLE)               sequential IOs    sequential IOs

Nv = #keys live (accessible) at version v
Stratified BTree
Problem:
Embedded “Mod-list” #versions limit
Solution:
Version-split arrays during merges
[Diagram: a newer array and an older array of (key, version) entries over keys k1 to k5 are merged (duplicates removed), then v-split into one array for {v2} and one for {v1,v0}. The v0 entries in the {v2} array are duplicates. Version tree: v0 with children v1 and v2.]
Disk Layout: RDA
[Architecture diagram: modlist btree mapping layer, block mapping & cacheing layer ("Extent" layer: freespace manager, extent allocator & mapper; Cache: extent block cache, page cache, prefetcher, flusher), and linux's block & MM layers (Block layer, Memory manager)]
castle_{cache,extent,freespace,rebuild}.c
Disk Layout: RDA
          random duplicate allocation

 4    2      1    4    5    2    5    3    1    3

 7    10     7    6    8    9    9    10   6    8

 15   12     14   11   13   14   11   12   13   15

                                 16        16
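
A rough sketch of random duplicate allocation as the grid above suggests it: each of 16 blocks is stored on two distinct, randomly chosen disks out of 10. The function names, the fixed seed and the greedy read scheduler are illustrative assumptions; neither Castle's allocator nor the [RDA] paper's exact strategy is reproduced here.

```python
import random


def rda_layout(num_blocks, num_disks, copies=2, seed=0):
    """Place each block on `copies` distinct disks chosen at random."""
    rng = random.Random(seed)
    disks = [[] for _ in range(num_disks)]
    placement = {}
    for block in range(1, num_blocks + 1):
        chosen = rng.sample(range(num_disks), copies)
        placement[block] = chosen
        for d in chosen:
            disks[d].append(block)
    return placement, disks


def read_plan(blocks, placement, num_disks):
    """Greedy load balancing: read each block from its less loaded replica."""
    load = [0] * num_disks
    plan = {}
    for b in blocks:
        d = min(placement[b], key=lambda disk: load[disk])
        plan[b] = d
        load[d] += 1
    return plan, load


placement, disks = rda_layout(num_blocks=16, num_disks=10)
plan, load = read_plan(range(1, 17), placement, num_disks=10)
print(disks)
print(load)
```

Because every block has two homes, losing any single disk leaves a copy of each block readable, and the reader can always pick the less busy replica.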
Performance
Comparison
Small random inserts
Inserting 3 billion rows
[Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]
Insert latency
While inserting 3 billion rows
[Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]
Small random range queries
Performed immediately after inserts
[Chart: Acunu powered Cassandra vs ‘standard’ Cassandra]
Performance summary
                 Standard   Acunu    Benefits
inserts rate     ~32k/s     ~45k/s   >1.4x
  95% latency    ~32s       ~0.3s    >100x
gets rate        ~100/s     ~350/s   >3.5x
  95% latency    ~2s        ~0.5s    >4x
range queries    ~0.4/s     ~40/s    >100x
  95% latency    ~15s       ~2s      >7.5x
• Castle: like BDB, but for Big Data
• DA: transforms random IO into
  sequential IO

• Snapshots & Clones: addressing
  real problems with new workloads

• 2 orders of magnitude better
  performance and predictability
Questions?
        Tom Wilkie
         @tom_wilkie
       tom@acunu.com

    http://bitbucket.org/acunu
http://www.acunu.com/download
 http://www.acunu.com/insights
References
[LSM] The Log-Structured Merge-Tree (LSM-Tree), Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil
http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
[COLA] Cache-Oblivious Streaming B-trees, Michael A. Bender et al
http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf
[DSST] Making Data Structures Persistent, J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989
http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf
Stratified B-trees and versioned dictionaries, Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, HotStorage’11
http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf
[RDA] Random duplicate storage strategies for load balancing in multimedia servers, Joep Aerts, Jan Korst, Sebastian Egner, 2000
http://www.win.tue.nl/~joep/IPL.ps

Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.

Castle: Reinventing Storage for Big Data

  • 1. Castle: Reinventing Storage for Big Data Tom Wilkie Founder & VP Engineering
  • 2. Before the Flood 1990 Small databases BTree indexes BTree File systems RAID Old hardware
  • 3. Two Revolutions 2010 Distributed, shared-nothing databases Write-optimised indexes Write-optimised indexes BTree file systems BTree file systems RAID ... RAID New hardware New hardware
  • 4. Bridging the Gap 2011 Distributed, shared-nothing databases Castle Castle ... New hardware New hardware
  • 5. With Big Data, how do I... SNAPSHOTS
  • 6. What’s in the Castle?
  • 7. [Architecture diagram: Userspace sits above the Acunu Kernel, which is drawn as a stack of layers. Userspace interface: shared memory interface (async, shared memory ring; shared buffers; keys and values), alongside in-kernel workloads. Kernelspace interface: streaming interface (range queries, key insert, buffered value insert, key get, buffered value get). Doubling array mapping layer: Doubling Arrays (insert queues, Bloom filters, key get, range queries, arrays management, merges). Modlist btree mapping layer: Arrays (key insert, key get, range queries, btree, version tree, value arrays). Block mapping & caching layer: Cache and "Extent" layer (components for caching, prefetching, flushing, extent allocation and mapping, and freespace management). Linux's block & MM layers: Linux Kernel (block layer, page cache, memory manager).]
  • 8. Castle [Architecture diagram, as on slide 7] • Open source (GPLv2, MIT for user libraries) • http://bitbucket.org/acunu • Loadable Kernel Module, targeting CentOS’s 2.6.18 • http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
  • 9. The Interface [Architecture diagram detail: shared memory interface and streaming interface] castle_{back,objects}.c
  • 10. Doubling Array [Architecture diagram detail: doubling array mapping layer] castle_{da,bloom}.c
  • 11. B = “block size”, say 8KB; at 100 bytes/entry ~= 100 entries
                          Update                         Range Query (Size Z)
        B-Tree            O(log_B N) random IOs          O(Z/B) random IOs
  • 12. Doubling Array Inserts [Diagram: keys 2 and 9 are inserted and form the sorted array 2, 9] Buffer arrays in memory until we have > B of them
  • 13. Doubling Array Inserts [Diagram: keys 11 and 8 are inserted and form the sorted array 8, 11, which is merged with 2, 9 into 2, 8, 9, 11; etc...] Similar to log-structured merge trees (LSM), cache-oblivious lookahead array (COLA), ...
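
The insert pattern on slides 12 and 13 can be sketched in a few lines of Python. This is a toy, in-memory illustration only, not Castle's code: the class name `DoublingArray` and the `levels` attribute are hypothetical, duplicate keys are not handled, and the in-memory buffering of small arrays mentioned on slide 12 is left out. It assumes level i holds either nothing or one sorted array of 2^i keys, so an insert behaves like adding 1 to a binary counter.

```python
from heapq import merge


class DoublingArray:
    """Toy doubling array: level i is None or a sorted list of 2**i keys."""

    def __init__(self):
        self.levels = []  # hypothetical layout, chosen for illustration

    def insert(self, key):
        carry = [key]
        i = 0
        # Like binary addition: while the level is occupied, merge the two
        # equal-sized sorted arrays and carry the result one level up.
        while i < len(self.levels) and self.levels[i] is not None:
            carry = list(merge(self.levels[i], carry))
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = carry


da = DoublingArray()
for k in (2, 9, 11, 8):  # the keys shown in the slide diagrams
    da.insert(k)
print(da.levels)
```

Each merge reads and writes whole sorted arrays front to back, which is why the slides count doubling array updates in sequential rather than random IOs.
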
  • 15. B = “block size”, say 8KB; at 100 bytes/entry ~= 100 entries
                          Update                         Range Query (Size Z)
        B-Tree            O(log_B N) random IOs          O(Z/B) random IOs
        Doubling Array    O((log N)/B) sequential IOs
  • 16. Doubling Array Queries query(k) • Add an index to each array to do lookups • query(k) searches each array independently
  • 17. Doubling Array Queries query(k) • Bloom Filters can help exclude arrays from search • ... but don’t help with range queries
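
The point query described on slides 16 and 17 might look roughly like the sketch below. Everything here is an assumption made for illustration: the `BloomFilter` class, its sizes and SHA-256 based hashing, and the `build` and `query` helpers are hypothetical and say nothing about how castle_bloom.c works. Binary search over the sorted keys stands in for the per-array index.

```python
import hashlib
from bisect import bisect_left


class BloomFilter:
    """Small Bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build(pairs):
    """Return (sorted keys, matching values, Bloom filter) for one array."""
    pairs = sorted(pairs)
    keys = [k for k, _ in pairs]
    values = [v for _, v in pairs]
    bloom = BloomFilter()
    for k in keys:
        bloom.add(k)
    return keys, values, bloom


def query(arrays, key):
    """Search arrays newest-first; return (value, arrays actually searched)."""
    searched = 0
    for keys, values, bloom in arrays:
        if not bloom.might_contain(key):
            continue  # the filter excludes this array from the search
        searched += 1
        i = bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i], searched
    return None, searched


arrays = [build([(8, "h"), (11, "k")]), build([(2, "b"), (9, "i")])]
print(query(arrays, 9))
```

A range query cannot use the same shortcut, because a Bloom filter only answers membership for a single key; every array overlapping the range still has to be consulted.
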
  • 18. B = “block size”, say 8KB; at 100 bytes/entry ~= 100 entries
                          Update                         Range Query (Size Z)
        B-Tree            O(log_B N) random IOs          O(Z/B) random IOs
        Doubling Array    O((log N)/B) sequential IOs    O(Z/B) sequential IOs
        B-Tree updates: 8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s; ~ log(2^30)/log 100 = 5 IOs/update; 100 / 5 = 20 updates/s
        Doubling Array updates: 8KB @ 100MB/s = 13k IOs/s; ~ log(2^30)/100 = 0.2 IOs/update; 13k / 0.2 = 65k updates/s
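
The back-of-envelope figures on slide 18 can be reproduced with a short calculation. The sketch assumes natural logarithms (which gives the slide's ~0.2 IOs/update), binary units for KB and MB, and that a random IO costs one seek plus the transfer time; all variable names are made up for illustration. The slide rounds more aggressively (100 IOs/s, 5 IOs/update), so the exact outputs differ slightly from its 20 and 65k updates/s.

```python
import math

N = 2 ** 30                      # keys in the index
B = 100                          # entries per 8KB block at ~100 bytes/entry
BLOCK_BYTES = 8 * 1024
BANDWIDTH = 100 * 1024 * 1024    # 100MB/s
SEEK_S = 0.008                   # 8ms seek

transfer_s = BLOCK_BYTES / BANDWIDTH

# B-Tree: every update costs about log_B(N) random IOs.
random_iops = 1 / (SEEK_S + transfer_s)
btree_ios_per_update = math.log(N) / math.log(B)
btree_updates_per_s = random_iops / btree_ios_per_update

# Doubling array: every update costs about (log N)/B sequential IOs.
sequential_iops = 1 / transfer_s
da_ios_per_update = math.log(N) / B
da_updates_per_s = sequential_iops / da_ios_per_update

print(f"B-Tree:         {btree_ios_per_update:.1f} IOs/update, "
      f"{btree_updates_per_s:.0f} updates/s")
print(f"Doubling array: {da_ios_per_update:.2f} IOs/update, "
      f"{da_updates_per_s:.0f} updates/s")
```

Whatever rounding is used, the gap is about three orders of magnitude, which is the comparison the slide is making.
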
  • 19. Doubling Array [Architecture diagram detail: doubling array mapping layer] castle_{da,bloom}.c
  • 20. Doubling Arrays doubling array mapping layer “Mod-list” B-Tree insert Bloom filters queues key get arrays x range arrays queries management key insert merges Arrays mapping layer modlist btree key Version tree insert btree key get btree range queries value arrays Cache block mapping & cacheing layer "Extent" layer prefetcher extent block extent cache freespace allocator manager flusher & mapper page cache castle_{btree,versions}.c &
  • 21. Copy-on-Write BTree. Idea: • Apply path-copying [DSST] to the B-tree. Problems: • Space blowup: each update may rewrite an entire path • Slow updates: for the same reason. A log file system makes updates sequential, but relies on random access and garbage collection (Achilles heel!)
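
Path copying is easiest to see on a plain binary search tree, so the sketch below uses one as a simplified stand-in for a B-tree. The `Node`, `insert` and `get` names are illustrative assumptions. Every insert returns a new root and copies only the nodes on the root-to-leaf path; everything else is shared with the previous version.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return the root of a new version; the old version stays readable."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
v2 = insert(v1, 1, "z")  # copies the nodes for keys 5 and 2, shares the node for key 8
print(get(v1, 1), get(v2, 1), v2.right is v1.right)
```

The copied path is the source of the space blowup noted on the slide: one small update rewrites a node at every level of the tree.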
  • 22. Update / Range Query / Space. CoW B-Tree: update O(log_B Nv) random IOs; range query O(Z/B) random IOs; space O(N B log_B Nv). Nv = #keys live (accessible) at version v
  • 23. “BigTable” snapshots [diagram: version v1; arrays a and b, each with ref count 1] • Inserts produce arrays
  • 24. “BigTable” snapshots [diagram: versions v1 and v2; arrays a and b with ref count 2, new array c with ref count 1] • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays
  • 25. “BigTable” snapshots [diagram: versions v1 and v2; old arrays a and b with ref count 1, merged array (a b c) with ref count 1] • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays
  • 26. “BigTable” snapshots [diagram: as on slide 25] • Inserts produce arrays • Snapshots increment ref counts on arrays • Merges produce more arrays, decrement ref count on old arrays • Space blowup
  • 27. Update / Range Query / Space. CoW B-Tree: update O(log_B Nv) random IOs; range query O(Z/B) random IOs; space O(N B log_B Nv). “BigTable” style DA: update O((log N)/B) sequential IOs; range query O(Z/B) sequential IOs; space O(VN). Nv = #keys live (accessible) at version v
  • 28. “Mod-list” BTree. Idea: • Apply fat-nodes [DSST] to the B-tree • i.e. insert (key, version, value) tuples, with special operations. Problems: • Similar performance to a BTree. If you limit the #versions, it can be constructed sequentially, and embedded into a DA
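
One way to picture (key, version, value) tuples is the sketch below, which keeps a small per-key map of versions in the spirit of fat nodes. The version tree in `parent`, and the `put` and `get` helpers, are hypothetical; the slides do not spell out the "special operations" or how Castle lays these tuples out on disk.

```python
# Hypothetical version tree: v1 and v2 are both snapshots/clones of v0.
parent = {0: None, 1: 0, 2: 0}

# key -> {version: value}; each key carries its own list of modifications.
entries = {}


def put(key, version, value):
    entries.setdefault(key, {})[version] = value


def get(key, version):
    """Return the value visible at `version`: its own entry, else the nearest ancestor's."""
    versions = entries.get(key, {})
    v = version
    while v is not None:
        if v in versions:
            return versions[v]
        v = parent[v]
    return None


put("k1", 0, "base")
put("k1", 2, "changed in v2")
put("k2", 1, "only in v1")
print(get("k1", 1), get("k1", 2), get("k2", 2))
```

Unlike path copying, an update here adds a single tuple, so space stays proportional to the number of writes.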
  • 29. Update / Range Query / Space. CoW B-Tree: update O(log_B Nv) random IOs; range query O(Z/B) random IOs; space O(N B log_B Nv). “BigTable” style DA (LevelDB): update O((log N)/B) sequential IOs; range query O(Z/B) sequential IOs; space O(VN). “Mod-list” in a DA (CASTLE): update O((log N)/B) sequential IOs; range query O(Z/B) sequential IOs; space O(N). Nv = #keys live (accessible) at version v
  • 30. Stratified BTree. Problem: embedded “Mod-list” #versions limit. Solution: version-split arrays during merges. [diagram: a newer and an older array of (key, version) entries over keys k1..k5 and versions v0..v2 are merged (duplicates removed), then v-split into one array for {v2} and one for {v1,v0}; the v0 entries in the {v2} array are marked as duplicates]
  • 31. “Mod-list” B-Tree [architecture diagram, as on slide 20] castle_{btree,versions}.c
  • 32. Disk Layout: RDA [architecture diagram: arrays mapping layer, block mapping & caching layer and "Extent" layer, on top of the Linux kernel's block layer and memory manager] castle_{cache,extent,freespace,rebuild}.c
  • 33. Disk Layout: RDA (random duplicate allocation) [diagram: blocks 1 to 16 spread across the disks, each block stored twice]
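
The layout on slide 33 can be imitated with a few lines of code. This is only a sketch of the general idea in [RDA], with an assumed disk count of four and a hypothetical `allocate` helper; it is not Castle's extent allocator.

```python
import random


def allocate(num_blocks, num_disks, rng):
    """Place two copies of every block on two distinct, randomly chosen disks."""
    disks = [[] for _ in range(num_disks)]
    for block in range(1, num_blocks + 1):
        for d in rng.sample(range(num_disks), 2):  # sample() never repeats a disk
            disks[d].append(block)
    return disks


rng = random.Random(0)  # fixed seed so the layout is reproducible
disks = allocate(num_blocks=16, num_disks=4, rng=rng)
for i, contents in enumerate(disks):
    print(f"disk {i}: {contents}")
```

Because every block has a copy on a second disk, a reader can pick whichever of the two disks is less busy, and a single disk failure loses no data.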
  • 35. Small random inserts: inserting 3 billion rows [chart: Acunu-powered Cassandra vs ‘standard’ Cassandra]
  • 36. Insert latency: while inserting 3 billion rows [chart: Acunu-powered Cassandra vs ‘standard’ Cassandra]
  • 37. Small random range queries: performed immediately after inserts [chart: Acunu-powered Cassandra vs ‘standard’ Cassandra]
  • 38. Performance summary (Standard / Acunu / Benefits). Inserts: rate ~32k/s / ~45k/s / >1.4x; 95% latency ~32s / ~0.3s / >100x. Gets: rate ~100/s / ~350/s / >3.5x; 95% latency ~2s / ~0.5s / >4x. Range queries: rate ~0.4/s / ~40/s / >100x; 95% latency ~15s / ~2s / >7.5x
  • 39. • Castle: like BDB, but for Big Data • DA: transforms random IO into sequential IO • Snapshots & Clones: addressing real problems with new workloads • 2 orders of magnitude better performance and predictability
  • 40. Questions? Tom Wilkie @tom_wilkie tom@acunu.com http://bitbucket.org/acunu http://www.acunu.com/download http://www.acunu.com/insights
  • 41. References. [LSM] The Log-Structured Merge-Tree (LSM-Tree), Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf ; [COLA] Cache-Oblivious Streaming B-trees, Michael A. Bender et al, http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf ; [DSST] Making Data Structures Persistent, J. R. Driscoll, N. Sarnak, D. D. Sleator, R. E. Tarjan, Journal of Computer and System Sciences, Vol. 38, No. 1, 1989, http://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf ; Stratified B-trees and versioned dictionaries, Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes, Tom Wilkie, HotStorage’11, http://www.usenix.org/event/hotstorage11/tech/final_files/Twigg.pdf ; [RDA] Random duplicate storage strategies for load balancing in multimedia servers, 2000, Joep Aerts and Jan Korst and Sebastian Egner, http://www.win.tue.nl/~joep/IPL.ps ; Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.