A tunably consistent, highly-available, distributed database

     Tom Wilkie @tom_wilkie
  Founder & VP Engineering, Acunu




•   Overview
•   Distribution
•   Storage
•   Datamodel
•   Usecases




•   A distributed database for Big Data
    •   Scale out on commodity servers
    •   Best-of-breed performance
    •   Multi-master architecture, no SPOF
    •   Powerful multi data centre support



    •   BigTable, 2006
    •   Dynamo, 2007
    •   Open sourced, 2008
    •   Incubator, 2009
    •   TLP, 2010
    •   v1.0, 2011
BigTable: ...
    •   Simple but powerful datamodel
    •   Write-optimised storage system
    •   Consistent, available but not partition tolerant
    •   Master-slave distribution system, SPOF




                                            http://goo.gl/7T1Ej

Dynamo: ...
    •   Sophisticated distribution system with tradable
        consistency and availability
    •   Over-simple datamodel




                                          http://goo.gl/Q80b4

•   Overview
•   Distribution
•   Storage
•   Datamodel
•   Usecases




Distribution: Consistent Hashing
     r1, c1  →  v1
     r2, c2  →  v2
     r3, c3  →  v3
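
A minimal sketch of consistent hashing, to make the placement above concrete. The node names, the choice of MD5 as the hash, and hashing only the row key are assumptions for illustration, not details taken from the slides.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its own token.
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def node_for(self, row_key: str) -> str:
        # First node clockwise from the key's token, wrapping past the end.
        i = bisect.bisect_left(self._tokens, token(row_key)) % len(self._entries)
        return self._entries[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
placement = {key: ring.node_for(key) for key in ("r1", "r2", "r3")}
```

The property that matters for scaling: when a node joins, the only keys that change owner are the ones that move to the new node, so the rest of the cluster is undisturbed.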




Distribution: Scaling




Distribution: Replication
     r1, c1  →  v1
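
One common way to choose replicas on a hash ring is to walk clockwise from the key's token and take the next N distinct nodes. The sketch below assumes that simple placement rule, made-up node names and MD5 tokens; it ignores racks and data centres.

```python
import bisect
import hashlib


def token(key):
    """Position of a key or node on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replicas_for(row_key, nodes, n=3):
    """Return the n distinct nodes found walking clockwise from the key's token."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(row_key))
    count = min(n, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas_for("r1", nodes, n=3)
```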




Distribution: Consistency
     Tunable, per-operation consistency
     Timestamped values; reads see the latest write when R + W > N
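
The quorum rule can be checked by brute force. With N replicas, W write acknowledgements and R read responses, every read set shares at least one replica with every write set exactly when R + W > N. The sketch below enumerates the sets; the function names are illustrative.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every possible read set shares a replica with every write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def newest(responses):
    """Resolve replica responses, given as (timestamp, value), by timestamp."""
    return max(responses, key=lambda tv: tv[0])[1]


# N = 3: quorum reads and writes (R = W = 2) overlap; R = W = 1 do not.
strong = quorums_always_overlap(3, 2, 2)
weak = quorums_always_overlap(3, 1, 1)
```

Lower R or W trades consistency for latency and availability, which is the per-operation tuning the slide refers to.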




Distribution: Read Repair
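
A rough sketch of the idea behind read repair, assuming each replica is an in-process dict of key to (timestamp, value). A real cluster does considerably more (for example, comparing digests rather than full values); this only shows the newest-timestamp-wins write-back.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, fix stale copies.

    `replicas` is a list of dicts mapping key -> (timestamp, value).
    """
    responses = [(replica, replica.get(key)) for replica in replicas]
    found = [cell for _, cell in responses if cell is not None]
    if not found:
        return None
    winner = max(found, key=lambda cell: cell[0])
    for replica, cell in responses:
        if cell != winner:
            replica[key] = winner  # push the newest version to the stale replica
    return winner[1]


a = {"r1": (2, "v1-new")}
b = {"r1": (1, "v1-old")}
c = {}  # this replica missed the write entirely
value = read_with_repair([a, b, c], "r1")
```

After the read, all three replicas hold the newest version, so later reads at a lower consistency level are more likely to be correct.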




•   Overview
•   Distribution
•   Storage
•   Datamodel
•   Usecases




Writing to Cassandra
     •   A write is a row: Row Key, Column, Column, Column, Column
     •   In the JVM, the row is added to the Memtable; on disk, it is appended to the Commit log
     •   When the Memtable is full, a new Memtable takes its place
     •   The full Memtable is written to disk as an SSTable
     •   SSTables accumulate on disk as Memtables fill and flush
     •   Multiple SSTables are merged into a single SSTable
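
The write path can be sketched as a toy in-memory model. Everything here is an assumption for illustration: the class and method names are made up, the "disk" structures are Python lists, and conflicts are resolved by table age rather than by the per-column timestamps Cassandra actually uses.

```python
class Store:
    def __init__(self, memtable_limit=3):
        self.commit_log = []      # stands in for the on-disk, append-only log
        self.memtable = {}        # in memory, holds the most recent writes
        self.sstables = []        # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, row_key, columns):
        self.commit_log.append((row_key, dict(columns)))  # durability first
        self.memtable.setdefault(row_key, {}).update(columns)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # The full memtable becomes an SSTable, sorted by row key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}        # a new, empty memtable takes over
        self.commit_log.clear()   # flushed data no longer needs the log

    def compact(self):
        # Merge all SSTables into one; newer tables win column by column.
        merged = {}
        for table in self.sstables:
            for row_key, columns in table:
                merged.setdefault(row_key, {}).update(columns)
        self.sstables = [sorted(merged.items())]


store = Store(memtable_limit=2)
store.write("r1", {"c1": "v1"})
store.write("r2", {"c2": "v2"})    # memtable is full: flushed to an SSTable
store.write("r1", {"c1": "v1b"})
store.write("r3", {"c3": "v3"})    # second flush
store.compact()                    # two SSTables merged into one
```

Writes never modify data in place: they only append to the log and update memory, which is what makes the design write-optimised.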
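Reading from Cassandra
     Off-heap (no GC):   Row cache
     In the JVM:         Memtable, Bloom filter, Key cache, SSTable index
     On disk:            Commit log, SSTable
     The diagram numbers six lookups: 1 and 2 are the Memtable and Row cache,
     3 the Bloom filter, 4 the Key cache, 5 the SSTable index, 6 the SSTable on disk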
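
A simplified sketch of why the Bloom filter sits in front of the SSTables: it lets a read skip tables that cannot contain the key. The class names, filter size and hash construction are assumptions; the row cache, key cache and SSTable index are left out, and a real read merges columns across tables instead of returning the first hit.

```python
import hashlib


class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.field >> p & 1 for p in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.reads = 0  # counts simulated disk reads

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


def read(key, memtable, sstables):
    if key in memtable:                      # newest data lives in memory
        return memtable[key]
    for table in reversed(sstables):         # newest SSTable first
        if table.bloom.might_contain(key):   # skip tables that cannot hold the key
            value = table.get(key)
            if value is not None:
                return value
    return None


old = SSTable({"r1": "v1"})
new = SSTable({"r2": "v2"})
```

A Bloom filter can return false positives but never false negatives, so skipping a table on a negative answer is always safe.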
•   Overview
•   Distribution
•   Storage
•   Datamodel
•   Usecases




SQL                 Cassandra
     Database            Keyspace
     Table               Column Family

     Each Column Family holds rows: row/key, col_1, col_2
             col1   col2   col3   col4   col5   col6   col7
     row1           x                    x      x
     row2    x      x      x      x      x
     row3           x      x             x      x      x
     row4           x      x      x             x
     row5           x             x      x      x
     row6           x
     row7    x      x             x



     alice: {
        m2: {
           Sender: bob,
           Subject: ‘paper!’, ...
        }
     }

     bob: {
        m1: {
            Sender: alice,
            Subject: ‘rock?’, ...
        }
     }

     charlie: {
        m1: {
           Sender: alice,
           Subject: ‘rock?’, ...
        },
        m2: {
           Sender: bob,
           Subject: ‘paper!’, ...
        }
     }
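
The layout above reads as a denormalised inbox: one row per user, one entry per message, with each message copied into every recipient's row. A small sketch of that pattern using nested dicts follows; the `deliver` helper and the recipient lists are assumptions inferred from the example, and the elided fields are left out.

```python
from collections import defaultdict

# Column family modelled as: row key -> message id -> message fields.
inbox = defaultdict(dict)


def deliver(message_id, sender, subject, recipients):
    """Denormalise: store one copy of the message in every recipient's row."""
    for recipient in recipients:
        inbox[recipient][message_id] = {"Sender": sender, "Subject": subject}


deliver("m1", "alice", "rock?", ["bob", "charlie"])
deliver("m2", "bob", "paper!", ["alice", "charlie"])
```

Reading a user's inbox is then a single-row lookup, at the cost of writing each message several times.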




•   Overview
•   Distribution
•   Storage
•   Datamodel
•   Usecases




Perfect for high velocity data
               Web, SCM, Retail    Location Services   Cloud Monitoring




                   Social Gaming     Social Media      Ad Marketplaces




               Fraud Detection     Smart Metering      Oil/Gas Sensors


Wednesday, 25 April 12
Not Covered...
     •   Distribution: Hinted Handoff, Anti-entropy repair,
         Counter distribution
     •   Storage: Counter storage, different compaction
         strategies, partitioning etc
     •   Datamodel: de-normalisation, TTLs, secondary
         indexes, CQL, super-columns, schema optional
     •   Operations: backup, nodetool, performance tuning
     •   Integration: Hadoop, Client Libraries etc
• Distributed, scalable database
• Open source, widely used
• Tunably consistent
• Highly-available
• Partition tolerant
• Write-optimised
• Schema-optional
Data Platform
     Data driven applications    |   Web UI
     Acunu Analytics             |
     Apache Cassandra            |   Control Center
     Acunu Storage Engine
     Configured and tuned OS
     Commodity Hardware
Control Center




“I've had the EC2 instance running for a little while and I have to say, I'm impressed. You guys have done well with this product.”
     - Lloyd, JustDevelopIt
Control Center




“The new UI has been critical in helping us work out what is wrong in our code”
     - Matt, TellyBug
Castle: Built for Big Data
     •   Storage engine optimized for large slow disks, many cores, Big Data workloads
     •   Enterprise density on commodity hardware
     •   Lightning disk rebuilds: 10x faster than RAID
     •   Open source (GPLv2, MIT for user libraries)
     •   http://bitbucket.org/acunu
     •   Loadable Kernel Module, targeting CentOS’s 2.6.18
     •   http://www.acunu.com/blogs/andy-twigg/why-

     [Architecture diagram, http://goo.gl/gzihe. Layers from top to bottom:]
     Userspace / Acunu Kernel boundary
     userspace interface:             shared memory interface (keys, values; async shared memory ring, shared buffers); in-kernel workloads
     kernelspace interface:           streaming interface (range queries, key insert, buffered value insert, key get, buffered value get)
     doubling array mapping layer:    Doubling Arrays (insert queues, Bloom filters, key insert, key get, range queries, arrays, arrays management, merges)
     modlist btree mapping layer:     Version tree, btree (key insert, key get, range queries), value arrays
     block mapping & caching layer:   Cache, "Extent" layer (freespace manager, extent allocator, extent block cache, prefetcher, flusher)
Rebuild time
     [Bar chart: Rebuild Time (Hours), axis 0 to 5, for RAID10, 8 Disks; RAID5, 8 Disks; RDA, 8 Disks; RDA, 15 Disks. Bar values not preserved.]
Analytics

     Click stream, Sensor data, etc  →  events  →  Acunu Analytics  →  counter updates




     •   Simple, real-time, incremental analytics
     •   Push processing into ingest phase
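
To illustrate pushing processing into the ingest phase: rather than scanning raw events at query time, each event increments the counters it contributes to as it arrives. The event shape, counter keys and bucket granularity below are hypothetical, not Acunu Analytics' actual interface.

```python
from collections import Counter
from datetime import datetime, timezone

counters = Counter()


def ingest(event):
    """Update every aggregate this event contributes to, at write time."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    page = event["page"]
    counters[("clicks", page, ts.strftime("%Y-%m-%d"))] += 1      # daily roll-up
    counters[("clicks", page, ts.strftime("%Y-%m-%d %H"))] += 1   # hourly drill-down


for e in [
    {"ts": 1335355200, "page": "/home"},   # 2012-04-25 12:00:00 UTC
    {"ts": 1335355260, "page": "/home"},   # one minute later
    {"ts": 1335358800, "page": "/home"},   # one hour after the first
]:
    ingest(e)
```

A query for a day or an hour is then a single counter read, and no recomputation over historical events is needed when new data arrives.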

Questions?
 tom@acunu.com
   @tom_wilkie
 www.acunu.com




Introduction



     Live & historical
       aggregates...




Realtime trends...




Drill downs
     and roll ups


Solution                   Con
     (not preserved)            Scalability, $$$
     (not preserved)            Not realtime, inefficient recomputation
     (not preserved)            Spartan query semantics => complex, DIY solutions

Progressive NOSQL: Cassandra
