Durability for Memory-Based Key-Value Stores

Kiarash Rezahanjani
July 4, 2012
Durability

[Figure: a client issues set(university, UPC) against a data store that currently holds (university, KTH); the store replies with an Ack, and a later get(university) returns UPC.]
Durability

[Figure: set(university, UPC) against a data store running on commodity hardware; the Ack is sent once the write reaches non-volatile storage.]
Durability

[Figure: set(myKey, U) against a data store on commodity hardware, acknowledged with an Ack.]
Durability

Disk access cost = seek time + rotational latency + transfer time.

[Figure: both writes and reads against the disk are SLOW.]
Cache in memory

[Figure: cached objects live in memory, so reads are fast; the primary copy of the objects stays on disk, so writes are slow. This split raises a consistency question.]
Cache in memory

[Figure: application servers in front of memcache servers and MySQL servers. A set of ObjA updates MySQL and deletes ObjA from memcache; a later read of ObjA misses the cache and must fetch from MySQL.]

Problems: stale data, resources spent on cache misses, complicated development, and writes that are still slow.
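The look-aside pattern on this slide (reads fall back to the database on a miss and populate the cache; writes update the database and delete the cached key) can be sketched roughly as follows. This is a minimal sketch, not the deck's code: `cache` and `db` are plain dicts standing in for the memcache and MySQL tiers, and the function names are ours.

```python
cache = {}
db = {"ObjA": "v1"}

def read(key):
    if key in cache:                 # cache hit: fast, served from memory
        return cache[key]
    value = db[key]                  # cache miss: slow read from the database
    cache[key] = value               # populate the cache for later readers
    return value

def write(key, value):
    db[key] = value                  # the write still pays the database latency
    cache.pop(key, None)             # invalidate, so readers don't see stale data

print(read("ObjA"))   # miss: fetched from db and cached
write("ObjA", "v2")
print(read("ObjA"))   # miss again after invalidation: returns the new value
```

Deleting rather than updating the cached copy on writes is what produces the "Read ObjA -> Cache Miss" step in the figure: the next reader pays a database round trip to repopulate the cache.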
Memory-Based Databases

[Figure: the primary copy of the objects lives in memory, so reads are fast and there is no stale or inconsistent data. Open questions: durability, write latency, and backup.]
Approaches towards durability

- Periodic snapshots (state A → state B): data loss between snapshots.
- Synchronous logging: slow.
- Asynchronous logging: data loss on a crash.
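The trade-off between the last two options can be made concrete with a minimal write-ahead-log append. This is a sketch with assumed names (the file `wal.log` and the function `append` are ours): synchronous logging pays an fsync per entry, while asynchronous logging leaves entries in OS buffers, where a crash can lose them.

```python
import os

def append(log_path, entry, synchronous):
    """Append one entry to the log; fsync only in synchronous mode."""
    with open(log_path, "ab") as f:
        f.write(entry + b"\n")
        if synchronous:
            f.flush()
            os.fsync(f.fileno())   # slow: forces the entry onto stable storage

# Durable but slow: the Ack may only be sent after this returns.
append("wal.log", b"set university UPC", synchronous=True)

# Fast but unsafe: entries still in the OS buffer are lost on a crash.
append("wal.log", b"set myKey U", synchronous=False)
```

Note that even `fsync` only guarantees durability if the disk's own write cache is disabled or battery-backed, a caveat the evaluation notes return to.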
Approaches towards durability

[Figure: the data is kept on several replicas. This is expensive, and a catastrophic failure can still lose everything.]
Project Goals

- Durable writes
- Low latency
- Availability: able to recover quickly
- Cheap: commodity hardware
Target systems
•   Data is big = many machines
•   Read-dominant workload
•   Simple key-value store
•   Small writes
    – Example: Facebook
       •   Terabytes of data = 2,000 memcache servers
       •   Write/read ratio < 6%
       •   Memcache is a key-value store
       •   Status updates, photo tags, profile updates, etc.
Solution
Design decisions

- Periodic snapshot vs. message logging → message logging
- Local disk vs. remote location → remote location
- Remote file server vs. local disks of the database cluster → local disks of the database cluster
Design Decision

[Figure: the database client sends each write as a Log to remote storage and receives an Ack once it is stored.]
Design Decision

Two problems:
1) Logging must be synchronous; asynchronous logging suffers data loss.
2) Data availability.

→ Replication
Replication

[Figure: each write is logged to several servers before the Ack is sent; the log is replicated across machines.]
Replication

[Figure: two replication schemes side by side. Broadcast: a master logs the write, forwards it to the slaves, and sends the Ack. Chain replication: the write enters at the head, flows down the chain, and the tail sends the Ack.]
Replication

Broadcast

[Figure: the master receives the Log, forwards it to three slaves, and sends the Ack.]
Replication

Chain replication

[Figure: the Log enters at the head of the chain and propagates toward the tail, which sends the Ack.]
Replication

Chain replication

[Figure: the same chain at a later step; once the entry has propagated, the tail sends the Ack back.]
Chain Replication

[Figure: the database client sends the write to the head of a chain of log servers; the Log is forwarded along the chain, and the client receives the Ack from the tail.]
Chain Replication

[Figure: a chain of log servers gives the database client the abstraction of synchronous logging with low latency, and keeps the logs available. Together the chain forms a Stable Storage Unit.]
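A minimal sketch of the chain behavior described above (class and method names are ours, not the project's): an entry appended at the head is forwarded server by server, and only the tail acknowledges, so an acknowledged entry is already present on every replica in the unit.

```python
class LogServer:
    """One server in the chain; forwards appends toward the tail."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.successor = None          # next server toward the tail

    def append(self, entry):
        self.log.append(entry)
        if self.successor is not None:
            return self.successor.append(entry)   # keep forwarding
        return "Ack"                              # only the tail acknowledges

# A three-server chain forming one stable storage unit.
head, middle, tail = LogServer("head"), LogServer("middle"), LogServer("tail")
head.successor, middle.successor = middle, tail

print(head.append(b"entry-1"))   # the Ack comes back only via the tail
```

Because the Ack originates at the tail, receiving it implies every server in the chain holds the entry, which is what lets the chain stand in for synchronous logging.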
Log Server
Log Server

[Figure: inside a log server, a Receiver accepts incoming entries (e.g. 7, 6, 5, 3), a Persister appends them to disk using sequential writes (avoiding seek time), and a Reader streams persisted entries back out in order (3, 2, 1).]
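The pipeline in this figure can be sketched as follows (function names and the file name are ours): the receiver only queues entries, the persister drains the queue with append-only writes so the disk head never seeks between entries, and the reader streams entries back in the order they were persisted.

```python
from collections import deque

incoming = deque()                      # filled by the Receiver

def receive(entry):
    incoming.append(entry)

def persist(log_path):
    # Append-only: every write lands at the end of the file, so the disk
    # pays no seek or rotational latency between entries.
    with open(log_path, "ab") as f:
        while incoming:
            f.write(incoming.popleft() + b"\n")

def read_back(log_path):
    # The Reader streams entries back in persisted order.
    with open(log_path, "rb") as f:
        return f.read().splitlines()

for e in (b"1", b"2", b"3"):
    receive(e)
persist("server.log")
print(read_back("server.log"))
```

Batching the queued entries into one sequential append is what lets a commodity disk keep up with a memory-speed write stream.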
Forming storage units

1. Query ZooKeeper
2. Get the list of servers
3. Leader sends a request
4. Leader sends the list of members (ID1, ID2, ID3)
5. Upload the storage-unit data
6. Start the service
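The six steps can be sketched with a plain dict standing in for ZooKeeper's server registry; this is an illustrative stand-in only (no real ZooKeeper client is used, and all names here are ours).

```python
# A dict standing in for the ZooKeeper ensemble's view of the cluster.
zookeeper = {"servers": ["ID1", "ID2", "ID3", "ID4"]}

def form_storage_unit(size):
    servers = zookeeper["servers"]               # steps 1-2: query, get servers
    members = servers[:size]                     # step 3: leader requests members
    leader = members[0]                          # step 4: leader announces the list
    unit = {"leader": leader, "members": members}
    zookeeper["unit-" + leader] = unit           # step 5: upload storage-unit data
    return unit                                  # step 6: the unit starts serving

unit = form_storage_unit(3)
print(unit["members"])   # ['ID1', 'ID2', 'ID3']
```

In the real system the registry lives in ZooKeeper znodes, so clients can discover units and watch for membership changes.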
Storage System

[Figure: several clients talk to ZooKeeper and to four stable storage units.]
Failover

[Figure: a client writes to stable storage units whose servers carry different loads (ID 1: 50%, ID 2: 20%, ID 3: 30%; ID 4: 40%, ID 5: 45%, ID 6: 20%); on failure the client switches to another stable storage unit.]
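One plausible reading of the failover figure, sketched minimally (the structure and names are ours): the client keeps a list of stable storage units and retries a write on the next unit when its current one fails.

```python
# Each stable storage unit: is it alive, and what has it logged?
units = {"unit-A": {"alive": True, "log": []},
         "unit-B": {"alive": True, "log": []}}

def write(entry):
    """Log the entry on the first live unit, failing over as needed."""
    for name, unit in units.items():
        if unit["alive"]:
            unit["log"].append(entry)
            return name                 # the unit that took the write
    raise RuntimeError("no stable storage unit available")

print(write(b"e1"))                     # unit-A
units["unit-A"]["alive"] = False        # unit-A fails
print(write(b"e2"))                     # failover: unit-B
```

Because units are interchangeable, a failed unit only redirects future writes; the entries it already acknowledged survive on its replicas.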
Evaluation
• Throughput and latency of the stable storage unit
  – Log entry sizes
  – Replication factors
• Comparison with WAL to local disk
Single synchronous client
Replication factor of 3

Entry size (bytes)   Latency (ms)   Throughput (entries/sec)
200                  0.45           2,200
1024                 0.62           1,600
4096                 0.99           1,000
Throughput vs. Latency

[Figure: latency vs. throughput (entries/sec) at replication factor 3 for entry sizes of 5 B, 200 B, 1 KB, 4 KB, and 10 KB, with annotated points at roughly 5,000, 14,000, 28,000, and 34,000 entries/sec; smaller entries sustain higher throughput before latency climbs.]
Additional replica

[Figure: latency (microseconds) vs. throughput (entries/sec) for 200-byte entries at replication factors 2 and 3; the two curves stay close together.]
Sustained load
Resource utilization

• Throughput of 6,000 entries/sec
• Log entries of 200 bytes
  – CPU utilization = 9%
  – Bandwidth = 29 Mb/s
  – Dedicated disk
  – Small memory requirement
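The bandwidth figure is consistent with the workload if (our assumption, not stated on the slide) it counts the traffic for all three replicas of each entry:

```python
entries_per_sec = 6_000
entry_bytes = 200
replication_factor = 3        # assumption: the figure covers all replicas

bits_per_sec = entries_per_sec * entry_bytes * 8 * replication_factor
print(round(bits_per_sec / 1e6, 1))   # 28.8 (Mb/s), close to the reported 29
```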
Summary

✓ Durable writes
✓ Low latency
✓ High availability
✓ Scalable
✓ No additional resources
✓ Avoids dependencies


Editor's Notes

  • #3 Resume
  • #10 Periodicsnapshop: degrade the performance at the time of snapshot, generate load spikeon machine
  • #13 Important not to try to be all things to all people– Clients might be demanding 8 different things– Doing 6 of them is easy– …handling 7 of them requires real thought– …dealing with all 8 usually results in a worse system• more complex, compromises other clients in trying to satisfy everyoneE.g.Facebook 2008 – 800 memcache server – 2000 now &lt; 6% writeUpdatessmall (expecttag, addfriend, new ads, status, profileupdate, sharing)
  • #20 After log isreplicated in memory of several machines ackissendtotheclientIfsome of theprocessescrashsomeotherprocess in other machines willstillpersistthe dataSeveral replicas providebetteravailabilityof data at the time of recoveryAggregatethereadbandwidth of the servers toacceceleratetherecovery
  • #24 Adding replica doesnt introduce bottleneck and doesnotimpactthroughput
  • #30 Scalablility
  • #36 Replication factor of three
  • #39 Commonapproach WAL to local disk, Redisisanexample of a popular in memorydatabase uses WAL to diskToGuranteedurability of every log ,itshould be writtento disk uponeverywriteoperationEvenwhen log iswrittento disk thereis no guranteethatitispersisted disk, bacauseby default the disk caches are enabledProcesscrash 1.7 alsopoweroutage 49, no availabilityif server isdownOurs factor of 4 betterthan disk with cache disableSaturation can be prevented