In Production: Portrait of a Successful Failure Sean Cribbs @seancribbs  [email_address]
Riak is... a scalable, highly-available, networked key/value store.
Riak Data Model Riak stores values against keys. Encode your data how you like it. Keys are grouped into buckets.
Basic Operations GET /buckets/B/keys/K PUT /buckets/B/keys/K DELETE /buckets/B/keys/K
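For concreteness, here is what those operations look like over the HTTP interface; a sketch (not from the original deck), assuming a node listening on Riak's default HTTP port 8098, a bucket named users, and a key named sean:

# store a JSON value
$ curl -X PUT -H "Content-Type: application/json" -d '{"name":"Sean"}' http://localhost:8098/buckets/users/keys/sean
# fetch it back
$ curl http://localhost:8098/buckets/users/keys/sean
# delete it
$ curl -X DELETE http://localhost:8098/buckets/users/keys/sean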
Extras MapReduce, Link-walking, Value Metadata, Secondary Indexes, Full-text Search, Configurable Storage Engines, Admin GUI
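As a taste of one extra, a MapReduce job can be submitted over HTTP as JSON; a minimal sketch (not from the deck), reusing the hypothetical users bucket and the built-in Riak.mapValuesJson map function:

$ curl -X POST -H "Content-Type: application/json" http://localhost:8098/mapred \
    -d '{"inputs":"users",
         "query":[{"map":{"language":"javascript","name":"Riak.mapValuesJson","keep":true}}]}'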
When things go wrong A Real Customer Story
Situation You have a cluster. Things are great. It’s time to add capacity.
Solution Add a new node
Hostnames This customer named nodes after drinks: Aston IPA Highball Gin Framboise ESB
riak-admin join With Riak, it’s easy to add a new node. On aston:
$ riak-admin join [email_address]
Then you leave for a quick lunch.
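A join immediately kicks off ownership handoff; its progress can be watched from any cluster member with the stock admin tooling (a sketch, not on the slide):

$ riak-admin member_status   # current vs. pending partition ownership
$ riak-admin transfers       # partitions still waiting to hand off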
This can’t be good...
Quick, what do you do? Add another system! Shut down the entire site! Alert Basho Support via an URGENT ticket.
Control the situation Stop the handoff between nodes. On every node we ran:
riak attach
application:set_env(riak_core, handoff_concurrency, 0).
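Attaching to each node works fine at this scale; as an alternative sketch (not what the deck shows), the same change can be pushed to every connected node from a single riak attach session using standard Erlang RPC:

%% from one attached console: set handoff_concurrency to 0 on this node and all connected nodes
rpc:multicall([node() | nodes()], application, set_env,
              [riak_core, handoff_concurrency, 0]).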
Monitor
...for signs of...
Stabilization
Now what? What happened? Why did it happen? Can we fix this situation?
But first Are you still operational? Yes. Any noticeable changes in service latency? No. Have any nodes failed? No. The cluster is still servicing requests.
So what happened?! New node added; ring must rebalance; nodes claim partitions; handoff of data begins; disks fill up.
Member Status First let’s peek under the hood. (Ring is the share of partitions each node currently owns; Pending is what it will own once the rebalance finishes. Note that aston, the new node, still owns only 4.3%.)
$ riak-admin member_status
================================= Membership ==================================
Status     Ring     Pending    Node
-------------------------------------------------------------------------------
valid       4.3%     16.8%     riak@aston
valid      18.8%     16.8%     riak@esb
valid      19.1%     16.8%     riak@framboise
valid      19.5%     16.8%     riak@gin
valid      19.1%     16.4%     riak@highball
valid      19.1%     16.4%     riak@ipa
-------------------------------------------------------------------------------
Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
Relief Let’s try to relieve the pressure a bit. Focus on the node with the least disk space left.
gin:~$ riak attach
application:set_env(riak_core, forced_ownership_handoff, 0).
application:set_env(riak_core, vnode_inactivity_timeout, 300000).
application:set_env(riak_core, handoff_concurrency, 1).
riak_core_vnode:trigger_handoff(element(2, riak_core_vnode_master:get_vnode_pid(411047335499316445744786359201454599278231027712, riak_kv_vnode))).
Relief It took 20 minutes to transfer the vnode.
(riak@gin)7> 19:34:00.574 [info] Starting handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston
gin:~$ sudo netstat -nap | fgrep 10.36.18.245
tcp        0   1065 10.36.110.79:40532   10.36.18.245:8099    ESTABLISHED 27124/beam.smp
tcp        0      0 10.36.110.79:46345   10.36.18.245:53664   ESTABLISHED 27124/beam.smp
(riak@gin)7> 19:54:56.721 [info] Handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston completed: sent 3805730 objects in 1256.14 seconds
Relief And the vnode had arrived at Aston from Gin.
aston:/data/riak/bitcask/205523667749658222872393179600727299639115513856-132148847970820$ ls -la
total 7305344
drwxr-xr-x   2 riak riak       4096 2011-11-11 18:05 .
drwxr-xr-x 258 riak riak      36864 2011-11-11 18:56 ..
-rw-------   1 riak riak 2147479761 2011-11-11 17:53 1321055508.bitcask.data
-rw-r--r--   1 riak riak   86614226 2011-11-11 17:53 1321055508.bitcask.hint
-rw-------   1 riak riak 1120382399 2011-11-11 19:50 1321055611.bitcask.data
-rw-r--r--   1 riak riak   55333675 2011-11-11 19:50 1321055611.bitcask.hint
-rw-------   1 riak riak 2035568266 2011-11-11 18:03 1321056070.bitcask.data
-rw-r--r--   1 riak riak   99390277 2011-11-11 18:03 1321056070.bitcask.hint
-rw-------   1 riak riak 1879298219 2011-11-11 18:05 1321056214.bitcask.data
-rw-r--r--   1 riak riak   56509595 2011-11-11 18:05 1321056214.bitcask.hint
-rw-------   1 riak riak        119 2011-11-11 17:53 bitcask.write.lock
Eureka! Data was not being cleaned up after handoff. This would eventually eat all disk space!
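A quick way to see how much orphaned vnode data is piling up is to size the per-partition Bitcask directories on the sending node (a sketch, not from the deck; the path matches the ones shown above):

# ten largest partition directories under the Bitcask data root, sizes in KB
gin:~$ du -sk /data/riak/bitcask/* | sort -n | tail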
What’s the solution? We already had a bugfix for the next release (1.0.2) that detects the problem. We tested the bugfix locally before delivering it to the customer.
Hot Patch We patched their live, production system while still under load. (on all nodes)
riak attach
l(riak_kv_bitcask_backend).
m(riak_kv_bitcask_backend).
Module riak_kv_bitcask_backend compiled: Date: November 12 2011, Time: 04.18
Compiler options: [{outdir,"ebin"},
  debug_info, warnings_as_errors,
  {parse_transform,lager_transform},
  {i,"include"}]
Object file: /usr/lib/riak/lib/riak_kv-1.0.1/ebin/riak_kv_bitcask_backend.beam
Exports:
  api_version/0    is_empty/1
  callback/3       key_counts/0
  delete/4         key_counts/1
  drop/1           module_info/0
  fold_buckets/4   module_info/1
  fold_keys/4      put/5
  fold_objects/4   start/2
  get/3            status/1
...
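For readers unfamiliar with Erlang hot loading, the flow behind that console output is roughly as follows (a sketch; the copy step and the local .beam filename are assumptions, the target path is taken from the output above):

# on each node: drop the patched module into the release's code path
$ sudo cp riak_kv_bitcask_backend.beam /usr/lib/riak/lib/riak_kv-1.0.1/ebin/
# then reload it inside the running VM
$ riak attach
l(riak_kv_bitcask_backend).   %% purge the old code and load the new .beam
m(riak_kv_bitcask_backend).   %% print module info to confirm the new compile date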
Bingo! And the new code did what we expected.
{ok, R} = riak_core_ring_manager:get_my_ring().
[riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].
(riak@gin)19> [riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].
22:48:07.423 [notice] Unused data directories exist for partition "11417981541647679048466287755595961091061972992": "/data/riak/bitcask/11417981541647679048466287755595961091061972992"
22:48:07.785 [notice] Unused data directories exist for partition "582317058624031631471780675535394015644160622592": "/data/riak/bitcask/582317058624031631471780675535394015644160622592"
22:48:07.829 [notice] Unused data directories exist for partition "782131735602866014819940711258323334737745149952": "/data/riak/bitcask/782131735602866014819940711258323334737745149952"
[{ok,<0.30093.11>},...
Manual Cleanup So we backed up those vnodes with unused data on Gin to another system and manually removed them.
gin:/data/riak/bitcask$ ls manual_cleanup/
11417981541647679048466287755595961091061972992
582317058624031631471780675535394015644160622592
782131735602866014819940711258323334737745149952
gin:/data/riak/bitcask$ rm -rf manual_cleanup
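The staging and backup steps themselves are not on the slide; a plausible reconstruction (the destination host backup-host is hypothetical) is:

gin:/data/riak/bitcask$ mkdir manual_cleanup
# move the orphaned partition directories aside
gin:/data/riak/bitcask$ mv 11417981541647679048466287755595961091061972992 \
    582317058624031631471780675535394015644160622592 \
    782131735602866014819940711258323334737745149952 manual_cleanup/
# copy them to another system for safekeeping before the rm shown above
gin:/data/riak/bitcask$ rsync -a manual_cleanup/ backup-host:/backup/gin-bitcask/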
Gin’s Status Improves
Bedtime It was late at night, things were stable and the customer’s users were unaffected. We all went to bed, and didn’t reconvene for 12 hours.
Next Day’s Plan Start up handoff on the node with the lowest disk space. Let it move data one partition at a time to other nodes. Observe that data directories are removed after transfers complete successfully. When disk space frees up a bit, start handoff on the other nodes, increase handoff concurrency, and watch the ring rebalance.
Let’s Get Started On Gin only: reset to defaults, re-enable handoffs. On gin:
application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).
Gin Moves Data to IPA
Highball’s Turn Highball was next lowest; now that Gin was handing data off, it was time to re-enable handoff there too.
on highball:
application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).
on gin:
application:set_env(riak_core, handoff_concurrency, 4). % the default setting
riak_core_vnode_manager:force_handoffs().
Rebalance Starts
and keeps going...
and going...
and going...
Rebalanced
Minimal Impact 6 ms variance at the 99th percentile (32 ms to 38 ms). 0.68 s variance at the 100th percentile (0.12 s to 0.8 s).
Moral of the Story Riak’s resilience under stress resulted in minimal operational impact. Hot code-patching solved the problem in situ, without downtime. We all got some sleep!
Things break. Riak bends.
Thank You http://basho.com/resources/downloads/ https://github.com/basho/riak/ [email_address]


Editor's Notes

  • #2 Thank you to Greg Burd for most of these slides. He was going to give the presentation, but did not feel well enough to be here tonight.
  • #25 Gin had not removed that vnode’s data directory after sending it to Aston. We had confirmation: data was not being removed after transfers finished. This would have eventually eaten all space on all nodes and halted the cluster.
  • #26 We already had a solution ready for 1.0.2 which would properly identify any orphaned vnodes, so why not simply use that? We tested it on our laptops, creating a close approximation of the customer’s environment.
  • #31 At this point it was late at night, the cluster was servicing requests as always, and customers had no idea anything was wrong. We all went to bed and didn’t reconvene for 12 hours.
  • #33 On Gin only, we reset things we’d changed to default values and then re-enabled handoffs.