1,000,000 daily users and no cache

Online games pose a few interesting challenges on the backend: a single user generates one HTTP call every few seconds, and the balance between data reads and writes is close to 50/50, which makes a write-through cache and other common scaling approaches less effective.

Starting from a rather classic Rails application, we gradually changed it as traffic grew in order to meet the required performance. And when small changes were no longer enough, we turned parts of our data persistence layer inside out, migrating from SQL to NoSQL without taking downtimes longer than a few minutes.

Follow the problems we hit, how we diagnosed them, and how we got around limitations. See which tools we found useful and what other lessons we learned by running the system with a team of just two developers, without a sysadmin or operations team as support.


1,000,000 daily users and no cache Presentation Transcript

  • 1. O’Reilly RailsConf, 2011-05-18
  • 2. Who is that guy? Jesper Richter-Reichhelm / @jrirei Berlin, Germany Head of Engineering @ wooga
  • 3. Wooga does social games
  • 4. Wooga has dedicated game teams Coming soon PHP Ruby Ruby Erlang Cloud Cloud Bare metal Cloud
  • 5. Flash client sends state changes to backend Flash client Ruby backend
  • 6. Social games need to scale quite a bit 1 million daily users 200 million daily HTTP requests
  • 7. Social games need to scale quite a bit 1 million daily users 200 million daily HTTP requests 100,000 DB operations per second 40,000 DB updates per second
  • 8. Social games need to scale quite a bit 1 million daily users 200 million daily HTTP requests 100,000 DB operations per second 40,000 DB updates per second … but no cache
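As a rough sanity check on those numbers (average rate only; the peak factor is not given in the talk):

```ruby
# Back-of-the-envelope average from the numbers on the slides.
requests_per_day = 200_000_000
seconds_per_day  = 24 * 60 * 60  # 86,400

requests_per_day / seconds_per_day  # => 2314 requests/s on average
```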
  • 9. A journey to 1,000,000 daily users Start of the journey 6 weeks of pain Paradise Looking back
  • 10. We used two MySQL shards right from the start data_fabric gem Easy to use Sharding by Facebook’s user id Sharding on controller level
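The talk doesn’t show the sharding code itself; as a minimal sketch, assuming the shard is derived from the Facebook user id by modulo (names and shard count below are illustrative, not from the deck):

```ruby
# Hypothetical shard selection by Facebook uid.
SHARDS = %w[shard0 shard1]  # two MySQL shards, as on the slide

def shard_for(fb_uid)
  # Same uid always lands on the same shard.
  SHARDS[fb_uid % SHARDS.size]
end

shard_for(12345)  # => "shard1"
```

With data_fabric the shard name picked this way would be activated per request at the controller level, matching "sharding on controller level" above.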
  • 11. Scaling application servers was very easy Running at Amazon EC2 Scalarium for automation and deployment Role based Chef recipes for customization Scaling app servers up and out was simple
  • 12. Master-slave replication for DBs worked fine lb app app app db db db db slave slave
  • 13. We added a few application servers over time lb app app app app app app app app app db db db db slave slave
  • 14. Basic setup worked well for 3 months May Jun Jul Aug Sep Oct Nov Dec
  • 15. A journey to 1,000,000 daily users Start of the journey 6 weeks of pain Paradise Looking back
  • 16. SQL queries generated by Rubyamf gem Rubyamf serializes returned objects Wrong config => it also loads associations
  • 17. SQL queries generated by Rubyamf gem Rubyamf serializes returned objects Wrong config => it also loads associations Really bad but very easy to fix
  • 18. New Relic shows what happens during an API call
  • 19. More traffic using the same cluster lb app app app app app app app app app db db db db slave slave
  • 20. Our DB problems began May Jun Jul Aug Sep Oct Nov Dec
  • 21. MySQL hiccups were becoming more frequent Everything was running fine ... ... and then DB throughput went down to 10% After a few minutes everything stabilizes again ... ... just to repeat the cycle 20 minutes later!
  • 22. ActiveRecord’s checks caused 20% extra DB load Check connection state before each SQL query MySQL process list full of ‘status’ calls
  • 23. ActiveRecord’s status checks caused 20% extra DB load Check connection state before each SQL query MySQL process list full of ‘status’ calls
  • 24. I/O on MySQL masters still was the bottleneck New Relic shows DB table usage 60% of all UPDATEs were done on the ‘tiles’ table
  • 25. New Relic shows most used tables
  • 26. Tiles are part of the core game loop Core game loop: 1) plant 2) wait 3) harvest All are operations on the Tiles table.
  • 27. We started to shard on model, too We put this table in 2 extra shards old master old slave
  • 28. We started to shard on model, too We put this table in 2 extra shards 1) Set up new masters as slaves of old ones old master old slave new master
  • 29. We started to shard on model, too We put this table in 2 extra shards 1) Set up new masters as slaves of old ones old master old slave new master new slave
  • 30. We started to shard on model, too We put this table in 2 extra shards 1) Set up new masters as slaves of old ones 2) App servers start using new masters, too old master old slave new master new slave
  • 31. We started to shard on model, too We put this table in 2 extra shards 1) Set up new masters as slaves of old ones 2) App servers start using new masters, too 3) Cut replication old master old slave new master new slave
  • 32. We started to shard on model, too We put this table in 2 extra shards 1) Set up new masters as slaves of old ones 2) App servers start using new masters, too 3) Cut replication 4) Truncate unused tables old master old slave new master new slave
  • 33. 8 DBs and a few more servers lb app app app app app app app app app app app app app app app app tiles tiles db db db db db db tiles tiles slave slave slave slave
  • 34. Doubling the amount of DBs didn’t fix it May Jun Jul Aug Sep Oct Nov Dec
  • 35. We improved our MySQL setup RAID-0 of EBS volumes Using XtraDB instead of vanilla MySQL Tweaking my.cnf innodb_flush_log_at_trx_commit innodb_flush_method
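The slide names the two InnoDB settings but not the chosen values; a my.cnf fragment with commonly used values for this kind of tuning (the values are an assumption, not from the talk):

```ini
[mysqld]
# Flush the log to disk once per second instead of at every commit
# (trades up to ~1 s of durability for much less write I/O).
innodb_flush_log_at_trx_commit = 2
# Bypass the OS page cache for data files.
innodb_flush_method = O_DIRECT
```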
  • 36. Data-fabric gem circumvented AR’s internal cache 2x Tile.find_by_id(id) => 1x SELECT …
  • 37. Data-fabric gem circumvented AR’s internal cache 2x Tile.find_by_id(id) => 1x SELECT …
  • 38. 2 + 2 masters and still I/O was not fast enough We were getting desperate: “If 2 + 2 is not enough, ... … perhaps 4 + 4 masters will do?”
  • 39. It’s no fun to handle 16 MySQL DBs lb app app app app app app app app app app app app app app app app app app tiles tiles db db db db db db tiles tiles slave slave slave slave
  • 40. It’s no fun to handle 16 MySQL DBs lb app app app app app app app app app app app app app app app app app app tiles tiles tiles tiles db db db db db db db db db db db db tiles tiles tiles tiles slave slave slave slave slave slave slave slave
  • 41. We were at a dead end with MySQL May Jun Jul Aug Sep Oct Nov Dec
  • 42. I/O remained the bottleneck for MySQL UPDATEs Peak throughput overall: 5,000 writes/s Peak throughput single master: 850 writes/s We could get to ~1,000 writes/s … … but then what?
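“But then what?” becomes concrete when the write load from slide 7 is set against the per-master ceiling (a rough check, not a calculation from the talk):

```ruby
# Even at the optimistic per-master rate, the peak write load
# from the scale slide would need an unmanageable master count.
updates_per_second    = 40_000  # DB updates/s at peak (slide 7)
per_master_optimistic = 1_000   # "we could get to ~1,000 writes/s"

updates_per_second / per_master_optimistic  # => 40 masters
```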
  • 43. Pick the right tool for the job!
  • 44. Redis is fast but goes beyond simple key/value Redis is a key-value store Hashes, Sets, Sorted Sets, Lists Atomic operations like set, get, increment 50,000 transactions/s on EC2 Writes are as fast as reads
  • 45. Shelf tiles: An ideal candidate for using Redis Shelf tiles: { plant1 => 184, plant2 => 141, plant3 => 130, plant4 => 112, … }
  • 46. Shelf tiles: An ideal candidate for using Redis Redis Hash HGETALL: load whole hash HGET: load a single key HSET: set a single key HINCRBY: increase value of single key …
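The hash commands map one-to-one onto the shelf structure. A pure-Ruby sketch of their semantics (with the redis-rb gem against a live server these would be redis.hgetall, redis.hget, redis.hset and redis.hincrby, and HINCRBY is atomic on the server):

```ruby
# Simulating the Redis hash commands on a shelf-tiles hash in plain Ruby.
shelf = { "plant1" => 184, "plant2" => 141, "plant3" => 130 }

shelf                    # HGETALL shelf        -> the whole hash
shelf["plant2"]          # HGET shelf plant2    -> 141
shelf["plant4"] = 112    # HSET shelf plant4 112

# HINCRBY shelf plant1 1 -> increment one field (atomic in Redis)
shelf["plant1"] = shelf.fetch("plant1", 0) + 1
shelf["plant1"]          # => 185
```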
  • 47. Migrate on the fly when accessing new model
  • 48. Migrate on the fly - but only once true if id could be added else false
  • 49. Migrate on the fly - and clean up later 1. Let this run until everything cools down 2. Then migrate the rest manually 3. Remove the migration code 4. Wait until no fallback is necessary 5. Remove the SQL table
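The deck shows this pattern only as code screenshots; a minimal self-contained sketch under stated assumptions (SQL, REDIS and MIGRATED are stand-ins for the real stores; in production the “only once” guard would be a Redis SADD, whose “true if id could be added” return value plays the role of Set#add? here):

```ruby
require "set"

SQL      = { 42 => { "plant1" => 184 } }  # old MySQL rows (stub)
REDIS    = {}                             # new Redis hashes (stub)
MIGRATED = Set.new                        # "true if id could be added"

def load_shelf(uid)
  # New model first; fall back to SQL and migrate on the fly.
  return REDIS[uid] if REDIS.key?(uid)
  data = SQL[uid]
  # Migrate only once: add? returns nil unless the id is newly added,
  # so concurrent requests cannot migrate the same user twice.
  REDIS[uid] = data if MIGRATED.add?(uid)
  data
end

load_shelf(42)  # first call migrates; later calls are served from REDIS
```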
  • 50. Migrations on the fly all look the same Typical migration throughput over 3 days Initial peak at 100 migrations / second
  • 51. A journey to 1,000,000 daily users Start of the journey 6 weeks of pain Paradise (or not?) Looking back
  • 52. Again: Tiles are part of the core game loop Core game loop: 1) plant 2) wait 3) harvest All are operations on the Tiles table.
  • 53. Size matters for migrations Migration check overloaded Redis Migration only on startup We overlooked an edge case Next time only migrate 1% of users Migrate the rest when everything worked out
  • 54. In-memory DBs don’t like saving to disk You still need to write data to disk eventually SAVE is blocking BGSAVE needs free RAM Dumping on master increased latency by 100%
  • 55. In-memory DBs don’t like to dump to disk You still need to write data to disk eventually SAVE is blocking BGSAVE needs free RAM Dumping on master increased latency by 100% Running BGSAVE on slaves every 15 minutes
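One possible way to schedule those dumps, assuming redis-cli is available and a slave host named redis-slave1 (the hostname and the cron approach are assumptions; the talk only states the 15-minute interval):

```crontab
# Trigger a background dump on the slave every 15 minutes,
# keeping BGSAVE's fork-and-copy load off the master.
*/15 * * * * redis-cli -h redis-slave1 BGSAVE
```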
  • 56. Replication puts Redis master under load Replication on master starts with a BGSAVE Not enough free RAM => big problem Master queues requests that slave cannot process Restoring dump on slaves is blocking
  • 57. Redis has a memory fragmentation problem 24 GB grew to 44 GB in 8 days
  • 58. Redis has a memory fragmentation problem 24 GB grew to 38 GB in 3 days
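The bloat rates implied by those two measurements (simple arithmetic on the figures shown; the talk doesn’t compute them):

```ruby
# Memory growth per day from the two fragmentation slides.
(44 - 24) / 8.0  # => 2.5 GB of bloat per day
(38 - 24) / 3.0  # roughly 4.7 GB per day later on
```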
  • 59. If MySQL is a truck... Fast enough for reads Can store on disk Robust replication http://www.flickr.com/photos/erix/245657047/
  • 60. If MySQL is a truck, Redis is a Ferrari Fast enough for reads Super fast reads/writes Can store on disk Out of memory => dead Robust replication Fragile replication http://www.flickr.com/photos/erix/245657047/
  • 61. Big and static data in MySQL, rest goes to Redis Fast enough for reads Super fast reads/writes Can store on disk Out of memory => dead Robust replication Fragile replication 256 GB data 60 GB data 10% writes 50% writes http://www.flickr.com/photos/erix/245657047/
  • 62. 95% of operation effort is handling DBs! (cluster diagram: load balancers, many app servers, MySQL DBs and Redis instances, each with slaves)
  • 63. We fixed all our problems - we were in Paradise! May Jun Jul Aug Sep Oct Nov Dec
  • 64. A journey to 1,000,000 daily users Start of the journey 6 weeks of pain Paradise (or not?) Looking back
  • 65. Data handling is most important How much data will you have? How often will you read/update that data? What data is in the client, what in the server? It’s hard to “refactor” data later on
  • 66. Monitor and measure your application closely On application level / on OS level Learn your app’s normal behavior Store historical data so you can compare later React if necessary!
  • 67. Listen closely and answer truthfully Q & A Jesper Richter-Reichhelm @jrirei slideshare.net/wooga
  • 68. If you are in Berlin just drop by Thank you Jesper Richter-Reichhelm @jrirei slideshare.net/wooga