Ammon Bartram
• Wrote the iOS Socialcam app
• Previously lead video engineer at Justin.tv
• Used to own a cow named Snork Maiden
Guillaume Luccisano
• Built the Socialcam backend
• Previously worked on Rails performance at Justin.tv
• French dude exploring the valley
We grew explosively... and survived. Here is how.
We’d just finished Y Combinator and were growing quickly.
Simple Rails stack: 15 servers in San Francisco, Postgres/Mongo DBs.
Everything was working
Then...
BOOM! True viral growth.
30k new users an hour and growing... Things start to break.
Not enough CPU!
Day 1
Add boxes in the colo? We can’t: little space in the colo, and servers take 2 weeks to arrive.
Add EC2 instances? How can servers on AWS talk to our database? SSH tunnels! Then a VPN!
Are video comments essential? No. OK, kill them!
Is non-US traffic essential? No. OK, kill it!
Yeah baby! Kill switches are born!
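A kill switch can be as small as a flag check in front of each expensive feature. A minimal sketch, assuming a Redis set as the flag store; the module and key names here are ours, not Socialcam's actual code:

require "redis"

# Hypothetical kill-switch helper: feature flags live in a Redis set so they
# can be flipped at runtime without a deploy.
module KillSwitch
  REDIS = Redis.new

  def self.killed?(feature)
    REDIS.sismember("killed_features", feature.to_s)
  end

  def self.kill!(feature)
    REDIS.sadd("killed_features", feature.to_s)
  end

  def self.revive!(feature)
    REDIS.srem("killed_features", feature.to_s)
  end
end

# In a controller, shed the feature instead of the whole site:
# return head(:service_unavailable) if KillSwitch.killed?(:video_comments)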
Bare-bones deployment... push code live to the production boxes (and no more test suite :p).
It’s 9am, time to sleep... Got back to work the same day at 11am... The site is up, but not healthy!
Day 2
Move things to workers!
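The deck doesn't name the queueing library; purely as an illustration, here is what moving slow work onto a Resque-style background worker looks like:

require "resque"

# Hypothetical job: anything slow (notifications, thumbnails, analytics) moves
# out of the web request and onto worker boxes that drain a queue.
class NotifyFollowersJob
  @queue = :notifications

  def self.perform(video_id)
    # look up the video and fan out notifications here
  end
end

# In the request path, enqueue instead of doing the work inline:
# Resque.enqueue(NotifyFollowersJob, video.id)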
Add more monitoring
Add instances. Amazon has resource limits. We hit them. Emergency late-night calls. Thanks, Tim!
DB is breaking: too many reads!
Buy more time!
We run our main DB in our own colo - a big SSD box.
We order faster drives and more memory (on amazon.com :p).
And take down the site to upgrade. It’s bad, but sometimes it’s OK!
Caching!
Purpose of caching: minimize backend requests per user request.
Common caching techniques:
• Static full-page caching
• Server-side includes
• Ajax loading
• Fragment caching
Couldn’t do static :/ The API and website were too dynamic.
We used a hybrid solution, using a combination of them all depending on what makes sense.
API list requests are cached with a templating system (à la server-side includes).
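Their templating system itself isn't shown in the deck; the rough idea, sketched here in our own code, is to cache each item's rendered JSON once and only stitch the list together per request:

# Sketch: render (and invalidate) each video's JSON once, assemble lists on
# the fly - the server-side-include idea applied to an API response.
def render_video_list(video_ids)
  fragments = video_ids.map do |id|
    Rails.cache.fetch("video_json/#{id}") do
      Video.find(id).to_json
    end
  end
  "[" + fragments.join(",") + "]"
end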
We make heavy use of fragment caching on the website.
Video lists are loaded through Ajax on the website.
Object caching: reduce database hits per backend request.
We ❤ Memcache - a super-fast, in-memory key/value store.
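Object caching through Rails' cache layer is a one-liner; a generic example (the key naming is ours):

# Serve a Video object from Memcache when possible, hitting Postgres only on a
# cache miss.
def cached_video(id)
  Rails.cache.fetch("video/#{id}") do
    Video.find(id)
  end
end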
Optimizing Memcache ;) Our custom tricks!
Avoid key expiration!
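The slide doesn't spell the trick out; one common reading (an assumption on our part) is to refresh cached values explicitly when the data changes, rather than letting TTLs expire hot keys and stampede the database:

# Write-through refresh: update the cached copy on save instead of relying on
# a TTL to expire it and force a miss later.
class Video < ActiveRecord::Base
  after_save :refresh_cache

  def refresh_cache
    Rails.cache.write("video/#{id}", self)
  end
end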
Use get multi: Video.get_by_ids([1, 2, 3, 4, 5]) = one Memcache call.
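Video.get_by_ids is Socialcam's method; a plausible body (ours, not theirs) batches the cache reads into one round trip and backfills misses with a single query:

class Video < ActiveRecord::Base
  def self.get_by_ids(ids)
    keys    = ids.map { |id| "video/#{id}" }
    cached  = Rails.cache.read_multi(*keys)       # one memcache call
    missing = ids.reject { |id| cached.key?("video/#{id}") }
    fresh   = where(id: missing).index_by(&:id)   # one SQL query for the misses
    fresh.each { |id, video| Rails.cache.write("video/#{id}", video) }
    ids.map { |id| cached["video/#{id}"] || fresh[id] }
  end
end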
Compress and serialize!
Cache = Rails.cache.instance_variable_get("@data")
SnappyCache = Memcached::SnappyCache.new(Cache)
SnappyYajlCache = Memcached::SnappyYajlCache.new(Cache)
SnappyMarshalCache = Memcached::SnappyMarshalCache.new(Cache)
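Memcached::SnappyCache and friends are Socialcam's own wrappers around the raw client pulled out of Rails.cache above. A toy equivalent of ours (using the snappy and yajl-ruby gems) just compresses the serialized value before storing it:

require "snappy"
require "yajl"

# Toy wrapper in the spirit of SnappyYajlCache: JSON-encode the value, then
# Snappy-compress it, so less data crosses the wire and sits in memory.
class SnappyJsonCache
  def set(key, value)
    Rails.cache.write(key, Snappy.deflate(Yajl::Encoder.encode(value)))
  end

  def get(key)
    blob = Rails.cache.read(key)
    blob && Yajl::Parser.parse(Snappy.inflate(blob))
  end
end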
DB is breaking again: too many writes!
(And Mongo doesn’t scale)
For us...
Redis to the rescue!
Redis is like Memcache, but supports complex structures: lists, sets, sorted sets, etc.
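Those structures map directly onto redis-rb calls; for example (the keys here are invented for illustration):

require "redis"
redis = Redis.new

redis.lpush("feed:42", 1001)              # list: newest video ids first
redis.sadd("followers:42", 7)             # set: who follows user 42
redis.zadd("popular", 250, "video:1001")  # sorted set: videos scored by view count
redis.zrevrange("popular", 0, 9)          # top 10 videos by score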
Initially, we stored videos in Postgres and computed feeds per request. Then we de-normalized them into Mongo...
This broke.
We switched to a canonical, normalized list of videos stored in PostgreSQL...
Coupled with a de-normalized video feed for active users, stored in Redis.
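A sketch of how such a de-normalized feed can work - fan-out on write, capped per user, and only for active users. The helper names, associations, and the $redis global are our assumptions:

FEED_LENGTH = 200

# When a video is posted, push its id onto each active follower's Redis list;
# Postgres remains the canonical store, Redis is just the hot read path.
def fan_out_video(video)
  video.user.follower_ids.each do |follower_id|
    next unless active_user?(follower_id)   # hypothetical "is this user active?" check
    key = "feed:#{follower_id}"
    $redis.lpush(key, video.id)
    $redis.ltrim(key, 0, FEED_LENGTH - 1)   # cap the feed length
  end
end

# Reading the feed is one Redis call plus a cached multi-get on the ids:
# Video.get_by_ids($redis.lrange("feed:#{user_id}", 0, 49))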
The Mongo problem was fixed, but all was still not well.
We had 1 billion rows stored in Postgres (and there was only 25GB left on our main DB).
Postgres could not take it
Joins are evil. We replaced them with application logic.
Then you can move tables to their own DBs.
(Don’t tell anyone: it’s OK to duplicate some data to kill joins.)
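Concretely, killing a join means issuing two indexed queries from the application, which then no longer need to live in the same database (a generic example, not Socialcam's schema):

# Instead of: SELECT videos.* FROM videos JOIN relations ON videos.user_id = relations.followed_id ...
# do two lookups that can each hit a different server:
followed_ids = Relation.where(follower_id: user.id).pluck(:followed_id)
videos       = Video.where(user_id: followed_ids).order("created_at DESC").limit(50)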
It was not enough! The followers table was too big for one server :/
Time to shard
Sharding is cutting a table into multiple pieces, each on its own server.
Application logic routes queries to the right shard based on some field (id, name, country...).
You can make that simple or complex.
We decided to go simple :) user_id % N
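With user_id % N the routing is pure arithmetic; a minimal sketch (the table naming and the value of N are made up for illustration):

NUM_SHARDS = 16   # N - an illustrative value

# Map a user to one of N physical tables (relations_0 ... relations_15), each
# of which can live on its own database server.
def relation_table_for(user_id)
  "relations_#{user_id % NUM_SHARDS}"
end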
Relation.shard(user_id).count
With dynamic migration:
class Relation
  def self.shard(user_id)
    if Redis.sismember("relation_shard", user_id)
      # return new correct sharded table
    else
      # return old relation table
    end
  end
end
To conclude: we survived!
By... making heavy use of AWS and open source. Postgres is awesome. Redis is awesome. HAProxy is awesome.
By not fearing hacks and not being afraid of breaking things.
By turning off expensive, non-vital features (or entire countries; sorry, Brazil!).
By calling in help from friends.
And most importantly...
By only fixing things that were broken.

STP201 Efficiency at Scale - AWS re:Invent 2012

In May of 2012, Socialcam exploded, gaining tens of millions of new users in just a few weeks. At the time, the service ran on 15 servers in a co-location facility in San Francisco. To meet new user traffic demands and continue to deliver maximum user satisfaction, Socialcam made the move to cloud services. With only two engineers and a constant barrage of users, there was limited time for technical transition, but Socialcam endured with no significant downtime. In this technical session, Socialcam co-founders Guillaume Luccisano and Ammon Bartram talk about their experience scaling Socialcam. They present the challenges they encountered, how they addressed them, and the technologies they used in the process. They focus particularly on how they used Amazon services in conjunction with their own hardware to keep Socialcam active with no significant downtime and no costly system redesign.
