How a Small Team Scales Instagram

Video and slides synchronized; mp3 and slide download available at http://bit.ly/1aNvLOQ.

Mike Krieger discusses Instagram's best and worst infrastructure decisions, and how a small team builds and deploys scalable, extensible services. Filmed at qconsf.com.

Mike Krieger (@mikeyk) graduated from Stanford University, where he studied Symbolic Systems with a focus on Human-Computer Interaction. During his undergraduate studies he interned on Microsoft's PowerPoint team, and after graduating he worked at Meebo for a year and a half as a user experience designer and front-end engineer before joining the Instagram team to do design and development.

  1. A Brief, Rapid History of Scaling Instagram (with a tiny team). Mike Krieger, QConSF 2013
  4. Hello!
  5. Instagram
  6. 30 million with 2 eng (2010-end 2012)
  7. 150 million with 6 eng (2012-now)
  8. How we scaled
  9. What I would have done differently
  10. What tradeoffs you make when scaling with that size team
  11. (if you can help it, have a bigger team)
  12. perfect solutions
  13. survivor bias
  14. decision-making process
  15. Core principles
  16. Do the simplest thing first
  17. Every infra moving part is another “thread” your team has to manage
  18. Test & Monitor Everything
  19. This talk: Early days; Year 1: Scaling Up; Year 2: Scaling Out; Year 3-present: Stability, Video, FB
  20. Getting Started
  21. 2010: 2 guys on a pier
  22. no one <3s it
  23. Focus
  24. Mike: iOS, Kevin: Server
  25. Early Stack: Django + Apache mod_wsgi, Postgres, Redis, Gearman, Memcached, Nginx
  26. If today: Django + uWSGI, Postgres, Redis, Celery, Memcached, HAProxy
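
As a rough, hedged illustration of the "if today" stack on slide 26, the Django-facing pieces might be wired up like this; the host names, ports, and Celery setting are assumptions, not Instagram's actual configuration (uWSGI and HAProxy sit outside Django entirely):

```python
# settings.py sketch: illustrative only, not Instagram's real config.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "app",
        "HOST": "pg-primary.example.internal",   # assumed host
        "PORT": "6432",                          # e.g. fronted by PgBouncer (slide 83)
    }
}

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": ["mc1.example.internal:11211", "mc2.example.internal:11211"],
    }
}

# Celery in place of Gearman for async work; the exact setting name varies by Celery version.
BROKER_URL = "redis://redis1.example.internal:6379/0"
```
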
  27. Three months later
  28. Server planning night before launch
  29. Traction!
  30. Year 1: Scaling Up
  31. scaling.enable()
  32. Single server in LA
  33. infra newcomers
  34. “What’s a load average?”
  35. “Can we get another server?”
  36. Doritos & Red Bull & Animal Crackers & Amazon EC2
  37. Underwater on recruiting
  38. 2 total engineers
  39. Scale "just enough" to get back to working on app
  40. Every weekend was an accomplishment
  41. “Infra is what happens when you’re busy making other plans” —Ops Lennon
  42. Scaling up DB
  43. First bottleneck: disk IO on old Amazon EBS
  44. At the time: ~400 IOPS max
  45. Simple thing first
  46. Vertical partitioning
  47. Django DB Routers
  48. Partitions: Media, Likes, Comments, Everything else
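
Slides 46-48 describe vertical partitioning driven by Django DB routers. A minimal sketch of that idea follows; the app labels and database aliases are illustrative, not the real ones:

```python
# Route each high-volume Django app to its own Postgres box; everything else
# stays on the default database. Aliases and app labels here are assumptions.
PARTITIONS = {
    "media": "media_db",
    "likes": "likes_db",
    "comments": "comments_db",
}

class VerticalPartitionRouter(object):
    def _alias_for(self, model):
        return PARTITIONS.get(model._meta.app_label, "default")

    def db_for_read(self, model, **hints):
        return self._alias_for(model)

    def db_for_write(self, model, **hints):
        return self._alias_for(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow ORM relations within a single partition.
        return self._alias_for(type(obj1)) == self._alias_for(type(obj2))

# settings.py: DATABASE_ROUTERS = ["routers.VerticalPartitionRouter"]
```

Because the split follows app boundaries, it needed almost no application logic changes (slide 51).
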
  49. PG Replication to bootstrap nodes
  50. Bought us some time
  51. Almost no application logic changes (other than some primary keys)
  52. Today: SSD and provisioned IOPS get you way further
  53. Scaling up Redis
  54. Purely RAM-bound
  55. fork() and COW
  56. Vertical partitioning by data type
  57. No easy migration story; mostly double-writing
  58. Replicating + deleting often leaves fragmentation
  59. Chaining replication = awesome
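
The "mostly double-writing" migration on slide 57 might look roughly like this with redis-py; the key shape and host names are made up for illustration:

```python
import redis

old_node = redis.StrictRedis(host="redis-old.example.internal", port=6379)
new_node = redis.StrictRedis(host="redis-new.example.internal", port=6379)

def add_follower(user_id, follower_id):
    key = "followers:%d" % user_id
    old_node.sadd(key, follower_id)   # old node stays the source of truth...
    new_node.sadd(key, follower_id)   # ...while the replacement is warmed up

def get_followers(user_id):
    # Reads cut over only once the new node holds the full working set.
    return old_node.smembers("followers:%d" % user_id)
```
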
  60. Scaling Memcached
  61. Consistent hashing / ketama
  62. Mind that hash function
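
For consistent hashing / ketama and the "mind that hash function" warning (slides 61-62), here is a hedged sketch with pylibmc, which exposes libmemcached's ketama behavior; the server list and failover tuning are illustrative:

```python
import pylibmc

mc = pylibmc.Client(
    ["mc1.example.internal", "mc2.example.internal", "mc3.example.internal"],
    binary=True,
    behaviors={
        "ketama": True,       # consistent hashing: adding a node remaps ~1/N of keys
        "remove_failed": 1,   # eject a node after repeated failures
        "retry_timeout": 1,
        "dead_timeout": 60,
    },
)

mc.set("user:12345:profile", {"username": "mikeyk"}, time=300)
profile = mc.get("user:12345:profile")
```

The point of minding the hash function: every client must agree on it, or the same key lands on different servers depending on which app host asks.
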
  63. Why not Redis for kv caching?
  64. Slab allocator
  65. Config Management & Deployment
  66. fabric + parallel git pull (sorry GitHub)
  67. All AMI based snapshots for new instances
  68. update_ami.sh
  69. update_update_ami.sh
  70. Should have done Chef earlier
  71. Munin monitoring
  72. df, CPU, iowait
  73. Ending the year
  74. Infra going from 10% time to 70%
  75. Focus on client
  76. Testing & monitoring kept concurrent fires to a minimum
  77. Several ticking time bombs
  78. Year 2: Scaling Out
  79. App tier
  80. Stateless, but plentiful
  81. HAProxy (dead node detection)
  82. Connection limits everywhere
  83. PGBouncer; homegrown Redis pool
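
"Connection limits everywhere" (slides 82-83) meant PgBouncer in front of Postgres and a homegrown pool for Redis. As a stand-in for the homegrown part, redis-py's blocking pool shows the same idea; the numbers and host are illustrative:

```python
import redis

# Cap connections per process; when the pool is exhausted, callers wait
# (up to 2s) instead of stampeding the Redis server with new connections.
pool = redis.BlockingConnectionPool(
    host="redis-feed.example.internal",
    port=6379,
    max_connections=20,
    timeout=2,
)
r = redis.Redis(connection_pool=pool)

r.lpush("feed:12345", "media:999")
```
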
  84. Hard to track down kernel panics
  85. Skip rabbit hole; use instance status to detect and restart
  86. Database Scale Out
  87. Out of IO again (pre-SSDs)
  88. Biggest mis-step
  89. NoSQL?
  90. Call our friends
  91. and strangers
  92. Theory: partitioning and rebalancing are hard to get right, let DB take care of it
  93. MongoDB (1.2 at the time)
  94. Double write, shadow reads
  95. Stressing about Primary Key
  96. Placed in prod
  97. Data loss, segfaults
  98. Could have made it work…
  99. …but it would have been someone’s full time job
  100. (and we still only had 3 people)
  101. train + rapidly approaching cliff
  102. Sharding in Postgres
  103. QCon to the rescue
  104. Similar approach to FB (infra foreshadowing?)
  105. Logical partitioning, done at application level
  106. Simplest thing; skipped abstractions & proxies
  107. Pre-split
  108. 5000 partitions
  109. note to self: pick a power of 2 next time
  110. Postgres "schemas"
  111. database > schema > table > columns
  112. machineA: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user
  113. machineA: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user; machineA’: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user
  114. machineA: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user; machineA’: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user
  115. Still how we scale PG today
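
Slides 105-114 sketch app-level logical sharding onto Postgres schemas. A minimal illustration of the mapping; the host list and the user_id modulo placement are assumptions, since the slides only pin down the 5000 pre-split shards and the schema-per-shard layout:

```python
SHARD_COUNT = 5000   # slide 108 (and slide 109: a power of 2 would have been nicer)

# Illustrative: which physical Postgres host serves each block of logical shards.
SHARD_HOSTS = ["pg-a.internal", "pg-b.internal", "pg-c.internal", "pg-d.internal"]

def shard_for_user(user_id):
    return user_id % SHARD_COUNT

def host_for_shard(shard):
    # Rebalancing means copying schemas to a new box and updating this mapping.
    return SHARD_HOSTS[shard * len(SHARD_HOSTS) // SHARD_COUNT]

def photos_table(user_id):
    # Each logical shard is a Postgres schema, e.g. shard0042.photos_by_user
    return "shard%04d.photos_by_user" % shard_for_user(user_id)
```
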
  116. 9.2 upgrade: bucardo to move schema by schema
  117. ID generation
  118. Requirements: no extra moving parts, 64 bits max, time ordered, containing partition key
  119. 41 bits: time in millis (41 years of IDs); 13 bits: logical shard ID; 10 bits: auto-incrementing sequence, modulo 1024
  120. This means we can generate 1024 IDs, per shard, per table, per millisecond
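
Slide 119's 64-bit layout, sketched in Python. The custom epoch and the in-process counter are illustrative stand-ins; per the slide, the real sequence is a per-shard, per-table auto-increment taken modulo 1024:

```python
import time

EPOCH_MS = 1293840000000   # illustrative custom epoch (2011-01-01 UTC)
_sequence = 0              # stand-in for the per-shard, per-table sequence

def next_id(shard_id):
    global _sequence
    _sequence += 1
    millis = int(time.time() * 1000) - EPOCH_MS
    # [41 bits time][13 bits shard][10 bits sequence]
    return (millis << 23) | ((shard_id & 0x1FFF) << 10) | (_sequence % 1024)

def shard_of(some_id):
    # The partition key can be recovered from the ID itself, with no lookup.
    return (some_id >> 10) & 0x1FFF
```

Time-ordered IDs keep newest-first queries cheap, and carrying the shard ID satisfies the "containing partition key" requirement from slide 118.
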
  121. Lesson learned
  122. A new db is a full time commitment
  123. Be thrifty with your existing tech
  124. = minimize moving parts
  125. Scaling configs/host discovery
  126. ZooKeeper or DNS server?
  127. No team to maintain
  128. /etc/hosts
  129. ec2tag KnownAs
  130. fab update_etc_hosts (generates, deploys)
  131. Limited: dead host failover, etc
  132. But zero additional infra, got the job done, easy to debug
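
A hedged reconstruction of the fab update_etc_hosts idea from slides 129-130, using the era's boto EC2 API and Fabric 1.x; the region, file names, and tag handling are assumptions:

```python
import boto.ec2
from fabric.api import put, task

@task
def update_etc_hosts():
    # Read every instance's "KnownAs" tag and generate an /etc/hosts file.
    conn = boto.ec2.connect_to_region("us-east-1")
    lines = ["127.0.0.1 localhost"]
    for reservation in conn.get_all_instances():
        for inst in reservation.instances:
            name = inst.tags.get("KnownAs")
            if name and inst.private_ip_address:
                lines.append("%s %s" % (inst.private_ip_address, name))
    with open("hosts.generated", "w") as f:
        f.write("\n".join(lines) + "\n")
    # Push it to the current host in the Fabric host list (run with -H or env.hosts).
    put("hosts.generated", "/etc/hosts", use_sudo=True)
```
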
  133. Monitoring
  134. Munin: too coarse, too hard to add new stats
  135. StatsD & Graphite
  136. Simple tech
  137. statsd.timer, statsd.incr
  138. Step change in developer attitude towards stats
  139. <5 min from wanting to measure, to having a graph
  140. 580 statsd counters, 164 statsd timers
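
The statsd.timer / statsd.incr calls from slide 137, shown with the Python statsd client; the daemon address, prefix, and metric names are illustrative:

```python
import statsd

stats = statsd.StatsClient("statsd.example.internal", 8125, prefix="ig")

def process_and_store(photo):
    ...  # placeholder: resize, write to storage, fan out

def upload_photo(photo):
    stats.incr("photos.uploaded")              # counter
    with stats.timer("photos.process_time"):   # timer, graphed within minutes
        process_and_store(photo)
```
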
  141. Ending the year
  142. Launched Android
  143. (doubling all of our infra, most of which was now horizontally scalable)
  144. Doubled active users in < 6 months
  145. Finally, slowly, building up team
  146. Year 3+: Stability, Video, FB
  147. Scale tools to match team
  148. Deployment & Config Management
  149. Finally 100% on Chef
  150. Simple thing first: knife and chef-solo
  151. Every new hire learns Chef
  152. Code deploys
  153. Many rollouts a day
  154. Continuous integration
  155. But push still needs a driver
  156. "Ops Lock"
  157. Humans are terrible distributed locking systems
  158. Sauron
  159. Redis-enforced locks
  160. Rollout / major config changes / live deployment tracking
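
Sauron itself is internal, but the Redis-enforced ops lock of slides 156-159 could look roughly like this; the key name, TTL, and host are assumptions:

```python
import redis

r = redis.StrictRedis(host="redis-ops.example.internal", decode_responses=True)

def acquire_ops_lock(owner, ttl=600):
    # SET NX EX: only one push driver at a time, and the lock frees itself
    # (TTL) if a deploy dies mid-push.
    return bool(r.set("ops:deploy_lock", owner, nx=True, ex=ttl))

def release_ops_lock(owner):
    # Good enough for humans coordinating pushes; a strict release would use a
    # compare-and-delete Lua script to avoid racing a new lock holder.
    if r.get("ops:deploy_lock") == owner:
        r.delete("ops:deploy_lock")
```
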
  161. Extracting approach: hit issue → develop manual approach → build tools to improve manual / hands-on approach → replace manual with automated system
  162. Monitoring
  163. Munin finally broke
  164. Ganglia for graphing
  165. Sensu for alerting (http://sensuapp.org)
  166. StatsD/Graphite still chugging along
  167. waittime: lightweight slow component tracking
  168. s = time.time() # do work statsd.incr("waittime.VIEWNAME.COMPONENT", time.time() - s)
  169. asPercent()
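
Slide 168's snippet accumulates per-component wait time into counters; here is a slightly expanded, hedged version, with Graphite's asPercent() (slide 169) shown as an illustrative target expression (the view and component names are made up):

```python
import time
import statsd

stats = statsd.StatsClient("statsd.example.internal", 8125)

def load_feed(user_id):
    ...  # placeholder for the actual Redis call

def feed_view(user_id):
    s = time.time()
    result = load_feed(user_id)
    # Counting elapsed seconds (rather than using a timer) lets Graphite show
    # each component's share of a view's total wait, e.g.:
    #   asPercent(stats.waittime.feed.redis, sumSeries(stats.waittime.feed.*))
    stats.incr("waittime.feed.redis", time.time() - s)
    return result
```
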
  170. Feeds and Inboxes
  171. Redis
  172. In memory requirement
  173. Every churned or inactive user still takes up memory
  174. Inbox moved to Cassandra
  175. 1000:1 write/read
  176. Prereq: having rbranson, ex-DataStax
  177. C* cluster is 20% of the size of the Redis one
  178. Main feed (timeline) still in Redis
  179. Knobs
  180. Dynamic ramp-ups and config
  181. Previously: required deploy
  182. knobs.py
  183. Only ints
  184. Stored in Redis
  185. Refreshed every 30s
  186. knobs.get(feature_name, default)
  187. Uses: incremental feature rollouts, dynamic page sizing (shedding load), feature killswitches
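
knobs.py is internal, but slides 182-186 pin down its shape: integer-only values, stored in Redis, refreshed every 30 seconds, read via knobs.get(feature_name, default). A minimal sketch under those constraints; the key layout and host are assumptions:

```python
# knobs.py sketch
import time
import redis

_r = redis.StrictRedis(host="redis-knobs.example.internal", decode_responses=True)
_cache = {}
_loaded_at = 0.0
REFRESH_SECONDS = 30

def get(feature_name, default):
    global _cache, _loaded_at
    if time.time() - _loaded_at > REFRESH_SECONDS:
        _cache = _r.hgetall("knobs")   # one hash holding every knob
        _loaded_at = time.time()
    try:
        return int(_cache.get(feature_name, default))   # ints only (slide 183)
    except (TypeError, ValueError):
        return default

# Example uses (slide 187):
#   if user_id % 100 < get("new_feature_pct", 0): ...   # incremental rollout
#   page_size = get("feed_page_size", 100)              # shed load dynamically
#   if not get("enable_explore", 1): return []          # killswitch
```

Because reads come from the in-process cache, a knob flip propagates everywhere within about 30 seconds, with no deploy (slide 189).
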
  188. As more teams around FB contribute
  189. Decouple deploy from feature rollout
  190. Video
  191. Launch a top 10 video site on day 1 with team of 6 engineers, in less than 2 months
  192. Reuse what we know
  193. Avoid magic middleware
  194. VXCode
  195. Separate from main App servers
  196. Django-based
  197. server-side transcoding
  198. ZooKeeper ephemeral nodes for detection
  199. (finally worth it / doable to deploy ZK)
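
Slide 198's worker discovery via ZooKeeper ephemeral nodes, sketched with the kazoo client; the ZK hosts, paths, and watch callback are illustrative:

```python
import socket
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.internal:2181,zk2.example.internal:2181")
zk.start()

# Each transcoding worker registers itself; the node disappears automatically
# if the worker dies, so clients always see the live set.
me = "/video/transcoders/%s" % socket.gethostname()
zk.create(me, b"", ephemeral=True, makepath=True)

@zk.ChildrenWatch("/video/transcoders")
def on_workers_changed(workers):
    print("live transcoders: %s" % sorted(workers))
```
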
  200. EC2 autoscaling
  201. Priority list for clients
  202. Transcoding tier is completely stateless
  203. statsd waterfall
  204. holding area for debugging bad videos
  205. 5 million videos in first day; 40h of video / hour
  206. (other than perf improvements we’ve basically not touched it since launch)
  207. FB
  208. Where can we skip a few years?
  209. (at our own pace)
  210. Spam fighting
  211. re.compile('f[o0][1l][o0]w')
  212. Simplest thing did not last
  213. Generic features + machine learning
  214. Hadoop + Hive + Presto
  215. "I wonder how they..."
  216. Two-way exchange
  217. 2010 vintage infra
  218. #1 impact: recruiting
  219. Backend team: >10 people now
  220. Wrap up
  221. Core principles
  222. Do the simplest thing first
  223. Every infra moving part is another “thread” your team has to manage
  224. Test & Monitor Everything
  225. Takeaways
  226. Recruit way earlier than you'd think
  227. Simple doesn't always imply hacky
  228. Rocketship scaling has been (somewhat) democratized
  229. Huge thanks to IG Eng Team
  230. mikeyk@instagram.com
  231. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/scaling-instagram
