Rate Limiting at Scale, from SANS AppSec Las Vegas 2012

Rate limit everything, all the time, using a quantized time system with Memcache or Redis. Use this to protect resources or discover anomalies.

Published in: Technology, Automotive

    1. Rate-Limiting at Scale, SANS AppSec Las Vegas 2012. Nick Galbreath, @ngalbreath, nickg@etsy.com
    2. Who is Etsy? Nick?
       • “Marketplace for Small Creative Businesses”
       • Alexa says #51 for USA traffic
       • > $500MM transaction volume last year
       • Billions and billions of page views
       • Nick Galbreath: Director of Engineering focusing on security, fraud, and other fun stuff
    3. What’s a Rate Limit? The maximum number of events per (brief) period per user, after which the resource is denied, e.g. “no more than 2 logins per minute”.
    4. Why?
    5. Robots Gone Wild
       • Robots / crawlers (not always an intended DDoS)
         • 20,000 items in a shopping cart
         • Spam attack!
       • Can crush sites very quickly, at almost no cost, especially when the crawl generates load or writes to the database
    6. Humans are Resources Too
       • Rate limits are needed for anything that gets reviewed by humans, such as customer service requests.
       • CRMs are typically bad at dealing with spammy stuff.
    7. Anything Involving Money
       • Without rate limits on credit card authorizations, your site becomes a card skimmer site.
       • Using a website is much easier than going to a gas station pump or other anonymous card reader.
    8. Other Behaviors
       • Password changes
       • Password resets
       • Credit card / email / bank info
    9. Do Rate Limits Stop All Fraud? No, but...
       • Eliminates false positives and punks
       • Allows you to focus on more sophisticated attacks
       • Protects against damaging bursts of activity (malicious or not)
    10. Rate limits are needed on anything that depends on an external resource. This is almost everything!
    11. Implementation
    12. Continuous Rate Limits
       • Store user identifier, event type, timestamp
       • Allows easy rate limits for multiple ranges
       • Allows easy cross-event limits
       • Easy to implement in SQL
    13. [Figure: checks looking back over sliding windows of 10m and 25m]
    14. Continuous RL Schema. Check whether your database timestamps store microseconds. You want ’em.
    15. Continuous RL Queries
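
The schema and query slides did not survive as text. A minimal sketch of the continuous approach follows, assuming a hypothetical `events` table and SQLite as a stand-in database (neither comes from the talk). It stores one row per event and counts rows inside a sliding window:

```python
import sqlite3
import time

def open_store():
    """Create an in-memory table with one row per (user, event, timestamp)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (user_id TEXT, event TEXT, ts REAL)")
    # Index so window counts do not scan the whole table.
    db.execute("CREATE INDEX events_lookup ON events (user_id, event, ts)")
    return db

def record(db, user_id, event, now=None):
    """Store one event with a sub-second timestamp."""
    now = time.time() if now is None else now
    db.execute("INSERT INTO events VALUES (?, ?, ?)", (user_id, event, now))

def count_recent(db, user_id, event, period, now=None):
    """Count this user's events in the sliding window (now - period, now]."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT COUNT(*) FROM events "
        "WHERE user_id = ? AND event = ? AND ts > ? AND ts <= ?",
        (user_id, event, now - period, now),
    ).fetchone()
    return row[0]
```

Because every event is a row, the same table answers "how many in the last 10 minutes" and "how many in the last 25 minutes" by changing only `period`. That flexibility is what the continuous approach buys, at the cost of an insert and an index update per event.
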
    16. Ouch!
       • At scale, this is really painful for databases to handle.
       • Constant binary-tree index churn
       • Use an in-memory database (or run off a ramdisk) if trying this out
    17. Quantized Rate Limits
       • Stores a count in a time window or bucket.
       • Map the current time to a bucket
       • (int) (NOW()/period), e.g. NOW()/3600 gives the hour bucket.
    18. Quantized time isn’t exact. [Figure: consecutive 10m buckets (bucket-123, bucket-124, bucket-125, ...) with checks reading different counts (2? 4, 0?) depending on where they fall]
    19. Direct Lookup
       • Everything is a primary key lookup: userid-event-period-bucketid
         • 60min: “nickg-login-3600-5589007547”
         • 10min: “nickg-login-600-33534045284”
       • Multiple time frames require multiple buckets, which means multiple inserts and checks.
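
The bucket mapping and key layout above can be sketched in a few lines. The function names here are hypothetical; only the `userid-event-period-bucketid` layout and the `NOW()/period` mapping come from the slides:

```python
import time

def bucket_id(period, now=None):
    """Map a Unix timestamp to the index of its fixed-size time bucket."""
    now = time.time() if now is None else now
    return int(now // period)

def bucket_key(user_id, event, period, now=None):
    """Build the primary key for one (user, event, period) bucket."""
    return f"{user_id}-{event}-{period}-{bucket_id(period, now)}"

# One key per time frame: checking two periods means two lookups.
keys = [bucket_key("nickg", "login", p) for p in (600, 3600)]
```
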
    20. Quantized RL Accuracy. Not exact. If you set N per period, quantized rate limits may go as high as (N-1)×2 per period, e.g. 10 per minute --> 18 per minute. Yikes. Maths!
    21. In Pictures. [Figure: rate limit is “10”; 9 events in one bucket (OK) and 9 in the next bucket (OK) add up to 18 within one period (ooops)]
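
The worst case can be reproduced with a small simulation. This is a sketch under one assumption read off the picture: a limit of N lets N-1 events through per bucket. The helper names are made up for illustration:

```python
from collections import Counter

def allowed_events(timestamps, period, limit):
    """Return the timestamps a per-bucket counter lets through.

    Assumption: an event is denied once its bucket's count reaches `limit`,
    so each bucket admits at most limit - 1 events.
    """
    counts = Counter()
    allowed = []
    for ts in sorted(timestamps):
        bucket = int(ts // period)
        counts[bucket] += 1
        if counts[bucket] < limit:
            allowed.append(ts)
    return allowed

def max_in_sliding_window(timestamps, period):
    """Largest number of events that fall inside any window of length `period`."""
    ts = sorted(timestamps)
    best = 0
    start = 0
    for end in range(len(ts)):
        while ts[end] - ts[start] >= period:
            start += 1
        best = max(best, end - start + 1)
    return best

# A burst straddling the minute boundary: 20 events just before, 20 just after.
burst = [50 + 0.5 * i for i in range(20)] + [60 + 0.5 * i for i in range(20)]
passed = allowed_events(burst, period=60, limit=10)
worst = max_in_sliding_window(passed, period=60)
```

Here `worst` comes out as 18: nine events pass at the end of one bucket, nine more at the start of the next, and all of them sit within a single 60-second span.
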
    22. Rate-Limits at Scale
       • We traded exact accuracy and flexibility for scaling.
       • Implementation using Memcache or Redis (and perhaps SQL): set nickg-login-60-212331231 += 1
       • Well-known sharding techniques
       • Auto-expiration of old buckets
       • Each set/get takes 1/10 of a millisecond or less. Almost invisible.
    23. Memory
       • Say 256 bytes per bucket
       • 10,000,000 buckets is a lot of buckets
       • But that is only about 2.5 GB, and fixed
       • This is easy on one machine.
    24. Usage
    25. Please write unit tests!
       • Easy to get wrong, and consequences can be unpleasant
       • Edge cases and race conditions
       • memcache doesn’t have an “insert or increment” operation. You need to do multiple steps and check error conditions.
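
One common way to build "insert or increment" from separate steps is sketched below. It assumes a client whose `incr` returns `None` for a missing key and whose `add` returns `False` when the key already exists. Real client libraries differ, so check yours. `FakeMemcache` is an in-memory stand-in written for this example, not a real library class:

```python
class FakeMemcache:
    """In-memory stand-in for a memcache client (the TTL is accepted but ignored)."""

    def __init__(self):
        self.data = {}

    def add(self, key, value, expire=0):
        """Store only if the key is absent; report whether it was stored."""
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta):
        """Increment an existing key; return None if the key is missing."""
        if key not in self.data:
            return None
        self.data[key] += delta
        return self.data[key]

def increment_bucket(client, key, ttl):
    """Insert-or-increment: return the bucket's count after this event."""
    count = client.incr(key, 1)
    if count is not None:
        return count
    # Key missing: try to create it. add() fails if another request got there first.
    if client.add(key, 1, expire=ttl):
        return 1
    # Lost the race, so the key exists now: increment it.
    count = client.incr(key, 1)
    if count is None:
        # Key vanished between add and incr (expired or evicted).
        raise RuntimeError(f"could not increment {key!r}")
    return count
```

The fake client also makes the race testable: a subclass whose `add` always loses exercises the retry branch without needing a server.
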
    26. Please make an API
       • Make it simple for anyone to add rate limiting to their code.
       • Make it one line:
         // event, period, max events
         if (rate_limit_exceed("signin", 60, 5)) {
             // do something
         }
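
A Python rendering of that one-line API might look like the sketch below. The in-process dictionary and the `user_id` and `now` parameters are assumptions made so the example runs standalone; the talk's version presumably takes the user from request context and stores counts in Memcache or Redis:

```python
import time
from collections import defaultdict

# Hypothetical in-process store. Old buckets are never expired here;
# a shared datastore with TTLs would handle that in production.
_counts = defaultdict(int)

def rate_limit_exceed(event, period, max_events, user_id="anonymous", now=None):
    """Count this event and report whether the current bucket is over the limit.

    Design choice for this sketch: the first `max_events` events in a bucket
    pass, and later ones return True.
    """
    now = time.time() if now is None else now
    key = f"{user_id}-{event}-{period}-{int(now // period)}"
    _counts[key] += 1
    return _counts[key] > max_events

if rate_limit_exceed("signin", 60, 5, user_id="nickg"):
    pass  # do something: log, show a CAPTCHA, deny the request
```
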
    27. Rollout
       • Once in production, start with guesstimates for rate limits
       • If a rate limit is triggered, take no action and only log/graph
       • Does volume match expectations?
       • Wash, rinse, repeat until tuned appropriately
    28. Oh yeah, don’t forget: put your rate-limit datastore behind the firewall.
    29. So a user hit a rate limit. Now what? A dialog with product, customer service and engineering:
       • Do you let them know? (visible indicator)
       • Do you start CAPTCHA-ing?
       • Do you black-hole it? (silent)
       Also keep logging and graphing. You’ll need these to debug when things go awry.
    30. Intermission
    31. I feel bad if I don’t use a graph in a presentation. [Graph: CAPTCHA, Etsy API]
    32. How we do it
       • We use Graphite for real-time graphing: http://graphite.wikidot.com/
       • We use StatsD as our API: http://etsy.me/dQwVXi https://github.com/etsy/statsd
       • Our apps do this: StatsD::increment("signins");
       • UDP based, so it can’t break the application
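
To show why a UDP counter cannot break the application, here is a minimal sketch of a StatsD-style increment. The `name:value|c` counter format and port 8125 are StatsD conventions; the function names and the localhost default are assumptions for this example:

```python
import socket

def counter_payload(name, value=1):
    """Format a StatsD counter line, e.g. b'signins:1|c'."""
    return f"{name}:{value}|c".encode("ascii")

def increment(name, host="127.0.0.1", port=8125):
    """Fire-and-forget one counter increment; errors never reach the caller."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(counter_payload(name), (host, port))
    except OSError:
        pass  # metrics are best-effort: drop the sample rather than fail the request

increment("signins")
```

There is no connection and no reply to wait for, so a slow or missing StatsD daemon costs the application at most one dropped datagram.
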
    33. Division Built-in! Combine, mix and match data in Graphite to discover new insights.
       • Seasonal data: hard to alert on
       • But the ratio of two such series is nearly constant: easy to alert on
       Who knew “1 in 5 logins are failures” is universal?! P.S. Holt-Winters exponential smoothing is also built in.
    34. OK, back to rate-limiting
    35. Laddering
       • Use laddering to do rate limits at different time scales for the same event.
       • Set a short period and high rate to prevent bursts
       • Then set a longer period with a lower rate to prevent slow-crawling robots.
    36. Ladder longer periods to have a smaller rate. Negative example: 2 per minute (~0.033 events per sec) is 2×60 = 120 per hour, so laddering with 300 per hour (~0.083 events per sec) does nothing, but 100 per hour (~0.028 events per sec) is good. Oh no! The maths again!
    37. In Pictures... Rate limit of “3 per 1 box”: OK. Rate limit of 5 per 3 boxes: alert! (good). But, say, a rate limit of 100 per 3 boxes does nothing and is impossible to trigger.
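
The laddering rule is easy to check mechanically: a longer rung only adds protection if its rate, in events per second, is strictly lower than that of every shorter rung. The sketch below uses a hypothetical helper name and exact fractions to avoid rounding surprises:

```python
from fractions import Fraction

def ladder_is_effective(rungs):
    """Check a ladder given as (max_events, period_seconds) pairs.

    Returns True when, ordered by period, each longer rung has a strictly
    lower events-per-second rate than the one before it.
    """
    ordered = sorted(rungs, key=lambda rung: rung[1])
    rates = [Fraction(max_events, period) for max_events, period in ordered]
    return all(later < earlier for earlier, later in zip(rates, rates[1:]))

# The example from the slide: 2 per minute is already 120 per hour.
does_nothing = ladder_is_effective([(2, 60), (300, 3600)])  # 300/hour can never fire
is_good = ladder_is_effective([(2, 60), (100, 3600)])       # 100/hour catches slow crawls
```
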
    38. Anonymous Identifiers
    39. Anonymous Users
       • Hash of (IP + appropriate HTTP headers)
       • Order of headers matters: different browsers order them differently
       • Spoofed user agents don’t always get the order right
       Different type of Anonymous User
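
A sketch of such an identifier follows. The choice of SHA-256 and the particular header names are assumptions; the talk only says "appropriate HTTP headers". The headers are hashed in the order they arrived, so two clients sending the same values in a different order get different identifiers:

```python
import hashlib

# Hypothetical selection of headers to fingerprint.
DEFAULT_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding")

def anonymous_id(ip, headers, names=DEFAULT_HEADERS):
    """Hash the client IP plus selected headers, preserving arrival order.

    `headers` is a list of (name, value) pairs in the order received.
    """
    wanted = {name.lower() for name in names}
    parts = [ip]
    for name, value in headers:
        if name.lower() in wanted:
            parts.append(f"{name.lower()}={value}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()

browser = [("User-Agent", "ExampleBrowser/1.0"), ("Accept", "text/html")]
spoofed = [("Accept", "text/html"), ("User-Agent", "ExampleBrowser/1.0")]
```

With these inputs, `anonymous_id("203.0.113.7", browser)` and `anonymous_id("203.0.113.7", spoofed)` differ even though the header values are identical.
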
    40. Rate Limit Every IP?
       • Probably just Class C (only 16M of them)
       • Maybe useful for just alerting
       • Probably need whitelisting (e.g. AOL)
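
Reading “Class C” loosely as any /24 block, which is where the 16M figure (2^24) comes from, the per-network identifier can be derived with Python's standard `ipaddress` module. The function name is hypothetical:

```python
import ipaddress

def class_c_key(ip):
    """Map an IPv4 address to its enclosing /24 network, used as the rate-limit identifier."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))

# Upper bound on the number of distinct /24 buckets.
TOTAL_SLASH_24 = 2 ** 24
```

All 256 addresses in a block share one key, so the bucket count stays bounded no matter how many individual addresses show up.
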
    41. Rate Limit Datacenters. http://github.com/client9/ipcat Datacenter / Rent-A-Slice / “hands not on keyboard” / leasable CPU and network. How much traffic is coming from them?
    42. http://github.com/client9/ipcat No implication of wrongdoing if on the list.
    43. Summary
       • Almost every action on Etsy has a laddered rate limit
       • We learn the hard way what is not limited
       • Virtually no performance impact at scale
       • Should we open source the driver?
    44. Nick Galbreath, nickg@etsy.com, @ngalbreath. SANS AppSec Las Vegas 2012
