High Scalability Toronto: Meetup #2

Slides from the second meeting of the Toronto High Scalability Meetup @ http://www.meetup.com/toronto-high-scalability/

- Basics of High Scalability and High Availability
- Using a CDN to Achieve 99% Offload
- Caching at the Code Layer

  1. High Scalability: Basics of scale and availability
  2. Who am I?
     • Jonathan Keebler, @keebler, keebler.net
     • Built the video player for all CTV properties
     • Worked on news sites like CP24, CTV, TSN
     • CTO, Founder of ScribbleLive
     • Bootstrapped a high scalability startup
       – Credit card limit wasn’t that high, had to find cheap ways to handle the load of top tier news sites
  3. Sample load test: 17 x Windows Server 2008, 2 x Varnish, 4 x nginx, 1 x SQL Server 2008
  4. Scalability vs Availability
     • Often talked about separately
     • Can’t have one without the other
     • Let’s talk about the basic building blocks
  5. Building blocks
     • Content Distribution Network (CDN)
     • Load-balancer
     • Reverse proxy
     • Caching server
     • Origin server
  6. Basic hosting structure
  7. Basic hosting structure, with example products at each layer:
     • CDN: Akamai, CloudFront, EdgeCast
     • Load-balancer: Amazon ELB, F5, HAProxy
     • Reverse proxy: nginx, Squid
     • Caching server: Varnish, aiCache
     • Origin server: LAMP, ASP.NET, node.js
  8. Basic hosting structure, with monitoring added at every one of those layers
  9. Monitor or die
     • If you aren’t monitoring your stack, you have NO IDEA what’s going on
     • Pingdom/WatchMouse/Gomez not enough
       – Don’t help you when you’re trying to figure out what’s going wrong
       – You need actionable metrics
  10. Monitor or die
     • Outside monitoring, e.g. Pingdom, Gomez
       – DNS problems, localized problems, SLA
     • Inside monitoring, e.g. New Relic, CloudWatch, Server Density
       – High latency, CPU spikes, memory crunch, peek-a-boo servers, rogue processes, SQL queries per second, SQL wait time, SQL locks, disk usage, disk IO performance, page file usage, network traffic, requests per second, …
  11. New Relic
     • Dashboard
  12. Alerting
     • Don’t send alerts to your email
       – Try to work with notifications coming in every second
     • PagerDuty
     • Don’t overdo it = alert fatigue
  13. Basic hosting structure
     • Now back to our servers...
  14. Load-balancers
     • Bandwidth limits on dedicated boxes are harder to work around
     • F5s are great boxes, but have lousy live reporting = you can get into trouble quickly
     • Adding/removing servers sucks
     • DNS load-balancing sucks for everyone
  15. nginx
     • Fantastic at handling a massive number of requests (low CPU, low memory)
     • Easy to configure and change on-the-fly
     • Gzip, modify headers, host names
     • Proxy with error intercept
     • No query string or IF-statement* support
  16. Varnish
     • A caching server, but so much more
     • Fantastic at handling a massive number of requests (low CPU, low memory)
     • Easy to configure and change on-the-fly
     • Protects your origin servers
     • Deals with errors from origin servers
  17. Origin servers
     • Whatever tweaks you make will never help enough
       – e.g. if your disk IO is becoming a problem, it’s already too late to save you
     • Keep them stock so you don’t blow your mind, easier to deploy
     • Handle any query string hacking in Varnish
  18. Databases
     • No silver bullet
     • Two options:
       – Shard (split your data between servers)
       – Cluster (many boxes working together as one)
     • Shards commonly used today
       – Lots of work at the code level, no incremental IDs
     • Clusters have a single point of failure
       – Try upgrading one and tell me they don’t
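
For illustration, a minimal sketch of what hash-based shard routing at the code level can look like (in TypeScript, not the speaker's stack): the shard names and MD5 bucketing are assumptions, and IDs are generated in code rather than auto-incremented.

```typescript
import { createHash } from "crypto";

// Hypothetical shard pool; in practice each entry would be a real connection.
const SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"];

// Deterministically map a key (e.g. a user ID) to one shard. Because IDs are
// generated in code (e.g. UUIDs) rather than auto-incremented, any ID can be
// hashed the same way on every server.
function shardFor(key: string): string {
  const digest = createHash("md5").update(key).digest();
  const bucket = digest.readUInt32BE(0) % SHARDS.length;
  return SHARDS[bucket];
}

// Every read and write for this user is routed to the same shard.
console.log(shardFor("user-3fa85f64-5717-4562-b3fc-2c963f66afa6"));
```
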
  19. Discussion
     • What stack do you use?
     • What database do you use?
     • SQL vs NoSQL
  20. High Scalability: Content Distribution Networks
  21. Basics
     • Worldwide network of DNS load-balanced reverse proxies
     • Not magic
     • Can achieve 99% offload if you do it right
     • Have to understand your requests
  22. Market leaders
     • Akamai: market leader, $$$, most options, yearly contracts, pay for GB + request headers
     • CloudFront: built on AWS, cheaper, pay-as-you-go, fewer features, new features coming quickly, GB + pay-per-request
     • EdgeCast (pay-as-you-go through GoGrid), CloudFlare (optimizer, security, easy!)
  23. Tiered distribution
     • More points-of-presence (POPs) = less caching if your traffic is global
     • Need to put a layer of servers between the POPs and your servers
     • Sophisticated setups throttle requests
       – If 100 come in at the same time, only 1 gets through
  24. Cache keys
     • Need to have the same query string to get a cached result
     • Some CDNs can ignore params
       – Important if you need a random number on the query string to prevent browser caching
     • Cool options: case sensitive/insensitive, cache differently based on cookie, headers
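
As a rough sketch of the "ignore params" idea applied at the code layer (the parameter names here are assumptions), a cache key can be built from a normalized URL so a random cache-busting parameter does not fragment the cache:

```typescript
// Query string params that should not affect the cached object (assumed names).
const IGNORED_PARAMS = new Set(["rand", "cachebuster", "_"]);

function cacheKey(rawUrl: string, caseInsensitive = true): string {
  const url = new URL(rawUrl);
  const kept = [...url.searchParams.entries()]
    .filter(([name]) => !IGNORED_PARAMS.has(name.toLowerCase()))
    .sort(([a], [b]) => a.localeCompare(b)); // stable param order = stable key

  const query = kept.map(([k, v]) => `${k}=${v}`).join("&");
  const path = caseInsensitive ? url.pathname.toLowerCase() : url.pathname;
  return `${url.host}${path}${query ? "?" + query : ""}`;
}

// Both of these map to the same cached object:
console.log(cacheKey("http://www.example.com/Widget.js?rand=123"));
console.log(cacheKey("http://www.example.com/widget.js?rand=987"));
```
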
  25. Invalidations suck
     • Trying to get the CDN to drop its cache is hard
       – Takes a long time to reach all POPs
       – Triggers a thundering herd
       – Takes out all caching for a bit
     • Build the ability to change query strings at the code layer
       – e.g. add a version number to JS/CSS URLs; when you roll out, it breaks the cache
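
A minimal sketch of versioning query strings at the code layer; the BUILD_VERSION environment variable is an assumption, and the point is only that every rollout produces new asset URLs instead of a CDN invalidation:

```typescript
// Assumed to change on every deploy (e.g. injected by the build).
const BUILD_VERSION = process.env.BUILD_VERSION ?? "2013-04-01.1";

// Append the version to static asset URLs so a rollout naturally produces
// new CDN cache keys; no purge or invalidation is needed.
function assetUrl(path: string): string {
  const sep = path.includes("?") ? "&" : "?";
  return `${path}${sep}v=${encodeURIComponent(BUILD_VERSION)}`;
}

console.log(assetUrl("/js/app.js"));    // /js/app.js?v=2013-04-01.1
console.log(assetUrl("/css/site.css")); // /css/site.css?v=2013-04-01.1
```
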
  26. How long to cache for?
     • As long as you need, but no longer
     • Make sure you think about the error case, i.e. what if an error gets cached
       – Some CDNs let you set your own rules for that
       – Remember, invalidations suck
  27. Thundering herds
  28. Thundering herds
     • When you roll out or have high latency, all your timeouts align
       – Origins get slammed at a regular interval by the POPs
     • Random TTLs are your friend
       – Just +/- a few minutes can be a big help
       – TIP: break into C in Varnish
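
The slide's tip is to drop into C inside Varnish; the same random-TTL idea, sketched at the application layer with an assumed base TTL, looks roughly like this:

```typescript
// Base TTL plus/minus some jitter so that objects cached at the same moment
// do not all expire at the same moment and slam the origin together.
function jitteredTtlSeconds(baseSeconds: number, jitterSeconds: number): number {
  const jitter = (Math.random() * 2 - 1) * jitterSeconds; // in [-jitter, +jitter]
  return Math.max(1, Math.round(baseSeconds + jitter));
}

// A nominal 10-minute TTL becomes anything from roughly 8 to 12 minutes.
console.log(jitteredTtlSeconds(600, 120));
```
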
  29. Don’t build your own*
     • You will never be as smart as Akamai/Amazon
     • You will never be able to bring on new servers fast enough to scale
     • Spend your time building awesome software
     • *Do build your own caching layer for the POPs (and, just in case, to protect your origin servers)
  30. Discussion
     • What CDN do you use?
     • War stories
  31. High Scalability: Caching in Code
  32. Why do I need this?
     • You can’t cache every request
     • You can’t cache POST requests
     • Protect the database!
     • The longer you can go before you have to shard your database, the better
  33. What is it?
     • In-process, in-memory caching
     • Static variables work great
       – TIP: .NET: static variables are scoped in the thread, WHY?!
     • Custom memory stores
     • Whatever you want, just not the disk
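
A minimal sketch of in-process, in-memory caching in a Node-style codebase (a module-level Map standing in for a static variable; the key names and TTLs are assumptions):

```typescript
interface Entry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// Module-level store: lives for the life of the process, never touches disk.
const store = new Map<string, Entry<unknown>>();

function cacheSet<T>(key: string, value: T, ttlMs: number): void {
  store.set(key, { value, expiresAt: Date.now() + ttlMs });
}

function cacheGet<T>(key: string): T | undefined {
  const entry = store.get(key);
  if (!entry) return undefined;
  if (Date.now() > entry.expiresAt) {
    store.delete(key); // lazily drop expired entries
    return undefined;
  }
  return entry.value as T;
}

// Usage (hypothetical key and data):
// cacheSet("homepage-headlines", headlines, 30_000);
// const cached = cacheGet<string[]>("homepage-headlines");
```
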
  34. Isn’t that what Memcached is for?
     • Memcached is in-memory, BUT so is your database
       – Advantages of Memcached over your database:
         • Cheaper to replicate
         • Fast lookups... if your db sucks
       – Disadvantages:
         • Still has network latency, higher than a db lookup (unless your db sucks)
         • IT’S NOT A DATABASE!
  35. Getting started
     • Think about your data + classes
     • TTLs based on knowledge of your data
     • Random TTLs (avoid the thundering herd)
     • Use concurrent, thread-safe objects
     • Wrap your code in try-catch
       – Caching isn’t worth breaking your site for
  36. Updating cache
     • Use semaphores (that Comp Sci degree is finally going to come in handy)
     • Semaphores should always unlock on their own
       – Your thread could die/time out at any time; you don’t want to lock forever
     • Use a separate thread for the lookup. Why should one user suffer?
     • Using a datetime semaphore is usually the best approach
       – Keep a time when the next update will take place
       – The 1st thread to hit that time immediately adds some seconds to the time, buying itself enough time to do the lookup
       – Any blocked thread gets cached data. DON’T LOCK
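
A sketch of the datetime-semaphore pattern described above, assuming a single-process Node setting: the first caller past the refresh time pushes the deadline forward and refreshes in the background, while every caller keeps getting cached data. The names and timings are assumptions.

```typescript
interface Cached<T> {
  value: T;
  nextRefreshAt: number; // the "datetime semaphore"
}

const TTL_MS = 30_000;          // normal refresh interval (assumed)
const REFRESH_GRACE_MS = 5_000; // extra time the refreshing caller buys itself

const entries = new Map<string, Cached<unknown>>();

function getWithRefresh<T>(key: string, lookup: () => Promise<T>): T | undefined {
  const entry = entries.get(key) as Cached<T> | undefined;
  if (!entry) return undefined; // first-time population is handled separately

  const now = Date.now();
  if (now >= entry.nextRefreshAt) {
    // First caller past the deadline pushes the deadline forward immediately,
    // so no one else starts a refresh, then does the lookup in the background.
    entry.nextRefreshAt = now + REFRESH_GRACE_MS;
    lookup()
      .then((fresh) => {
        entries.set(key, { value: fresh, nextRefreshAt: Date.now() + TTL_MS });
      })
      .catch(() => {
        // Caching isn't worth breaking the site for: keep serving stale data.
      });
  }

  // Every caller, including the one that triggered the refresh, gets cached data.
  return entry.value;
}
```

If the background lookup dies, the bumped deadline simply passes and the next caller retries, which gives the "unlocks on its own" behaviour the slide asks for.
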
  37. Populating the cache for the first time
     • How do you prevent a thundering herd before the cache exists?
     • Ok, you may have to lock. But be smart about it.
     • Are you sure your database can’t handle it?
     • This is where other caching layers help: CDN throttling, Varnish throttling, Memcached, read-only databases
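
One way to "lock, but be smart about it" in a single Node process (an assumption, not necessarily the speaker's approach) is to coalesce concurrent first-time lookups onto one in-flight promise, so the database sees exactly one query:

```typescript
// In Node, an in-flight Promise can serve as the "smart lock": every caller
// that arrives before the first lookup finishes awaits the same Promise.
const inFlight = new Map<string, Promise<unknown>>();

async function populateOnce<T>(key: string, lookup: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key) as Promise<T> | undefined;
  if (existing) return existing;

  const pending = lookup().finally(() => inFlight.delete(key));
  inFlight.set(key, pending);
  return pending;
}
```
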
  38. Garbage collection
     • Keep counters for metrics, e.g. how many hits the cached object has had, datetime of the last request for that object
     • Every X something, run your garbage collection
       – Use semaphores
       – Don’t get rid of the most-used objects
     • You are going to collide with running code
       – try-catch is your friend
     • Don’t be afraid to dump the cache and start over
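
A rough sketch of the bookkeeping this slide describes: per-entry hit counters and last-request times, plus a periodic sweep that evicts the least-used entries first. The size limit and sweep interval are assumptions.

```typescript
interface Tracked<T> {
  value: T;
  hits: number;            // how many times this object has been requested
  lastRequestedAt: number; // datetime of the last request, epoch millis
}

const MAX_ENTRIES = 10_000;       // assumed ceiling
const SWEEP_INTERVAL_MS = 60_000; // "every X something"

const cache = new Map<string, Tracked<unknown>>();

// Call this from the cache's read path on every hit.
function recordHit(key: string): void {
  const entry = cache.get(key);
  if (!entry) return;
  entry.hits += 1;
  entry.lastRequestedAt = Date.now();
}

function sweep(): void {
  if (cache.size <= MAX_ENTRIES) return;
  try {
    // Least-hit, least-recently-requested entries go first,
    // so the most-used objects are kept.
    const victims = [...cache.entries()]
      .sort((a, b) =>
        a[1].hits - b[1].hits || a[1].lastRequestedAt - b[1].lastRequestedAt)
      .slice(0, cache.size - MAX_ENTRIES);
    for (const [key] of victims) cache.delete(key);
  } catch {
    // Colliding with running code is expected; if anything goes wrong,
    // don't be afraid to dump the cache and start over.
    cache.clear();
  }
}

setInterval(sweep, SWEEP_INTERVAL_MS);
```
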
  39. Watch out for references
     • If you are storing something in a cache object, you can save a lot of memory by passing a reference to the object
     • Don’t forget about the reference
     • Watch out for garbage collection trying to destroy it
     • The cache update operation might involve updating an existing object
  40. The curse
     • More servers = more caches = less efficient
     • Discipline: you can’t throw more servers at the problem
  41. Totally worth it! (chart: requests per minute to origin servers)
  42. Totally worth it! (chart: CPU of 1 x SQL Server 2008 database)
  43. Discussion
     • What do you use to cache at the code layer?
     • War stories
  44. Thank you!
     • Jonathan Keebler
     • jonathan@scribblelive.com
     • @keebler
