How Do You Scale for Both Predictable and Unpredictable Events on Such a Large Scale?
Surge 2013
We’re going to talk about this:
Whitney Houston Death: February 11, 2012
… and this:
Without your site going down…
Who Am I?
• Team Lead of CBC.ca System Administration team.
• Been with CBC for over 11 years (since 2002).
• @blakecrosby
• me@blakecrosby.com / blake.crosby@cbc.ca
Let’s go back in time…
…way back
2010
2008
2007
2006
2005
2004
2003
“News stories must appear on the site as fast as
possible!”
- Every Journalist at CBC
This architecture doesn’t work for news websites.
This was an important lesson for CBC
Breaking news traffic
It’s unpredictable and short-lived.
From 12k hit/s to 30k hit/s
Royal Baby: July 22, 2013
From 1 Gbps to 2.5 Gbps in ~7 minutes.
Boston Marathon Bombing: April 15, 2013
From 1 Gbps to 14 Gbps in ~10 minutes.
Whitney Houston Death: February 11, 2012
Challenges we (or you) face
Too expensive to build out infrastructure for traffic
levels that are sustained < 1% of the year.
Content must be flexible to changing traffic conditions
We have valuable information that users need in a
crisis.
“News stories must appear on the site as fast as
possible!”
- Every Journalist at CBC
How we fixed this problem
(back in 2003, remember?)
Save everything to disk.
Advantages
• Observes the principle of least surprise.
• Fast
• Takes advantage of OS and FS caches
• Easy to turn off certain site features.
Using SSIs (Server Side Includes)
• Primitive, but fast and secure.
• Can turn off site features or change look and feel by editing one file.
• All pages are updated instantly, without having to wait for pages to be republished.
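The baking-plus-SSI approach can be sketched in a few lines of Python. This resolver and its fragment names are illustrative only; in production, Apache's mod_include processes `<!--#include virtual="..." -->` directives natively against pre-baked files on disk:

```python
import re

# Hypothetical fragment store; in production these are pre-baked
# files on disk, one per site component.
FRAGMENTS = {
    "/includes/header.html": "<header>CBC.ca</header>",
    "/includes/ticker.html": "<div>Breaking news ticker</div>",
}

SSI_DIRECTIVE = re.compile(r'<!--#include virtual="([^"]+)" -->')

def resolve_ssi(page, fragments):
    """Replace each SSI include directive with its fragment's content.

    A missing fragment resolves to an empty string, which is how a
    feature (e.g. the ticker) can be turned off by emptying one file.
    """
    return SSI_DIRECTIVE.sub(lambda m: fragments.get(m.group(1), ""), page)

page = ('<!--#include virtual="/includes/header.html" -->'
        '<p>Story body</p>'
        '<!--#include virtual="/includes/ticker.html" -->')
print(resolve_ssi(page, FRAGMENTS))
```

Because every page references the same fragments, editing one include file changes the whole site at once, without republishing.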
Use a Content Delivery Network
Use Conditional GETs (If-Modified-Since)
Using Expiry and Validation
• Object has a TTL of 30 seconds.
• Object has a last-modified time of Jan 1, 2013 00:00:00.
• Once the TTL has expired, the cache/CDN will check whether the object has been updated.
• The origin will return “304 Not Modified” and the cache will reset the TTL and serve the object from its cache store.
• The 30-second TTL protects the origin from a deluge of If-Modified-Since requests.
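The expiry-plus-validation flow boils down to a small decision at the origin. A minimal sketch in Python — the function and sample values are illustrative, not CBC's actual code; a file-serving origin like Apache performs this check automatically:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(file_mtime, if_modified_since):
    """Return (status, body): a cheap 304 when the cached copy is current.

    The CDN revalidates an expired object by sending If-Modified-Since;
    if the file hasn't changed, only headers go back over the wire.
    """
    if if_modified_since is not None:
        if file_mtime <= parsedate_to_datetime(if_modified_since):
            return 304, None  # cache keeps serving its stored copy
    return 200, "<full page body>"

mtime = datetime(2013, 1, 1, tzinfo=timezone.utc)
# The CDN already holds a copy stamped with this mtime:
status, body = respond(mtime, format_datetime(mtime, usegmt=True))
print(status)  # 304 -- nothing but headers transferred
```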
Use Last Mile Acceleration (GZIP Compression)
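Text assets shrink dramatically under gzip; a quick stdlib illustration (the sample markup is invented):

```python
import gzip

# Repetitive markup, as in a typical HTML page, compresses very well.
html = ("<div class='story'><h2>Headline</h2><p>text</p></div>" * 200).encode()

compressed = gzip.compress(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes "
      f"({len(compressed) / len(html):.1%} of original)")
```

The same logic applies on both hops: enable compression between origin and CDN, and between CDN and users.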
Use persistent HTTP connections
Use Appropriate Cache TTLs. Keep them simple!
Keep tunable options at the origin
Move personalization to the client
Outcomes
(Where we are now in 2013)
Outcomes
• 2003 to 2010 – No need to grow origin
• 2010 to today – 9 origin web servers
• HP DL360 G7
• Average 45-50% CPU utilization
• Capital cost for hardware? $15,000!
Our secret sauce.
(or how to serve 800M requests a day from 9 web servers)
Offload (Bandwidth)
Offload (Hits)
Scaling for Unpredictable Events
Checking the last time a file has changed is faster than
delivering that file to a user.
Conditional GETs (304s) will save you.
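The interaction between TTLs and revalidation can be modelled as a toy cache entry (illustrative only — real CDN logic is far more involved): expiry triggers one cheap conditional GET, and a 304 resets the clock without the body ever being re-transferred.

```python
class CachedObject:
    """Toy model of a CDN cache entry.

    An expired entry is not evicted; it is revalidated against the
    origin, and a 304 simply resets the TTL while the stored body
    keeps being served.
    """
    def __init__(self, body, ttl, now):
        self.body = body
        self.ttl = ttl
        self.fresh_until = now + ttl
        self.revalidations = 0

    def get(self, now, origin):
        """origin() returns None for a 304, or a new body for a 200."""
        if now >= self.fresh_until:
            self.revalidations += 1            # one cheap conditional GET
            new_body = origin()
            if new_body is not None:
                self.body = new_body
            self.fresh_until = now + self.ttl  # 304 or 200: TTL resets
        return self.body

entry = CachedObject("v1", ttl=30, now=0)
print(entry.get(10, lambda: None))  # v1 -- fresh, served from cache
print(entry.get(45, lambda: None))  # v1 -- expired, revalidated with a 304
print(entry.revalidations)          # 1
```

Raising the TTL during a traffic spike widens the window in which no origin contact happens at all.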
Make sure users don’t have to search for content
Increase your TTLs
Turn off dynamic components
Scaling for predictable events
Predicting traffic levels is impossible
Some (loose) rules.
• Scheduled events don’t peak as high as unpredictable ones.
• Scheduled events last longer, so the increase in traffic is spread out over hours, days, or weeks.
• Scheduled events are more “niche”, unlike breaking news, where everyone wants to know what’s going on.
• Might have to worry about 95/5 billing and bandwidth overages.
How do you scale for write operations?
We let someone else deal with that:
In Summary…
• Ensure your TTLs are appropriate.
• Make sure your applications/content return Last-Modified headers.
• Don’t be afraid to change your site to turn off components that aren’t critical during high-traffic periods.
• Keep tunables at the origin. This allows you to make changes quickly without waiting for CDN propagation.
• A CDN will not replace or fix bad origin infrastructure!
• Predicting the scale of a scheduled event is impossible. You will either overestimate or underestimate.
• Use previous traffic levels during unscheduled events as a high-water mark.
• Don’t be afraid to ask someone else (a SaaS provider) to implement a feature that is not your core business/expertise.
Usenix Paper
http://tinyurl.com/lisa-paper
Thank You
@blakecrosby
me@blakecrosby.com
The Canadian Broadcasting Corporation is Canada's national public broadcaster. Our website, www.cbc.ca, is one of the largest and most visited in the country, delivering 700 million hits per day on an origin infrastructure composed of only six web servers.
With the right combination of publishing methods, content delivery networks and fine-tuned caching rules, the CBC’s infrastructure has enough headroom to handle spikes of 40x normal traffic during major news events.
How do you scale to almost infinite capacity when you can't predict the world’s events? It's impossible to prepare for that influx of visitors when a celebrity dies, a natural disaster occurs, or other news breaks. Scaling for predictable events is easier, but although we know when the next federal election, Olympic Games, or FIFA World Cup is scheduled, these events present different challenges. Balancing the architecture for both scenarios is important.

  • CBC is Canada’s public broadcaster. A combination of NPR and PBS, but funded by tax dollars rather than donations. It has a mandate to serve all Canadians and produce Canadian content.
  • Example of a news website
  • In order to understand why our infrastructure is the way it is, we need to go back to a specific event.
  • For CBC, this is when we started taking the web seriously. It was no longer a “fad”.
  • We must beat our competitors online!
  • So naturally we decided to make the story presentation engine dynamically driven. Backed by an Oracle database, with a J2EE front end.
  • That same year, we had a provincial election in the province of Quebec.
  • This is what the site looked like. Real-time voting numbers on the front page for each party.
  • About 150 hits/s
  • This is when we realized that this architecture (at least back in 2003) wasn’t appropriate for a news website.
  • We needed to simplify our infrastructure and presentation model. Running a dynamically generated news website is not scalable.
  • To get a better understanding of why this doesn’t work, let’s take a look at typical traffic patterns for breaking news.
  • So how do you build out an infrastructure to be able to handle these huge spikes?
  • Capital costs are high and CPU utilization will be too low. Servers will be sitting idle the majority of the time.
  • Must be able to change the site based on what is important for visitors, while maintaining functionality that users expect.
  • Going down is not an option.
  • Remember this? We still need to make sure content is published as fast as it’s written. So long cache times are not acceptable.
  • The first thing we did was toss the database and j2ee app out the window.
  • We call this processing “baking”
  • Principle of least surprise. Files are located on disk where you’d expect them to be. No need to know SQL or hunt through database tables.
  • Indicate parts of the site that are controllable. Can turn off “more headlines”, the right rail, or the ticker at the top. Or better yet, it’s easy for us to put a notice at the top of every page, if we wanted.
  • Tried to make the backend as close to a cache as possible. Nothing gets into production without going through the CMS first.
  • Leveraging conditional GETs ensures that there is a small load on the origin, but pages are updated in cache as quickly as possible. It’s the right combination of expiry and validation.
  • Using IMS allows the origin to return only a small payload. Body content is not sent.
  • 75% of requests are for 304 Not Modified. The object was not transferred to the CDN.
  • HTML, JavaScript, CSS, and other text-based files compress very well. Be sure that you have this turned on between the origin and the CDN, and between the CDN and your users.
  • Set up your persistent connections to match those of the CDN. Keeping the TCP connection open reduces the latency required to set up and tear down TCP sessions. CBC uses 301 seconds, one second longer than Akamai’s. This ensures that the origin doesn’t tear down the connection prematurely. We leave management of the connection to the CDN.
  • We have a blanket 20-second TTL on all objects. Understand that at the end of the TTL the object is probably not expired from cache, just revalidated. If you know your content changes less frequently, or “freshness” is less of a priority, then set a higher TTL. Organize the file system based on TTL.
  • Store all your tunable configs at the origin (especially TTLs). This saves on propagation time when you have to change settings or TTLs. Updating an Apache configuration is quicker than pushing a CDN config change to 100,000 servers.
  • Personalization data is stored in cookies. The origin doesn’t dynamically generate pages for users who are signed in; they just fetch a pre-baked file/template based on cookie data. Dynamic content is assembled using AJAX.
  • We wanted to increase the amount of headroom we had, so in 2010 we refreshed our infrastructure hardware and added 3 more servers. The total cost was only $15,000.
  • So, how do we serve 800M requests a day from 9 web servers?
  • We rely heavily on the CDN to deliver content. Our cache offload rate for bandwidth is around 93%.
  • The number of hits is a little lower, at 80%.
  • We rely on this fact.
  • … so we take advantage of 304s
  • More clicks = more traffic. Ensure that the news or information your visitors are coming for can be found in one click or on the home page. Change your “website mode” to a lightweight mode. This will save you bandwidth and ensure your users can find relevant information right away.
  • Since your TTLs are controlled at the origin, there is no need to wait for the CDN to propagate settings. An extra 10 seconds reduces origin load a lot, while keeping content “fresh”.
  • Anything that relies on a sign-on, or cookies, should be turned off.
  • We’ve never been able to accurately predict what kind of load a specific event will generate. We usually have an office pool: the person closest to the peak hits/s or concurrent users wins! However, we do have some guidelines based on previous experience.
  • If you survived that spike in traffic during your last breaking news event, you’re most likely going to be OK for your scheduled event. Niche: this is especially true for sporting events (NHL playoffs, Olympics) and…
  • This is all great, but your site is mainly read operations! How do we handle write operations such as comments, etc.?
  • Commenting engine: Viafoura, Disqus. Polls/surveys: SurveyMonkey, Poll Daddy, Zoomerang. Analytics: Adobe Omniture, Google Analytics.
  • Think about your dynamic application. Is there a way to calculate a Last-Modified header?
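The last note asks whether a dynamic application can calculate a Last-Modified header. One common approach, sketched here with an invented record shape, is to take the newest timestamp among the data feeding the page:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def last_modified_header(records):
    """Derive Last-Modified from the freshest backing record.

    'records' is a hypothetical list of (id, updated_at) rows behind
    the page; the page is only as new as its newest ingredient.
    """
    newest = max(updated_at for _, updated_at in records)
    return format_datetime(newest, usegmt=True)

rows = [
    ("story-1", datetime(2013, 9, 12, 14, 0, tzinfo=timezone.utc)),
    ("story-2", datetime(2013, 9, 12, 15, 30, tzinfo=timezone.utc)),
]
print(last_modified_header(rows))  # Thu, 12 Sep 2013 15:30:00 GMT
```

With that header in place, even a database-backed endpoint can answer If-Modified-Since revalidations with a 304.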