The Canadian Broadcasting Corporation is Canada's national public broadcaster. Our website, www.cbc.ca, is one of the largest and most visited in the country, delivering 700 million hits per day on an origin infrastructure composed of only six web servers.
With the right combination of publishing methods, content delivery networks and fine-tuned caching rules, the CBC’s infrastructure has enough headroom to handle spikes of 40x normal traffic during major news events.
How do you scale to almost infinite capacity when you can't predict the world’s events? It's impossible to prepare for the influx of visitors that follows a celebrity's death, a natural disaster or other breaking news. Scaling for predictable events is easier, but although we know when the next federal election, Olympic Games or FIFA World Cup is scheduled, these events present different challenges. Balancing the architecture for both scenarios is important.
5. Who Am I?
• Team Lead of CBC.ca System Administration team.
• Been with CBC for over 11 years (since 2002).
• @blakecrosby
• me@blakecrosby.com / blake.crosby@cbc.ca
35. Advantages
• Observes the principle of least surprise.
• Fast
• Takes advantage of OS and FS caches
• Easy to turn off certain site features.
37. Using SSIs (Server Side Includes)
• Primitive, but fast and secure.
• Can turn off site features or change look and feel by editing one file.
• All pages are updated instantly, without having to wait for pages to be
republished.
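To make the "edit one file" point concrete, here is a minimal sketch of what an SSI include does, written in Python rather than as server configuration. The fragment paths and the in-memory FRAGMENTS store are hypothetical stand-ins for include files on disk; a real web server performs this expansion itself at request time.

```python
import re

# Hypothetical fragment store standing in for include files on disk.
FRAGMENTS = {
    "/includes/ticker.html": "<div id='ticker'>Latest headlines</div>",
    "/includes/footer.html": "<footer>CBC</footer>",
}

# Matches directives of the form <!--#include virtual="/path" -->
INCLUDE_RE = re.compile(r'<!--#include virtual="([^"]+)"\s*-->')


def expand_includes(page, fragments):
    """Replace each include directive with the fragment it names."""
    return INCLUDE_RE.sub(lambda m: fragments.get(m.group(1), ""), page)


page = '<body><!--#include virtual="/includes/ticker.html" --><p>Story</p></body>'
with_ticker = expand_includes(page, FRAGMENTS)

# Turning the feature off means emptying one fragment.
# The page itself is never republished.
FRAGMENTS["/includes/ticker.html"] = ""
without_ticker = expand_includes(page, FRAGMENTS)
```

Because the page only references the fragment, every page that includes it picks up the change on the next request.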
41. Using Expiry and Validation
• Object has a TTL of 30 Seconds.
• Object has a last modified time of Jan 1, 2013 00:00:00
• Once TTL has expired, cache/CDN will check if object is updated.
• Origin will return "304 Not Modified" and cache will reset TTL and
serve object from cache store.
• The 30 second TTL protects the origin from a deluge of
"If-Modified-Since" requests.
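The expiry and validation steps above can be sketched as a toy cache in front of a toy origin. This is an illustration under stated assumptions, not the CDN's actual logic: the Origin and Cache classes are hypothetical, time is a plain number of seconds, and one request arrives per second for 90 seconds.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    body: str
    last_modified: int  # seconds since the epoch
    full_responses: int = 0
    not_modified_responses: int = 0

    def get(self, if_modified_since=None):
        """Return (status, body, last_modified); body is None on a 304."""
        if if_modified_since is not None and self.last_modified <= if_modified_since:
            self.not_modified_responses += 1
            return 304, None, self.last_modified
        self.full_responses += 1
        return 200, self.body, self.last_modified


class Cache:
    def __init__(self, origin, ttl=30):
        self.origin = origin
        self.ttl = ttl
        self.body = None
        self.last_modified = None
        self.expires_at = 0

    def get(self, now):
        # Fresh object: serve from the cache store, origin is not contacted.
        if self.body is not None and now < self.expires_at:
            return self.body
        # TTL expired (or first request): validate against the origin.
        status, body, last_modified = self.origin.get(self.last_modified)
        if status == 200:
            self.body, self.last_modified = body, last_modified
        # On both 200 and 304 the TTL is reset.
        self.expires_at = now + self.ttl
        return self.body


origin = Origin(body="<html>story</html>", last_modified=1356998400)  # Jan 1, 2013 UTC
cache = Cache(origin, ttl=30)
for second in range(90):
    cache.get(now=second)
```

In this run the cache answers 90 requests while the origin sees three: one full response at second 0 and a 304 at seconds 30 and 60.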
49. Outcomes
• 2003 to 2010 – No need to grow origin
• 2010 to today – 9 origin web servers
• HP DL360 G7
• Average 45-50% CPU utilization
• Capital cost for hardware? $15,000!
61. Some (loose) rules.
• Scheduled events don't peak as high as unpredictable ones.
• Scheduled events last longer, so increase in traffic is spread out over
hours, days, or weeks.
• Scheduled events are more "niche". Unlike breaking news where
everyone wants to know what's going on.
• Might have to worry about 95/5 and bandwidth overages.
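The 95/5 concern is easier to see with numbers. Assuming "95/5" refers to 95th percentile (burstable) billing, where the top 5% of bandwidth samples in a billing period are discarded and the highest remaining sample is billed, a short spike is free while a sustained event is not. The sample values below are invented for illustration, and real providers may differ in sampling interval and rounding.

```python
def billable_rate_95th(samples_mbps):
    """Return the highest sample left after discarding the top 5%."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Number of samples kept: ceil(95% of n), using integer arithmetic.
    keep = -(-len(ordered) * 95 // 100)
    return ordered[keep - 1]


# Breaking news: a brief spike covering 4% of the period.
short_spike = [100] * 96 + [4000] * 4
# Scheduled event: elevated traffic covering 10% of the period.
sustained_event = [100] * 90 + [4000] * 10

spike_bill = billable_rate_95th(short_spike)
event_bill = billable_rate_95th(sustained_event)
```

The short spike falls entirely inside the discarded 5%, so it bills at the baseline rate. The longer event pushes the 95th percentile up to the event's rate.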
65. • Ensure your TTLs are appropriate
• Make sure your applications/content return Last-Modified headers.
• Don't be afraid to change your site to turn off components that aren't
critical during high traffic periods.
• Keep tunables at the Origin. This allows you to make changes quickly
without waiting for CDN propagation.
• A CDN will not replace or fix bad origin infrastructure!
66. • Predicting the scale of a scheduled event is impossible. You will either
overestimate or underestimate.
• Use previous traffic levels during unscheduled events as a high water
mark.
• Don't be afraid to ask someone else (SaaS provider) to implement a
feature that is not your core business/expertise.
CBC is Canada's public broadcaster: a combination of NPR and PBS, but funded by tax dollars and not donations. We have a mandate to serve all Canadians and produce Canadian content.
Example of news website
In order to understand why our infrastructure is the way it is, we need to go back to a specific event.
For CBC, this is when we started taking the web seriously. It was no longer a "fad".
We must beat our competitors online!
So naturally we decided to make the story presentation engine dynamically driven, backed by an Oracle database and a J2EE front end.
That same year, we had a provincial election in the province of Quebec.
This is what the site looked like. Real-time voting numbers on the front page for each party.
About 150 hits/s
This is when we realized that this architecture (at least back in 2003) wasn't appropriate for a news website.
We needed to simplify our infrastructure and presentation model. Running a dynamically generated news website is not scalable.
To get a better understanding of why this doesn't work, let's take a look at typical traffic patterns for breaking news.
So how do you build out an infrastructure to be able to handle these huge spikes?
Capital costs are high and CPU utilization will be too low. Servers will be sitting idle the majority of the time.
Must be able to change the site based on what is important for visitors, while maintaining functionality that users expect.
Going down is not an option.
Remember this? We still need to make sure content is published as fast as it’s written, so long cache times are not acceptable.
The first thing we did was toss the database and J2EE app out the window.
We call this process “baking”.
Principle of least surprise: files are located on disk where you think they’d be. No need to know SQL or hunt through database tables.
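As a rough sketch of what "baking" means in practice: the page is rendered once at publish time and written to the path that its URL maps to, so serving it is a plain file read. The bake function, the URL layout and the HTML template here are hypothetical, not CBC's actual publishing code.

```python
import html
import tempfile
from pathlib import Path


def bake(docroot, url_path, title, body):
    """Render a story once and write it where its URL points."""
    target = docroot / url_path.lstrip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        f"<html><head><title>{html.escape(title)}</title></head>"
        f"<body><h1>{html.escape(title)}</h1><p>{html.escape(body)}</p></body></html>",
        encoding="utf-8",
    )
    return target


# A temporary directory stands in for the web server's document root.
docroot = Path(tempfile.mkdtemp())
baked = bake(docroot, "/news/story/2013/01/01/example.html", "Example", "Body text")
```

Finding the file for a given page is then a matter of reading the URL, which is the least-surprise property described above.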
Indicates the parts of the site that are controllable. We can turn off “more headlines”, the right rail, or the ticker at the top. Or better yet, it's easy for us to put a notice at the top of every page if we wanted.
We tried to make the backend as close to a cache as possible. Nothing gets into production without going through the CMS first.
Leveraging conditional GETs ensures that there is only a small load on the origin, while pages are updated in cache as quickly as possible. It's the right combination of expiry and validation.
Using If-Modified-Since (IMS) allows the origin to return only a small payload. Body content is not sent.
75% of requests result in a 304 Not Modified response. The object is not transferred to the CDN.
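On the origin side, the If-Modified-Since check amounts to a date comparison. The sketch below is a simplified illustration rather than a full HTTP implementation: the respond function is hypothetical, and it ignores ETags and other validators.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, body, if_modified_since=None):
    """Return (status, headers, payload) for a GET on a static object."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                # Unchanged: headers only, no body.
                return 304, headers, b""
    return 200, headers, body


modified = datetime(2013, 1, 1, tzinfo=timezone.utc)
page = b"<html>" + b"x" * 50000 + b"</html>"

first = respond(modified, page)
revalidation = respond(modified, page, "Tue, 01 Jan 2013 00:00:00 GMT")
```

The first response carries the whole page. The revalidation carries an empty payload, which is why a high share of 304s keeps origin bandwidth low.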
HTML, JavaScript, CSS, and other text-based files compress very well. Be sure that you have compression turned on between origin and CDN, and between the CDN and your users.
Set up your persistent connections to match those of the CDN. Keeping the TCP connection open avoids the latency of setting up and tearing down TCP sessions. CBC uses 301 seconds, 1 second longer than Akamai's. This ensures that the origin doesn’t tear down the connection prematurely. We leave management of the connection to the CDN.
We have a blanket 20 second TTL on all objects. Understand that at the end of the TTL the object is probably not expired from cache, just revalidated. If you know your content changes less frequently, or "freshness" is less of a priority, then set a higher TTL. Organize the file system based on TTL.
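One way to read "organize the file system based on TTL" is that each directory carries a single TTL, so the caching rule is a prefix lookup. The sketch below assumes standard Cache-Control max-age headers; the directory names and the longer TTL values are invented for illustration, and only the 20 second default comes from the talk.

```python
# Hypothetical layout: each top-level directory carries one TTL, in seconds.
TTL_BY_PREFIX = {
    "/news/": 20,
    "/archives/": 3600,
    "/static/": 86400,
}
DEFAULT_TTL = 20  # the blanket TTL mentioned in the talk


def cache_control_for(path):
    """Use the TTL of the longest matching prefix, else the default."""
    matches = [prefix for prefix in TTL_BY_PREFIX if path.startswith(prefix)]
    ttl = TTL_BY_PREFIX[max(matches, key=len)] if matches else DEFAULT_TTL
    return f"max-age={ttl}"


news_header = cache_control_for("/news/canada/story.html")
static_header = cache_control_for("/static/css/site.css")
other_header = cache_control_for("/sports/index.html")
```

Keeping the rule this simple means a TTL change is a one-line edit at the origin.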
Store all your tunable configs at the origin (especially TTLs). This saves on propagation time when you have to change settings or TTLs. Updating an Apache configuration is quicker than pushing a CDN config change to 100,000 servers.
Personalization data is stored in cookies. The origin doesn’t dynamically generate pages for users who are signed in. They just fetch a pre-baked file/template based on cookie data. Dynamic content is assembled using AJAX.
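A minimal sketch of cookie-driven selection of pre-baked content, assuming a hypothetical "region" cookie and a hypothetical set of baked variants. The point is that the cookie only selects among files that already exist; nothing is rendered per user.

```python
from http.cookies import SimpleCookie

# Hypothetical set of pre-baked variants that exist on disk.
BAKED_REGIONS = {"toronto", "montreal", "vancouver"}


def baked_path_for(cookie_header):
    """Map a Cookie header to a pre-baked file path."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    region = cookie["region"].value if "region" in cookie else ""
    # Only known variants are served; anything else gets the default file.
    if region in BAKED_REGIONS:
        return f"/includes/local/{region}.html"
    return "/includes/local/default.html"


signed_in = baked_path_for("region=toronto; theme=dark")
anonymous = baked_path_for("")
```

Checking the cookie value against a fixed set also keeps untrusted input out of the file path.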
We wanted to increase the amount of headroom we had in 2010, so we refreshed our infrastructure hardware and added 3 more servers. The total cost was only $15,000.
So, how do we serve
We rely heavily on the CDN to deliver content. Our cache offload rate for bandwidth is around 93%.
The offload rate for hits is a little lower at 80%.
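A back-of-the-envelope calculation shows what an 80% hit offload means for the origin. It assumes that the 700 million hits per day quoted in the abstract is the total across CDN and origin, and that load is spread evenly over the 9 origin servers and over the day. Both are simplifications; real traffic peaks well above the daily average.

```python
TOTAL_HITS_PER_DAY = 700_000_000  # figure from the abstract, assumed to be the total
HIT_OFFLOAD_PERCENT = 80          # share of hits answered by the CDN
ORIGIN_SERVERS = 9                # from the Outcomes slide
SECONDS_PER_DAY = 86_400

origin_hits_per_day = TOTAL_HITS_PER_DAY * (100 - HIT_OFFLOAD_PERCENT) // 100
origin_hits_per_second = origin_hits_per_day / SECONDS_PER_DAY
per_server_hits_per_second = origin_hits_per_second / ORIGIN_SERVERS
```

Under these assumptions the origin averages roughly 1,600 hits per second in total, or about 180 per server, and many of those are small 304 responses.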
We rely on this fact.
… so we take advantage of 304s
More clicks = more traffic. Ensure that the news or information visitors are coming to your site for can be found in 1 click or on the home page. Change your "website mode" to a lightweight mode. This will save you bandwidth and ensure your users can find relevant information right away.
Since your TTLs are controlled at the origin, there is no need to wait for the CDN to propagate settings. An extra 10 seconds reduces origin load a lot, while keeping content “fresh”.
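The effect of the extra 10 seconds can be estimated with simple arithmetic. Each cache asks the origin about a given object at most once per TTL, so the revalidation rate is bounded by the number of caches divided by the TTL. The count of 1,000 edge caches below is a hypothetical figure chosen for illustration.

```python
def max_revalidations_per_second(edge_caches, ttl_seconds):
    """Upper bound on origin revalidations per second for one object."""
    return edge_caches / ttl_seconds


EDGE_CACHES = 1000  # hypothetical number of caches contacting the origin

before = max_revalidations_per_second(EDGE_CACHES, 20)
after = max_revalidations_per_second(EDGE_CACHES, 30)
reduction = 1 - after / before
```

Going from 20 to 30 seconds cuts the worst-case revalidation rate by a third, regardless of how many caches there are, while content is still at most 30 seconds stale.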
Anything that relies on a sign-on or cookies should be turned off.
We've never been able to accurately predict what kind of load a specific event will generate. We usually have an office pool: the person closest to the peak hits/s or concurrent users wins! However, we do have some guidelines based on previous experience.
If you survived that spike in traffic during your last breaking news event, you're most likely going to be OK for your scheduled event. Niche: This is especially true for sporting events (NHL playoffs, Olympics) and
This is all great, but your site is mainly read operations! How do we handle write operations such as comments, etc.?