How Do You Scale for Both Predictable and Unpredictable Events on Such a Large Scale?
Surge 2013
We’re going to talk about this:
Whitney Houston Death: February 11, 2012
… and this:
Without your site going down…
Who Am I?
• Team Lead of CBC.ca System Administration team.
• Been with CBC for over 11 years (since 2002).
• @blakecrosby
• me@blakecrosby.com / blake.crosby@cbc.ca
Let’s go back in time…
…way back
2010
2008
2007
2006
2005
2004
2003
“News stories must appear on the site as fast as
possible!”
- Every Journalist at CBC
This architecture doesn’t work for news websites.
This was an important lesson for CBC
Breaking news traffic
It’s unpredictable and short-lived.
From 12k hit/s to 30k hit/s
Royal Baby: July 22, 2013
From 1 Gbps to 2.5 Gbps in ~7 minutes.
Boston Marathon Bombing: April 15, 2013
From 1 Gbps to 14 Gbps in ~10 minutes.
Whitney Houston Death: February 11, 2012
Challenges we (or you) face
Too expensive to build out infrastructure for traffic
levels that are sustained < 1% of the year.
Content must be flexible to changing traffic conditions
We have valuable information that users need in a
crisis.
“News stories must appear on the site as fast as
possible!”
- Every Journalist at CBC
How we fixed this problem
(back in 2003, remember?)
Save everything to disk.
Advantages
• Observes the principle of least surprise.
• Fast
• Takes advantage of OS and FS caches
• Easy to turn off certain site features.
Using SSIs (Server Side Includes)
• Primitive, but fast and secure.
• Can turn off site features or change look and feel by editing one file.
• All pages are updated instantly, without having to wait for pages to be republished.
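The baking-plus-SSI approach can be sketched in a few lines of Python. This resolver and its fragment names are illustrative only; in production, Apache's mod_include processes `<!--#include virtual="..." -->` directives natively against pre-baked files on disk:

```python
import re

# Hypothetical fragment store; in production these are pre-baked
# files on disk, one per site component.
FRAGMENTS = {
    "/includes/header.html": "<header>CBC.ca</header>",
    "/includes/ticker.html": "<div>Breaking news ticker</div>",
}

SSI_DIRECTIVE = re.compile(r'<!--#include virtual="([^"]+)" -->')

def resolve_ssi(page, fragments):
    """Replace each SSI include directive with its fragment's content.

    A missing fragment resolves to an empty string, which is how a
    feature (e.g. the ticker) can be turned off by emptying one file.
    """
    return SSI_DIRECTIVE.sub(lambda m: fragments.get(m.group(1), ""), page)

page = ('<!--#include virtual="/includes/header.html" -->'
        '<p>Story body</p>'
        '<!--#include virtual="/includes/ticker.html" -->')
print(resolve_ssi(page, FRAGMENTS))
```

Because every page references the same fragments, editing one include file changes the whole site at once, without republishing.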
Use a Content Delivery Network
Use Conditional GETs (If-Modified-Since)
Using Expiry and Validation
• Object has a TTL of 30 seconds.
• Object has a last-modified time of Jan 1, 2013 00:00:00.
• Once the TTL has expired, the cache/CDN will check whether the object has been updated.
• The origin will return “304 Not Modified” and the cache will reset the TTL and serve the object from its cache store.
• The 30-second TTL protects the origin from a deluge of If-Modified-Since requests.
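The expiry-plus-validation flow boils down to a small decision at the origin. A minimal sketch in Python — the function and sample values are illustrative, not CBC's actual code; a file-serving origin like Apache performs this check automatically:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(file_mtime, if_modified_since):
    """Return (status, body): a cheap 304 when the cached copy is current.

    The CDN revalidates an expired object by sending If-Modified-Since;
    if the file hasn't changed, only headers go back over the wire.
    """
    if if_modified_since is not None:
        if file_mtime <= parsedate_to_datetime(if_modified_since):
            return 304, None  # cache keeps serving its stored copy
    return 200, "<full page body>"

mtime = datetime(2013, 1, 1, tzinfo=timezone.utc)
# The CDN already holds a copy stamped with this mtime:
status, body = respond(mtime, format_datetime(mtime, usegmt=True))
print(status)  # 304 -- nothing but headers transferred
```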
Use Last Mile Acceleration (GZIP Compression)
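Text assets shrink dramatically under gzip; a quick stdlib illustration (the sample markup is invented):

```python
import gzip

# Repetitive markup, as in a typical HTML page, compresses very well.
html = ("<div class='story'><h2>Headline</h2><p>text</p></div>" * 200).encode()

compressed = gzip.compress(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes "
      f"({len(compressed) / len(html):.1%} of original)")
```

The same logic applies on both hops: enable compression between origin and CDN, and between CDN and users.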
Use persistent HTTP connections
Use Appropriate Cache TTLs. Keep them simple!
Keep tunable options at the origin
Move personalization to the client
Outcomes
(Where we are now in 2013)
Outcomes
• 2003 to 2010 – No need to grow origin
• 2010 to today – 9 origin web servers
• HP DL360 G7
• Average 45-50% CPU utilization
• Capital cost for hardware? $15,000!
Our secret sauce.
(or how to serve 800M requests a day from 9 web servers)
Offload (Bandwidth)
Offload (Hits)
Scaling for Unpredictable Events
Checking the last time a file has changed is faster than
delivering that file to a user.
Conditional GETs (304s) will save you.
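The interaction between TTLs and revalidation can be modelled as a toy cache entry (illustrative only — real CDN logic is far more involved): expiry triggers one cheap conditional GET, and a 304 resets the clock without the body ever being re-transferred.

```python
class CachedObject:
    """Toy model of a CDN cache entry.

    An expired entry is not evicted; it is revalidated against the
    origin, and a 304 simply resets the TTL while the stored body
    keeps being served.
    """
    def __init__(self, body, ttl, now):
        self.body = body
        self.ttl = ttl
        self.fresh_until = now + ttl
        self.revalidations = 0

    def get(self, now, origin):
        """origin() returns None for a 304, or a new body for a 200."""
        if now >= self.fresh_until:
            self.revalidations += 1            # one cheap conditional GET
            new_body = origin()
            if new_body is not None:
                self.body = new_body
            self.fresh_until = now + self.ttl  # 304 or 200: TTL resets
        return self.body

entry = CachedObject("v1", ttl=30, now=0)
print(entry.get(10, lambda: None))  # v1 -- fresh, served from cache
print(entry.get(45, lambda: None))  # v1 -- expired, revalidated with a 304
print(entry.revalidations)          # 1
```

Raising the TTL during a traffic spike widens the window in which no origin contact happens at all.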
Make sure users don’t have to search for content
Increase your TTLs
Turn off dynamic components
Scaling for predictable events
Predicting traffic levels is impossible
Some (loose) rules.
• Scheduled events don’t peak as high as unpredictable ones.
• Scheduled events last longer, so the increase in traffic is spread out over hours, days, or weeks.
• Scheduled events are more “niche”, unlike breaking news, where everyone wants to know what’s going on.
• Might have to worry about 95/5 billing and bandwidth overages.
How do you scale for write operations?
We let someone else deal with that:
In Summary…
• Ensure your TTLs are appropriate.
• Make sure your applications/content return Last-Modified headers.
• Don’t be afraid to change your site to turn off components that aren’t critical during high-traffic periods.
• Keep tunables at the origin. This allows you to make changes quickly without waiting for CDN propagation.
• A CDN will not replace or fix bad origin infrastructure!
• Predicting the scale of a scheduled event is impossible. You will either overestimate or underestimate.
• Use previous traffic levels during unscheduled events as a high-water mark.
• Don’t be afraid to ask someone else (a SaaS provider) to implement a feature that is not your core business/expertise.
Usenix Paper
http://tinyurl.com/lisa-paper
Thank You
@blakecrosby
me@blakecrosby.com
The Canadian Broadcasting Corporation is Canada's national public broadcaster. Our website, www.cbc.ca, is one of the largest and most visited in the country, delivering 700 million hits per day on an origin infrastructure composed of only six web servers.
With the right combination of publishing methods, content delivery networks and fine-tuned caching rules, the CBC’s infrastructure has enough headroom to handle spikes of 40x normal traffic during major news events.
How do you scale to almost infinite capacity when you can't predict the world’s events? It's impossible to prepare for that influx of visitors when a celebrity dies, a natural disaster occurs, or other news breaks. Scaling for predictable events is easier, but although we know when the next federal election, Olympic Games, or FIFA World Cup is scheduled, these events present different challenges. Balancing the architecture for both scenarios is important.

  • CBC is Canada’s public broadcaster. A combination of NPR and PBS, but funded by tax dollars rather than donations. It has a mandate to serve all Canadians and produce Canadian content.
  • Example of a news website
  • In order to understand why our infrastructure is the way it is, we need to go back to a specific event.
  • For CBC, this is when we started taking the web seriously. It was no longer a “fad”.
  • We must beat our competitors online!
  • So naturally we decided to make the story presentation engine dynamically driven. Backed by an Oracle database, with a J2EE front end.
  • That same year, we had a provincial election in the province of Quebec.
  • This is what the site looked like. Real-time voting numbers on the front page for each party.
  • About 150 hits/s
  • This is when we realized that this architecture (at least back in 2003) wasn’t appropriate for a news website.
  • We needed to simplify our infrastructure and presentation model. Running a dynamically generated news website is not scalable.
  • To get a better understanding of why this doesn’t work, let’s take a look at typical traffic patterns for breaking news.
  • So how do you build out an infrastructure to be able to handle these huge spikes?
  • Capital costs are high and CPU utilization will be too low. Servers will be sitting idle the majority of the time.
  • Must be able to change the site based on what is important for visitors, while maintaining functionality that users expect.
  • Going down is not an option.
  • Remember this? We still need to make sure content is published as fast as it’s written. So long cache times are not acceptable.
  • The first thing we did was toss the database and j2ee app out the window.
  • We call this processing “baking”
  • Principle of least surprise. Files are located on disk where you’d expect them to be. No need to know SQL or hunt through database tables.
  • Indicate parts of the site that are controllable. Can turn off “more headlines”, the right rail, or the ticker at the top. Or better yet, it’s easy for us to put a notice at the top of every page, if we wanted.
  • Tried to make the backend as close to a cache as possible. Nothing gets into production without going through the CMS first.
  • Leveraging conditional GETs ensures that there is a small load on the origin, but pages are updated in cache as quickly as possible. It’s the right combination of expiry and validation.
  • Using IMS allows the origin to return only a small payload. Body content is not sent.
  • 75% of requests are for 304 Not Modified. The object was not transferred to the CDN.
  • HTML, JavaScript, CSS, and other text-based files compress very well. Be sure that you have this turned on between the origin and the CDN, and between the CDN and your users.
  • Set up your persistent connections to match those of the CDN. Keeping the TCP connection open reduces the latency required to set up and tear down TCP sessions. CBC uses 301 seconds, one second longer than Akamai’s. This ensures that the origin doesn’t tear down the connection prematurely. We leave management of the connection to the CDN.
  • We have a blanket 20-second TTL on all objects. Understand that at the end of the TTL the object is probably not expired from cache, just revalidated. If you know your content changes less frequently, or “freshness” is less of a priority, then set a higher TTL. Organize the file system based on TTL.
  • Store all your tunable configs at the origin (especially TTLs). This saves on propagation time when you have to change settings or TTLs. Updating an Apache configuration is quicker than pushing a CDN config change to 100,000 servers.
  • Personalization data is stored in cookies. The origin doesn’t dynamically generate pages for users who are signed in; they just fetch a pre-baked file/template based on cookie data. Dynamic content is assembled using AJAX.
  • We wanted to increase the amount of headroom we had, so in 2010 we refreshed our infrastructure hardware and added 3 more servers. The total cost was only $15,000.
  • So, how do we serve 800M requests a day from 9 web servers?
  • We rely heavily on the CDN to deliver content. Our cache offload rate for bandwidth is around 93%.
  • The number of hits is a little lower, at 80%.
  • We rely on this fact.
  • … so we take advantage of 304s
  • More clicks = more traffic. Ensure that the news or information your visitors are coming for can be found in one click or on the home page. Change your “website mode” to a lightweight mode. This will save you bandwidth and ensure your users can find relevant information right away.
  • Since your TTLs are controlled at the origin, there is no need to wait for the CDN to propagate settings. An extra 10 seconds reduces origin load a lot, while keeping content “fresh”.
  • Anything that relies on a sign-on, or cookies, should be turned off.
  • We’ve never been able to accurately predict what kind of load a specific event will generate. We usually have an office pool: the person closest to the peak hits/s or concurrent users wins! However, we do have some guidelines based on previous experience.
  • If you survived that spike in traffic during your last breaking news event, you’re most likely going to be OK for your scheduled event. Niche: this is especially true for sporting events (NHL playoffs, Olympics) and…
  • This is all great, but your site is mainly read operations! How do we handle write operations such as comments, etc.?
  • Commenting engine: Viafoura, Disqus. Polls/surveys: SurveyMonkey, Poll Daddy, Zoomerang. Analytics: Adobe Omniture, Google Analytics.
  • Think about your dynamic application. Is there a way to calculate a Last-Modified header?
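The last note asks whether a dynamic application can calculate a Last-Modified header. One common approach, sketched here with an invented record shape, is to take the newest timestamp among the data feeding the page:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def last_modified_header(records):
    """Derive Last-Modified from the freshest backing record.

    'records' is a hypothetical list of (id, updated_at) rows behind
    the page; the page is only as new as its newest ingredient.
    """
    newest = max(updated_at for _, updated_at in records)
    return format_datetime(newest, usegmt=True)

rows = [
    ("story-1", datetime(2013, 9, 12, 14, 0, tzinfo=timezone.utc)),
    ("story-2", datetime(2013, 9, 12, 15, 30, tzinfo=timezone.utc)),
]
print(last_modified_header(rows))  # Thu, 12 Sep 2013 15:30:00 GMT
```

With that header in place, even a database-backed endpoint can answer If-Modified-Since revalidations with a 304.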