BBC iPlayer: bigger, better, faster - as seen at AWSUKUG #11

BBC iPlayer: bigger, better, faster
“A year ago, BBC iPlayer could have died - but it didn’t. Instead, we built a bigger, better, faster iPlayer that provides a foundation for the future. Find out how this was achieved, what part AWS plays in iPlayer’s success, and what’s next for BBC online media.”

The presentation I gave to the 11th meeting of the AWS UK UG, 24/09/2014: http://www.meetup.com/AWSUGUK/events/194314272/

  1. bigger, better, faster Rachel Evans BBC Media Services rachel.evans@bbc.co.uk @rvedotrc BBC iPlayer: bigger, better, faster “A year ago, BBC iPlayer could have died - but it didn’t. Instead, we built a bigger, better, faster iPlayer that provides a foundation for the future. Find out how this was achieved, what part AWS plays in iPlayer’s success, and what’s next for BBC online media.” ! slow timing: +0:00
  2. On October 1st, 2013, iPlayer didn’t die “On October 1st 2013, iPlayer didn’t die. But it could have. The reason iPlayer is still alive is Video Factory, and Amazon Web Services plays a big part in Video Factory’s success. My name’s Rachel Evans. I’m a Principal Software Engineer in BBC Media Services. We created Video Factory.”
  3. “For the next 45 minutes or so, I’d like to tell you the Video Factory story. How it came to exist; what it is; how we made it; and a glimpse into Video Factory’s future. And of course, what part Amazon plays in the whole story. I’ll be glad to answer your questions at the end.”
  4. What is BBC Media Services? Video Factory was created by BBC Media Services. Who are we? Here’s our mission statement:
  5–7. “Publish all BBC AV media produced for IP platforms” “AV” means audio and video, includes radio and TV Includes iPlayer, iPlayer Radio, News, Sport Both live and on-demand Let’s have a look at what this means in practice, for iPlayer on-demand
  8–11. ✓ ✓ ⬅ ☹ Here’s iPlayer. Two programmes that we’ve published. One that we haven’t. If you see too many of these, it might mean we messed up.
  12–13. BBC Media Services “So this is the story of Video Factory, and this story, like many others, has a villain…” Yup, it’s ourselves, 5 years earlier.
  14–15. The shiny new system / The legacy system 2008 - Hosted in-house; used contracted 3rd party services for storage and transcode. Limited capacity. Bad architecture. Bad engineering. Ageing badly. Didn’t scale well. This system did not have a long-term future.
  16. Why 1st October 2013? In May 2012, the BBC decided not to renew that 3rd party contract. It was to be allowed to lapse, ending 30th September 2013. So this system, and therefore iPlayer, would die. “By the time the London 2012 Olympics was out of the way, we had just over 12 months to build a replacement.”
  17. Start small ! Think big We start planning for Video Factory. ! Elasticity - peaks of demand (18 concurrent regional news). We want to use AWS, so let’s try it.
  18. Start small ! Think big Spring 2012: iPlayer on Sky: first venture into the cloud. This proves that we can use the cloud for storing video. Tooling was in its infancy.
  19. Start small ! Think big Jan/Feb 2013: iBroadcast2. Now we’re not just storing video in the cloud, we’re transcoding it there too. Now we’ve proved that the cloud can handle video storage and video transcode, both of which will be fundamental parts of Video Factory.
  20. The origin of Video Factory ! ! ! ! ! ! “What does Video Factory actually do?”
  21. Video Factory in a nutshell Two things drive Video Factory
  22–24. source video + programme data = transcode, distribute, and publish programme data: - what programme is this; - where and when is it broadcast; - which platforms do we have publication rights for
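To make the “programme data” half concrete, here is a minimal sketch of the kind of per-broadcast record the pipeline needs. The field names are illustrative only, not Video Factory’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProgrammeData:
    """Hypothetical sketch of the metadata that drives publication."""
    programme_id: str        # what programme is this
    channel: str             # where it is broadcast
    start: datetime          # when the broadcast starts
    end: datetime            # when the broadcast ends
    platform_rights: list[str]  # which platforms we may publish to

def publishable(meta: ProgrammeData, platform: str) -> bool:
    # source video + programme data = transcode, distribute, and publish
    return platform in meta.platform_rights
```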
  25. Live Prerecorded Transcode Distribute Publish Once we’ve got the video, the rest of the chain is the same. ! “So let’s talk about live. Most iPlayer content is stuff that’s been on TV, so if we can capture and publish that, we’re set.”
  26. Mezz-to-VOD Mezz = Mezzanine video VOD = Video On Demand !
  27. The world’s largest public-service video recorder
  28–31. “In a secure location somewhere is part of the BBC’s TV broadcast chain.” Playout; Thomson Video Networks; RTP video over multicast UDP; includes timecodes. Capture and chunk. Mustn’t miss a single packet. Upload chunks to S3. Playout; broadcast end SNS message; find relevant chunks and join. Transcode with trim; distribute; publish. Talk about inaccurate trims and “Resilient broadcast-grade system”.
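The chunk-and-join flow lends itself to a short sketch. This is a guess at the shape of the code, not the real thing: it assumes chunks land at s3://<bucket>/<channel>/<timecode>.ts and that the broadcast-end SNS message carries the channel and the start/end timecodes.

```python
import json
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

def handle_broadcast_end(sns_message: str, bucket: str = "capture-chunks"):
    """On a broadcast-end notification, find the captured chunks that
    cover the programme and hand them on for joining and transcode."""
    event = json.loads(sns_message)
    prefix = f"{event['channel']}/"
    chunks = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            timecode = obj["Key"].removeprefix(prefix).removesuffix(".ts")
            # Assumes zero-padded, lexically sortable timecodes.
            if event["start"] <= timecode <= event["end"]:
                chunks.append(obj["Key"])
    # Join the chunks, then transcode with a trim to the exact times.
    return sorted(chunks)
```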
  32–33. What bits of AWS do we use? (nothing too exciting, actually) For legal reasons, it’s all in the EU, hence eu-west-1. EC2 compute, VPC, ELB, Auto Scaling S3, SQS, SNS, SimpleDB CloudWatch, CloudFormation
  34. but here’s the fun part…
  35. video is big
  36–40. SD video mpeg-ts / avc 720 x 576 9.4Mbps 25fps / mpeg audio 256Kbps: 1.3MB/sec/channel = 109 GB/day/channel x 21 channels = 2.3 TB/day
  41–45. HD video mpeg-ts / avc 1920 x 1080 ~38Mbps 25fps / mpeg audio 256Kbps: 4.2MB/sec/channel = 365 GB/day/channel x 8 channels = 2.9 TB/day
  46–48. 2.3 TB/day + 2.9 TB/day = 5.2 TB/day, per copy
  49–50. 5.2 TB/day/copy x 2 locations x 2 copies “each channel is captured in 2 physical locations” “at each location we capture 2 copies”
  51. 21TB per day “… for a total of 21TB per day. Handling this much data wouldn’t have been possible on our previous platform, but with Amazon Web Services, Video Factory is able to handle this much data, all day, every day.”
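Those totals follow directly from the per-channel rates. A quick check of the arithmetic (the deck rounds its intermediate figures slightly differently, but the total comes out the same):

```python
SECONDS_PER_DAY = 86_400

sd = 1.3 * SECONDS_PER_DAY * 21 / 1e6  # MB/s -> TB/day over 21 SD channels: ~2.36
hd = 4.2 * SECONDS_PER_DAY * 8 / 1e6   # MB/s -> TB/day over 8 HD channels:  ~2.90
per_copy = sd + hd                     # ~5.26 TB/day, per copy
total = per_copy * 2 * 2               # x 2 locations x 2 copies per location
print(f"{total:.0f} TB/day")           # -> 21 TB/day
```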
  52. The origin of Video Factory ! The world’s biggest public service video recorder ! ! ! “I’d like to talk now about the practices and tooling we have in place that make this possible.”
  53. Tooling Continuous Integration Continuous Delivery ChaosMonkey “A tool we call stack-fetcher” Cosmos
  54–57. Deployment weekly averages (total for 10 weeks, divided by 10) int: 131 test: 37 live: 34
  58–59. [Bar chart] Live deployments by day of week, Monday to Sunday (total for 10 weeks, divided by 10). LIVE only. Why the spike on Monday? Full build, etc. Certificate renewal. More balanced if you exclude cert renewal: 3.5 deployment days per week vs 5 days per week. No live deployments on Sunday :-)
  60. “So, with Mezz-to-VOD in place, iPlayer is saved. We record whatever was broadcast on TV, and we publish it. So now, at the click of a button, you can enjoy world-class content, like this”
  61–62. Oh. Did you spot the “big, big mistake, really huge” there? Broadcast isn’t a clean feed.
  63–64. Channel logo + credit squeeze with audio overdub.
  65–66. Wrong logo (should be non-animated BBC logo) Animated channel logo Subtitles marker Ident / content cross-fade
  67. File-Based Delivery What is FBD - e.g. EastEnders. Better than M2V: higher res, cleaner than live. Delivered before broadcast. Build up an archive.
  68–71. Size of archive: 36,000 files, 23,000 hours, 540 TB. The archive is valuable, so:
  72. Encrypting the archive Created queue-based system to perform encryption 14,000 files (around 210TB) to encrypt Scaled up: maxed out EC2 instances, and raised spot price
  73. populated queue with 14,000 messages scaled up to lots of instances got to 440 instances and ran out!
  74. [Chart] c1.medium spot price in eu-west-1
  75–76. scaled down to 400 instances to free some up down to 20 overnight up to 400 again in the day draining is hard
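The encryption run is a classic competing-consumers pattern. A minimal sketch of one worker, with an invented queue URL and a hypothetical encrypt_file helper; the real system’s details will differ:

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/encrypt-jobs"

def worker():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,       # long poll
            VisibilityTimeout=3600,   # long enough to encrypt a large file
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            encrypt_file(job["bucket"], job["key"])  # hypothetical helper
            # Delete only on success: a terminated spot instance just lets
            # the message reappear, which is what makes scaling the fleet
            # down ("draining") safe, if slow.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```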
  77. Reducing costs Use case: ingest once, large files Once ingested and used, files are sometimes used again, but not in a hurry “So there are two obvious solutions to this…”
  78–79. 1. Glacier or 2. Glacier “Here’s one possible solution” “And here’s the other”
  80–81. /ˈɡlæsiə/ or /ˈɡleɪsiə/ or /ˈɡleɪʃər/ I don’t mean this: Gla-sier Glay-sier Glay-sher (and I make no apology for freely switching between these) I mean this:
  82–83. Glacier (the S3 storage class) or Glacier (the service in its own right) You have to pick one. They’re incompatible. For our case, the S3 mode was by far the more convenient. But it lacks SNS notifications. So we needed a component to manage this, to make Glacier invisible to client components.
  84. Video Store encapsulates the Glacier and encryption logic
  85. ! The store interface
  86. Background encryption
  87. Fetching (fast - no decryption, no Glacier)
  88. Cache expires…
  89. Fetching: - retrieve from Glacier, with poll - slow decrypt - then it becomes a fast fetch
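A rough sketch of how the two fetch paths might look against the S3 API (using today’s boto3; bucket, key and the decryption step are placeholders). The slow path is exactly what the Video Store hides from its clients:

```python
import time
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

def fetch(bucket: str, key: str) -> bytes:
    try:
        # Fast path: the object still has a readable copy in S3.
        return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    except s3.exceptions.ClientError as e:
        if e.response["Error"]["Code"] != "InvalidObjectState":
            raise
    # Slow path: the object has been archived to the Glacier storage class.
    s3.restore_object(Bucket=bucket, Key=key, RestoreRequest={"Days": 1})
    while True:
        head = s3.head_object(Bucket=bucket, Key=key)
        if 'ongoing-request="false"' in head.get("Restore", ""):
            break                 # temporary copy is back in S3
        time.sleep(300)           # restores take hours, not seconds
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```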
  90. the list interface ! scaling - three separate ASGs all updated via the same stack parameter
  91. The origin of Video Factory ! The world’s biggest public service video recorder ! File-based delivery ! “Video Factory provides video-on-demand publication for iPlayer via Mezz-to-VOD and File-Based Delivery. It delivers better quality video, faster, more reliably, and on a more scalable and more maintainable system. But let’s think bigger. Where do we go next? Here’s a short video from a speech given a few months ago by our Director-General.”
  92. I’ve removed the video from this presentation to save space. You can watch the video (in the context of Lord Hall’s speech) here: https://www.youtube.com/watch?feature=player_detailpage&v=95SXJYkoWbM#t=749
  93. “Imagine the possibilities. What might iPlayer become? Where is it going? Where is Video Factory going?”
  94. Simulcast (“Watch Live”) Bring the handling of live IP audio and video in-house Build on the success of Video Factory VOD Mostly not in the cloud, so only a very brief overview…
  95–99. Only the packager is in the cloud Explain what HLS (Apple) is Explain what HDS (Adobe) and Smooth (Microsoft) are Explain rewind window
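For readers who don’t know HLS: it delivers live video as a rolling playlist of short media segments, and the “rewind window” is simply how far back the playlist still advertises segments (HDS and Smooth Streaming are Adobe’s and Microsoft’s analogues). A toy illustration of a packager’s sliding live playlist, with invented segment naming and durations:

```python
SEGMENT_SECONDS = 8
WINDOW_SECONDS = 2 * 60 * 60   # how far back a viewer can rewind

def live_playlist(newest_seq: int) -> str:
    """Build an HLS live media playlist covering the rewind window."""
    n = WINDOW_SECONDS // SEGMENT_SECONDS   # segments kept in the window
    first = max(0, newest_seq - n + 1)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{SEGMENT_SECONDS}",
        f"#EXT-X-MEDIA-SEQUENCE:{first}",
    ]
    for seq in range(first, newest_seq + 1):
        lines.append(f"#EXTINF:{SEGMENT_SECONDS:.1f},")
        lines.append(f"segment-{seq}.ts")
    # No #EXT-X-ENDLIST: the stream is live, so the playlist keeps sliding.
    return "\n".join(lines) + "\n"
```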
  100. Converging Live and On-Demand
  101–102. Here’s the latter half of the Simulcast chain.
  103. And here’s the Mezz-to-VOD chain, fed from exactly the same video feed. But we’re transcoding again, and always late. We can do better…
  104. L2V is triggered by the same event as Mezz-to-VOD. L2V is an order of magnitude faster. However, L2V only does some formats (albeit they’re important ones), and doesn’t trim accurately. So we deliberately allow L2V and M2V to run in parallel; L2V will win, M2V is better.
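The “let both run” rule reduces to a small ordering decision. A sketch with invented source names: the faster, lower-fidelity L2V rendition publishes first, and the slower, better M2V rendition supersedes it when it arrives, never the reverse.

```python
from typing import Optional

# Higher number = better source; L2V finishes first, M2V finishes later.
QUALITY = {"live-to-vod": 1, "mezz-to-vod": 2}

def should_replace(published: Optional[str], incoming: str) -> bool:
    """Publish if nothing is up yet, or if the incoming rendition comes
    from a better source than the one already published."""
    if published is None:
        return True
    return QUALITY[incoming] > QUALITY[published]
```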
  105–108. Audio Factory like Video Factory, but without the pictures “At BBC Media Services we have to handle audio - that is, radio - as well as video, so we’re creating Audio Factory.” “How does Audio Factory work?”
  109. The origin of Video Factory ! The world’s biggest public service video recorder ! File-based delivery ! Live & Audio “So as well as video on demand; there’s the simulcast chain providing live video; and live-to-VOD bridging the two together, so that programmes are playable as soon as possible. Audio - including BBC Radio - is handled almost identically to video, so at last we’re handling audio and video, live and on-demand, all in-house, using a consistent, proven set of technologies.” ! “For the final section today I’d like to talk briefly about the importance of data.”
  110. Show me the data Data is key to understanding what happened, and to making decisions.
  111. Monitoring SQS: CloudWatch alarms in stacks Other CloudWatch alarms (e.g. ELBs, EC2 network, EC2 CPU) iSpy and Splunk
  112–113. iSpy and Splunk Splunk is a 3rd party product for searching, monitoring, and analysing data. iSpy is the set of libraries and protocols we use to get the data from our applications, into Splunk. Via SNS and SQS. We use Splunk for: debugging; ad-hoc and on-demand reporting; monitoring; alerting.
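The iSpy idea, sketched: applications publish structured events to an SNS topic, a queue subscribes, and a forwarder indexes them into Splunk. The topic ARN and event fields below are invented, not the real iSpy protocol.

```python
import json
import time
import boto3

sns = boto3.client("sns", region_name="eu-west-1")
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:ispy-events"

def emit(event_type: str, **fields):
    """Publish a structured event; downstream, SQS + a forwarder
    feed these into Splunk for search, reporting and alerting."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"type": event_type, "time": time.time(), **fields}),
    )

# e.g. emit("transcode.completed", programme="b04xyz", profile="hd")
```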
  114–116. Collecting interesting data • Deployments • ChaosMonkey terminations • AutoScaling activity • CloudWatch alarm state changes • CloudTrail • CloudFormation stack changes ➜ git repository The list: all goes into Splunk.
  117–120. APIs Data Decisions about cost Having the data to support decisions about cost, combined with the power and responsibility to act on those decisions, adds a whole extra dimension to software engineering.
  121–122. MMXIV “It took us just over a year to build the basic video-on-demand features of Video Factory that we had to build, to prevent iPlayer from dying. It was a completely new solution: new architecture, new code, new platform. We chose the cloud because it was more flexible, more reliable, more scalable. We chose Amazon because it was a mature cloud platform that provided the right technical and support services that we needed.”
  123. MMXIV “We moved to Continuous Integration and Continuous Delivery so that the benefits of higher quality and faster turnaround could be enjoyed by everyone - by our engineers, by the product stakeholders, by the licence-fee-paying audience. Building Mezz-to-VOD to avoid killing iPlayer was just the beginning. I’m very excited about seeing the future of Video Factory unfold on Amazon, and I hope you enjoy using iPlayer even more now that you’ve heard the story behind it. Thank you.”
  124–126. Questions Rachel Evans rachel.evans@bbc.co.uk @rvedotrc Media Services We’re hiring!
  127. Parts of AWS used by Video Factory EC2 & VPC AutoScaling ELB & Route53 IAM Users & Roles S3 & EBS SQS & SNS SimpleDB & RDS CloudWatch CloudFormation
