AWS re:Invent 2013 Scalable Media Processing in the Cloud

Presentation from AWS re:Invent 2013. See session video here: http://www.youtube.com/watch?v=MjZdiDotRU8
Presentation is in two parts: (1) Introduction to moving workloads to the cloud, (2) deep dive on how the BBC moved their playout to the cloud.

Speaker notes
  • Media here refers to video and audio content. Maybe you’re a media and entertainment company, or you build apps and websites that work with user-generated content.
  • Want to get a feel for the audience. Raise your hand if you do media processing in the cloud today. Raise your hand if you’re a developer. OK, for those of you who are developers, have a nap and Phil will wake you up with a video in a few minutes.
  • Start by talking about media workflows. Main point is there are many workflows. Use media workflows to go from what’s on the left to what’s on the right. The steps themselves are generally pretty straightforward. Industry trends that are making workflows more complex: More content: at the pro end, look at all the content on the left; on the consumer end, everyone is carrying around a 1080p camcorder. And the more content there is, the greater the opportunity to monetize it. Bigger content: the industry is moving to some combination of more pixels, faster pixels, and better pixels. More pixels: 4K and beyond (4x the pixels of 1080p). Faster pixels: higher frame rates; 48 fps is 2x the current cinema frame rate. Better pixels: higher dynamic range, brighter pixels, increased bit depth. More processing: the amount of processing is going up, not down. At the high end, whether it is a commercial, a TV show, or a movie, most shows contain visual effects. Even in corporate video, color correction is becoming a standard part of the workflow. And at the consumer level, all those Instagram-like filters require processing. More output formats: not just renditions based on devices but also versions. One senior industry figure recently told me that a piece of finished content will have been converted 1000 times! So all of these trends have an impact on workflows, especially when you factor in constrained budgets and timeframes.
  • To give you context for what follows in Phil’s session, I thought I’d cover where AWS fits and then some approaches we’ve seen for doing media processing at scale in the cloud. As you know, AWS provides infrastructure services: compute, networking, database, storage, delivery, and so on. We also provide application services and deployment and management services. Using these services as your “software-defined datacenter”, you can build media processing workflows. Typical operations in a media workflow would run on top of the AWS services. These operations could be provided by software that you’ve developed, or they might be from another vendor like Aspera for ingest or Tektronix for video QC. On top of all that you’d have media applications – perhaps an online video platform, a production management application, a digital dailies system, or visual effects. So that’s where AWS fits. Now let’s look at some approaches for doing media processing on AWS.
  • A useful way to think about any kind of processing in the cloud is that there are 3 phases or approaches.
  • The first phase is simply taking what you do today and deploying it on AWS. This is the way a lot of people get started.
  • You take your on-premises deployment on the left and run it on EC2. Your media processing operation runs on an operating system and storage, both of which are provided by EC2. You can spin up multiple instances, which gives you scale and/or redundancy. But let’s look closer at this “lift and shift” approach.
  • Let’s break open that media processing operation black box and see what’s inside. What we find are discrete operations, only one of which is the actual media processing operation – for example, transcoding, scaling, or feature extraction. So is there perhaps an opportunity to break apart the black box and derive some benefit?
  • That brings us to phase 2, which is about refactoring – or breaking things apart and putting them back together again in a different way – and optimizing your media processing operation. By doing this you might find ways to better use some of the features of AWS because we give you a lot of fantastic services for doing things like automatically scaling or distributing jobs or storing objects.
  • The cornerstone of phase 2 is to break apart monolithic operations. In our black box, we had these operations. Do they all need to happen inside one logical unit? Probably not. Are there benefits to breaking them apart? Absolutely. Why have each EC2 instance do its own ingest? Why have a workflow that is an island?
  • So here’s a refactored example. What hasn’t changed is that we have our media processing operation – but only the operation itself – taking place on EC2 instances. But now we’re using S3 to store the input content and the output content. Maybe we’ve used Aspera or some other ingest technology to get the content there. Then we’re using Simple Workflow to manage the workflow operations across the various EC2 instances, and we’re using APIs to have each element talk to the others. This lets us use the scale of S3 and SWF so that you don’t need to worry about it. Also, instead of having a handful of EC2 instances running the monolithic application, we can have a fleet of instances running the essential media processing operation – decoupled from the rest of the workflow – and the external workflow engine will send the media processing job to the appropriate instance. So if an instance has a problem, the job won’t go there, giving you better resiliency. If an instance dies, another one can spin up automatically, giving you redundancy.
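As an illustration of this decoupling, here is a minimal, self-contained Java sketch of the dispatch idea: a coordinator hands jobs only to healthy workers and retains anything it could not deliver, so a job is never lost. All class and method names are invented for illustration; a real system would use SWF and SQS rather than in-memory lists.

```java
import java.util.*;

// Sketch of decoupled job dispatch: a coordinator hands media jobs only to
// healthy workers, so a failing instance never receives work.
public class WorkflowSketch {
    static class Worker {
        final String id;
        boolean healthy = true;
        final List<String> completed = new ArrayList<>();
        Worker(String id) { this.id = id; }
    }

    // Route each job to the first healthy worker; return jobs that could not run.
    static List<String> dispatch(List<String> jobs, List<Worker> fleet) {
        List<String> undelivered = new ArrayList<>();
        for (String job : jobs) {
            Optional<Worker> w = fleet.stream().filter(x -> x.healthy).findFirst();
            if (w.isPresent()) w.get().completed.add(job);
            else undelivered.add(job); // job is retained, never lost
        }
        return undelivered;
    }

    public static void main(String[] args) {
        Worker a = new Worker("i-a"), b = new Worker("i-b");
        a.healthy = false; // simulate a bad instance: jobs route around it
        List<String> leftOver = dispatch(List.of("transcode-1", "transcode-2"), List.of(a, b));
        System.out.println("worker i-b did: " + b.completed + ", undelivered: " + leftOver);
    }
}
```

The point of the sketch is the separation of concerns: the workers only transcode, and routing and retention live outside them.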
  • The third phase builds on the second phase and decomposes your architecture still further. You’re now at the point where you are primarily writing or wrapping very atomic pieces of code that perform specific operations, and you leverage the AWS infrastructure for everything else.
  • Some ways to do this: decouple everything. You want to understand which parts of the architecture need to know about the implementation details of another part; chances are that they do not. You also want to make sure that if an operation fails somewhere, the job itself does not get lost, and this is where workflow management and queues come in. You also want to design your components so that when you instantiate them, they figure out what they are supposed to do. For example, you might have a media processing worker that starts up and queries what kind of instance type it is running on, so that it knows how much work it can do or whether there are additional capabilities it can advertise to the rest of the system. This is a good time to think about how you are managing the attributes that you really care about in your system. For capacity: where are the bottlenecks, and what can you do when you need to overcome them? For redundancy: how do you make sure that each of your components is redundant? Is latency a concern? For many media processing operations it probably is, so how can you manage it, reduce it, and make it predictable? Are you architecting security into every component and layer of your system? So that concludes my brief overview of approaches to running media processing workloads on AWS. Now I’d like to welcome Phil Cluff, the team lead for taking the BBC iPlayer video service into the cloud. He’s going to show you how they moved their broadcast playout-to-VOD system into AWS to give them scalability, reliability, and elasticity.
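The “worker figures out what it is supposed to do” idea can be sketched like this. On EC2 the instance type would come from instance metadata (http://169.254.169.254/latest/meta-data/instance-type); the lookup table and job counts below are assumptions for illustration, not real capacity figures.

```java
import java.util.Map;

// Sketch of a worker that discovers its own capacity at startup rather than
// being configured per deployment. The instance types are real EC2 names,
// but the concurrency numbers are purely illustrative.
public class SelfConfiguringWorker {
    private static final Map<String, Integer> CONCURRENT_JOBS = Map.of(
        "m1.small", 1,
        "c4.4xlarge", 8,
        "g2.2xlarge", 4   // a GPU instance might also advertise a "gpu" capability
    );

    // How many jobs should this instance advertise it can run?
    static int capacityFor(String instanceType) {
        return CONCURRENT_JOBS.getOrDefault(instanceType, 1); // safe default
    }

    public static void main(String[] args) {
        System.out.println("c4.4xlarge capacity: " + capacityFor("c4.4xlarge"));
    }
}
```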
  • Introduction: Phil Cluff, Principal Software Engineer & Team Lead @ BBC Media Services. Been with the BBC for 3½ years, focused on transcode architectures, message-oriented middleware, and reliable, distributed systems in the cloud! I’m going to talk to you about BBC iPlayer and our journey into the cloud.
  • Hopefully you’ve all heard of the BBC, but you may not have all heard of iPlayer. So what is BBC iPlayer? The UK online population is about 40 million, which is about the size of the state of California.
  • Now we’ll watch a short video produced by the BBC Director General, Tony Hall, which shows you where iPlayer has come from, and where we see it going in the future.
  • As I said, I’m here to talk to you about Video Factory. So what is Video Factory? Read slide, plus: “We actually started building Video Factory 1 year ago this week – I was putting together the final designs for our transcode architecture before I flew out to re:Invent this time last year.”
  • Old: designed with a very ambitious throughput in mind 5 years ago, but the industry has moved on – new devices, delivery methods, throughput increases. New: full control to deploy & manage our applications, and change quickly in a changing marketplace.
  • Regional OPTs: 18 channels, all on at once, 6 days a week. Want to transcode them all at the same time, but not have those encoders hanging around idle at other times. Previously it has taken 9–12 hours for the queue to move through our system. It’s news content – people want it while it’s still relevant. The new system is designed to cope with these (and larger) throughput spikes.
  • Be really clear on the Mezzanine definition since the next 4 slides depend on it. Mention that Mez video capture uses classic broadcast technologies. Make note of the “time-addressable media store”.
  • We’re going to look at two areas in detail – Mez capture & transcode abstraction.
  • On-premises encoders produce MPEG-2 transport streams from SDI onto RTP multicast. Capture the RTP and split it into chunks. Upload the chunks to an S3 bucket. Reconstruct the chunks only when required for transcode.
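A minimal sketch of the chunk-and-reconstruct idea described above. The chunk size and payload are illustrative; the real system chunks an MPEG-2 transport stream and stores the chunks in S3 rather than in memory.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of chunked capture: split a continuous stream into fixed-size
// chunks (as the RTP chunker does before uploading to S3), then
// reconstruct the original only when a transcode needs it.
public class ChunkSketch {
    static List<byte[]> split(byte[] stream, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < stream.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, stream.length);
            chunks.add(Arrays.copyOfRange(stream, off, end));
        }
        return chunks;
    }

    static byte[] concatenate(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        chunks.forEach(c -> out.write(c, 0, c.length)); // chunk order matters
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] stream = "mpeg2-transport-stream-bytes".getBytes();
        List<byte[]> chunks = split(stream, 5);
        System.out.println(chunks.size() + " chunks, round-trip ok: "
            + Arrays.equals(stream, concatenate(chunks)));
    }
}
```

The value of the split is that capture can proceed continuously and cheaply, while the expensive reconstruction happens only on demand.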
  • Vendor lock-in is particularly important in SaaS models. I suggest you always have several options.
  • So let’s take a look inside our transcode abstraction layer
  • So let’s take a look inside an example transcode backend and think about how we might build one.
  • Mention that the transaction runs as long as the transcode – Camel renews it.
  • Give a one-sentence summary of Camel. Give an overview of BDD, TDD & Cucumber. Why is continuous deployment important? What happens if a deployment goes pear-shaped?
  • We use several; the key concept in all of these is that you never lose a message.
  • The message doesn’t unmarshal to the JAXB object it should, e.g. it’s not XML, or it’s a different type of message. Or we could unmarshal the object, but it doesn’t meet our validation rules, e.g. source must not be null. Wrapped in a message wrapper which contains the original message (escaped) and the exception message. Never retried; always requires developer-level intervention. Suggests a component version mismatch. Very rare in production systems; sometimes caused by humans manually crafting messages. Implemented as an exception handler on the route builder.
  • We tried processing the message a number of times, and something went wrong each time that we weren’t expecting, e.g. a dependent system is down, network connectivity issues, or (frequently) a “completely unexpected code path”. The message is an exact copy of the input message and can be replayed directly onto the input queue; more detail about what caused it can be found in the eventing framework (Splunk). Retried several times before being put on the DLQ – 3 to 5 is common. Requires 24/7 operations-level intervention, usually to fix the dependent system and then replay the messages. Can be common, even in production systems, but suggests you may need to improve dependent systems or increase your retry count. Implemented as a bean in the route builder for SQS: check the “approximate delivery count” before attempting to do any processing on a message, and redirect the message to the DLQ if necessary. Or broker-side (e.g. ActiveMQ).
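The “check the approximate delivery count before processing” step can be sketched as a tiny routing function. With SQS the count would come from the ApproximateReceiveCount message attribute; here it is just an int parameter, and the limit of 3 is one of the common 3–5 values mentioned above.

```java
// Sketch of the dead-letter-queue check: before doing any work, look at how
// many times the message has been delivered and divert it to the DLQ once
// the retry budget is spent. The message itself is untouched, so it can be
// replayed directly onto the input queue later.
public class DlqCheck {
    static final int MAX_DELIVERIES = 3; // assumed retry budget

    // Returns the destination for the message: "process" or "dlq".
    static String route(int approximateDeliveryCount) {
        return approximateDeliveryCount > MAX_DELIVERIES ? "dlq" : "process";
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++)
            System.out.println("delivery " + n + " -> " + route(n));
    }
}
```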
  • Something I was expecting to go wrong went wrong, e.g. the state of a dependent system wasn’t what was required, or a command-line tool I use returned non-zero but I think the tool is likely dependable (i.e. a retry won’t help). Wrapped in a message wrapper which contains the original message (escaped) and the exception message. Requires some level of knowledge of the system to be retried: 24/7 operations-level intervention with a runbook, or second-line support. We have a console which unwraps the message and replays it. These often evolve from understanding the causes of DLQ’d messages. Implemented as an exception handler on the route builder.

Transcript

  • 1. Scalable Media Processing Phil Cluff, British Broadcasting Corporation David Sayed, Amazon Web Services November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Agenda • Media workflows • Where AWS fits • Cloud media processing approaches • BBC iPlayer in the cloud
  • 3. Media Workflows Archive Featurettes Networks Interviews Media Workflow 2D Movie 3D Movie Archive Materials Stills Theatrical DVD/BD Media Workflow Media Workflow Online MSOs Mobile Apps
  • 4. Where AWS Fits Into Media Processing Analytics and Monetization Amazon Web Services Playback Track Auth. Protect Package QC Process Index Ingest Media Asset Management
  • 5. Media Processing Approaches 3 Phases
  • 6. Cloud Media Processing Approaches Phase 1: Lift processing from the premises and shift to the cloud
  • 7. Lift and Shift Media Processing Operation OS Media Processing Operation OS Storage EC2 Storage Media Processing Operation OS EC2 Storage
  • 8. The Problem with Lift and Shift Monolithic Media Processing Operation OS EC2 Storage Ingest Operation Postprocessing Export Workflow Media Processing Operation Parameters
  • 9. Cloud Media Processing Approaches: Phase 2 Phase 1: Lift processing from the premises and shift to the cloud Phase 2: Refactor and optimize to leverage cloud resources
  • 10. Refactor and Optimization Opportunities “Deconstruct monolithic media processing operations” – Ingest – Atomic media processing operation – Post-processing – Export – Workflow – Parameters
  • 11. Refactoring and Optimization Example EBS EC2 EBS EC2 EBS API Calls EC2 Source S3 Bucket SWF Output S3 Bucket
  • 12. Cloud Media Processing Approaches Phase 1: Lift processing from the premises and shift to the cloud Phase 2: Refactor and optimize to leverage cloud resources Phase 3: Decomposed, modular cloud-native architecture
  • 13. Decomposition and Modularization Ideas for Media Processing • Decouple *everything* that is not part of the atomic media processing operation • Use managed services where possible for workflow, queues, databases, etc. • Manage – Capacity – Redundancy – Latency – Security
  • 14. in the Cloud AKA “Video Factory” Phil Cluff Principal Software Engineer & Team Lead BBC Media Services
  • 15. Sources: BBC iPlayer Performance Pack August 2013 http://www.bbc.co.uk/blogs/internet/posts/Video-Factory • The UK’s biggest video & audio on-demand service – And it’s free! • Over 7 million requests every day – ~2% of overall consumption of BBC output • Over 500 unique hours of content every week – Available immediately after broadcast, for at least 7 days • Available on over 1000 devices including – PC, iOS, Android, Windows Phone, Smart TVs, Cable Boxes… • Both streaming and download (iOS, Android, PC) • 20 million app downloads to date
  • 16. Video “Where Next?”
  • 17. What Is Video Factory? • Complete in-house rebuild of ingest, transcode, and delivery workflows for BBC iPlayer • Scalable, message-driven cloud-based architecture • The result of 1 year of development by ~18 engineers
  • 18. And here they are!
  • 19. Why Did We Build Video Factory? • Old system – Monolithic – Slow – Couldn’t cope with spikes – Mixed ownership with third party • Video Factory – Highly scalable, reliable – Completely elastic transcode resource – Complete ownership
  • 20. Why Use the Cloud? • Background of 6 channels, spikes up to 24 channels, 6 days a week • A perfect pattern for an elastic architecture Off-Air Transcode Requests for 1 week
  • 21. Video Factory – Architecture • Entirely message driven – Amazon Simple Queuing Service (SQS) • Some Amazon Simple Notification Service (SNS) – We use lots of classic message patterns • ~20 small components – Singular responsibility – “Do one thing, and do it well” • Share libraries if components do things that are alike • Control bloat – Components have contracts of behavior • Easy to test
  • 22. Video Factory – Workflow SDI Broadcast Video Feed Amazon Elastic Transcoder x 24 Broadcast Encoder SMPTE Timecode RTP Chunker Playout Video Amazon S3 Mezzanine Time Addressable Media Store Mezzanine Video Capture Mezzanine Elemental Cloud Live Ingest Logic Transcoded Video Metadata Playout Data Feed Transcode Abstraction Layer DRM QC Editorial Clipping MAM Amazon S3 Distribution Renditions
  • 23. Detail • Mezzanine video capture • Transcode abstraction • Eventing demonstration
  • 24. Mezzanine Video Capture
  • 25. Mezzanine Capture SDI Broadcast Video Feed x 24 3 GB HD/1 GB SD SMPTE Timecode Broadcast Grade Encoder MPEG2 Transport Stream (H.264) on RTP Multicast 30 MB HD/10 MB SD RTP Chunker MPEG2 Transport Stream (H.264) Chunks Chunk Concatenator Chunk Uploader Amazon S3 Mezzanine Chunks Control Messages Amazon S3 Mezzanine
  • 26. Concatenating Chunks • Build file using Amazon S3 multipart requests – 10 GB Mezzanine file constructed in under 10 seconds • Amazon S3 multipart APIs are very helpful – Component only makes REST API calls • Small instances; still gives very high performance • Be careful – Amazon S3 isn’t immediately consistent when dealing with multipart built files – Mitigated with rollback logic in message-based applications
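The shape of this technique can be sketched with an in-memory toy that mimics registering existing chunks as parts and then completing the upload. This is not the real S3 API; in practice the component issues UploadPartCopy and CompleteMultipartUpload REST calls, which is why a small instance can "build" a 10 GB file in seconds without ever streaming the bytes itself.

```java
import java.util.*;

// Toy model of multipart concatenation: the component never streams the
// large file; it only tells the store which existing chunk goes in which
// part slot, then "completes" the upload. Chunk key names are invented.
public class MultipartSketch {
    private final Map<Integer, String> parts = new TreeMap<>(); // partNumber -> chunk key

    void uploadPartCopy(int partNumber, String sourceChunkKey) {
        parts.put(partNumber, sourceChunkKey); // a single REST call against real S3
    }

    // Completing the upload yields the ordered list of chunks forming the file.
    List<String> complete() {
        return new ArrayList<>(parts.values()); // TreeMap keeps part-number order
    }

    public static void main(String[] args) {
        MultipartSketch upload = new MultipartSketch();
        upload.uploadPartCopy(2, "chunk-0002.ts");
        upload.uploadPartCopy(1, "chunk-0001.ts"); // out-of-order registration is fine
        System.out.println(upload.complete());
    }
}
```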
  • 27. By Numbers – Mezzanine Capture • 24 channels – 6 HD, 18 SD – 16 TB of Mezzanine data every day per capture • 200,000 chunks every day – And Amazon S3 has never lost one – That’s ~2 (UK) billion RTP packets every day… per capture • Broadcast grade resiliency – Several data centers / 2 copies each
  • 28. Transcode Abstraction
  • 29. Transcode Abstraction • Abstract away from single supplier – Avoid vendor lock-in – Choose suppliers based on performance, quality, and broadcaster-friendly feature sets – BBC: Elemental Cloud (GPU), Amazon Elastic Transcoder, in-house for subtitles • Smart routing & smart bundling – Save money on non–time-critical transcode – Save time & money by bundling together “like” outputs • Hybrid cloud friendly – Route a baseline of transcode to local encoders, and spike to cloud • Who has the next game changer?
  • 30. Transcode Abstraction Subtitle Extraction Backend Transcode Request SQS Transcode Router SQS Amazon Elastic Transcoder Backend Amazon Elastic Transcoder REST Elemental Backend Elemental Cloud Amazon S3 Mezzanine Amazon S3 Distribution Renditions
  • 31. Transcode Abstraction - Future Subtitle Extraction Backend Transcode Request SQS Transcode Router SQS Amazon Elastic Transcoder Backend Amazon Elastic Transcoder REST Elemental Backend Elemental Cloud Unknown Future Backend X ? Amazon S3 Mezzanine Amazon S3 Distribution Renditions
  • 32. Example – A Simple Elastic Transcoder Backend Amazon Elastic Transcoder XML Transcode Request Get Message from Queue POST Unmarshal and Validate Message Initialize Transcode SQS Message Transaction POST (Via SNS) XML Transcode Status Message Wait for SNS Callback over HTTP
  • 33. Example – Add Error Handling Amazon Elastic Transcoder XML Transcode Request Get Message from Queue Dead Letter Queue POST Unmarshal and Validate Message Initialize Transcode Bad Message Queue SQS Message Transaction POST (Via SNS) XML Transcode Status Message Wait for SNS Callback over HTTP Fail Queue
  • 34. Example – Add Monitoring Eventing Amazon Elastic Transcoder XML Transcode Request POST Get Message from Queue Unmarshal and Validate Message Monitoring Events Monitoring Events Dead Letter Queue Initialize Transcode Monitoring Events Bad Message Queue SQS Message Transaction POST (Via SNS) XML Transcode Status Message Wait for SNS Callback over HTTP Monitoring Events Fail Queue
  • 35. BBC eventing framework • Key-value pairs pushed into Splunk – Business-level events, e.g.: • Message consumed • Transcode started – System-level events, e.g.: • HTTP call returned status 404 • Application’s heap size • Unhandled exception • Fixed model for “context” data – Identifiable workflows, grouping of events; transactions – Saves us a LOT of time diagnosing failures
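A sketch of what pushing key-value events with shared "context" data might look like. The field names are invented for illustration, not the BBC's actual schema; the point is that every event carries an identifier that lets Splunk group all events for one workflow.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of key-value eventing: business- and system-level events are
// flattened to "key=value" pairs with a shared workflow id, so events for
// one job can be grouped when diagnosing failures.
public class EventSketch {
    static String format(String workflowId, String event, Map<String, String> fields) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("workflowId", workflowId); // the shared "context" data
        all.put("event", event);
        all.putAll(fields);
        return all.entrySet().stream()
                  .map(e -> e.getKey() + "=" + e.getValue())
                  .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(format("wf-42", "transcode_started",
                Map.of("backend", "elastic-transcoder")));
    }
}
```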
  • 36. Component Development – General Development & Architecture • Java applications – Run inside Apache Tomcat on m1.small EC2 instances – Run at least 3 of everything – Autoscale on queue depth • Built on top of the Apache Camel framework – A platform for building message-driven applications – Reliable, well-tested SQS backend – Camel route builders: Java DSL, full of messaging patterns • Developed with Behavior-Driven Development (BDD) & Test-Driven Development (TDD) – Cucumber • Deployed continuously – Many times a day, 5 days a week
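The "autoscale on queue depth" policy, combined with "run at least 3 of everything", can be sketched as a desired-capacity function. The jobs-per-instance ratio below is an assumption for illustration, not a BBC figure.

```java
// Sketch of queue-depth autoscaling: desired instance count grows with the
// backlog, with a floor of 3 so every component stays redundant even when
// the queue is empty.
public class AutoscaleSketch {
    static final int MIN_INSTANCES = 3;      // "run at least 3 of everything"
    static final int JOBS_PER_INSTANCE = 10; // assumed target backlog per worker

    static int desiredInstances(int queueDepth) {
        int needed = (queueDepth + JOBS_PER_INSTANCE - 1) / JOBS_PER_INSTANCE; // ceiling
        return Math.max(MIN_INSTANCES, needed);
    }

    public static void main(String[] args) {
        System.out.println(desiredInstances(0));   // 3 (redundancy floor)
        System.out.println(desiredInstances(240)); // 24
    }
}
```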
  • 37. Error Handling Messaging Patterns • We use several message patterns – Bad message queue – Dead letter queue – Fail queue • Key concept – Never lose a message – Message is either in-flight, done, or in an error queue somewhere • All require human intervention for the workflow to continue – Not necessarily a bad thing
  • 38. Message Patterns – Bad Message Queue The message doesn’t unmarshal to the object it should, OR we could unmarshal the object but it doesn’t meet our validation rules • Wrapped in a message wrapper which contains context • Never retried • Very rare in production systems • Implemented as an exception handler on the route builder
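The bad-message decision can be sketched as a small classifier: fail the unmarshal, or fail validation, and the message goes to the bad message queue. The string check below stands in for real JAXB unmarshalling, and the XML shape is invented.

```java
// Sketch of the bad-message-queue decision: a message is "bad" if it cannot
// be unmarshalled at all (not XML / wrong document type) or if it unmarshals
// but fails validation (e.g. source must not be null).
public class BadMessageSketch {
    // Returns "bmq" for bad messages, "process" otherwise.
    static String classify(String rawMessage) {
        // Unmarshal step: is this even the document type we expect?
        if (!rawMessage.startsWith("<transcode>") || !rawMessage.endsWith("</transcode>"))
            return "bmq"; // e.g. not XML, or a different type of message
        // Validation step: the source field must not be empty.
        String source = rawMessage
            .substring("<transcode>".length(), rawMessage.length() - "</transcode>".length())
            .trim();
        return source.isEmpty() ? "bmq" : "process";
    }

    public static void main(String[] args) {
        System.out.println(classify("{\"json\":true}"));                      // bmq
        System.out.println(classify("<transcode></transcode>"));              // bmq
        System.out.println(classify("<transcode>s3://mez/a.ts</transcode>")); // process
    }
}
```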
  • 39. Message Patterns – Dead Letter Queue We tried processing the message a number of times, and something we weren’t expecting went wrong each time • Message is an exact copy of the input message • Retried several times before being put on the DLQ • Can be common, even in production systems • Implemented as a bean in the route builder for SQS
  • 40. Message Patterns – Fail Queue Something I knew could go wrong went wrong • Wrapped in a message wrapper that contains context • Requires some level of knowledge of the system to be retried • Often evolves from understanding the causes of DLQ’d messages • Implemented as an exception handler on the route builder
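The fail-queue wrapper can be sketched as follows. The JSON-ish wrapper format is invented, but it shows the key property: the original message survives (escaped) alongside the error text, so an operator console can unwrap and replay it.

```java
// Sketch of the fail-queue wrapper: keep the original message and the
// exception together, so the workflow can be resumed by a human who
// understands the failure.
public class FailQueueSketch {
    static String wrap(String originalMessage, String exceptionMessage) {
        String escaped = originalMessage.replace("\"", "\\\"");
        return "{\"original\":\"" + escaped + "\",\"error\":\"" + exceptionMessage + "\"}";
    }

    // The replay console reverses the wrapping before re-queuing the message.
    static String unwrapOriginal(String wrapped) {
        int start = wrapped.indexOf("\"original\":\"") + "\"original\":\"".length();
        int end = wrapped.indexOf("\",\"error\":");
        return wrapped.substring(start, end).replace("\\\"", "\"");
    }

    public static void main(String[] args) {
        String w = wrap("<transcode>a.ts</transcode>", "tool returned non-zero");
        System.out.println(w);
        System.out.println(unwrapOriginal(w));
    }
}
```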
  • 41. Demonstration – Eventing Framework
  • 42. Questions? philip.cluff@bbc.co.uk dsayed@amazon.com @GeneticGenesis @dsayed
  • 43. Please give us your feedback on this presentation MED302 As a thank you, we will select prize winners daily for completed surveys!