Video Transcoding on Hadoop

Published in: Technology, Business

  1. Video Transcoding on Hadoop
     Presented by Shital Mehta and Kishore Angani | June 3, 2014
     2014 Hadoop Summit, San Jose, California
  2. Outline
     - Video Transcoding at Yahoo
     - Current Architecture: (Hadoop 0.23.x)
     - New Requirements
     - Generic YARN (master / worker)
  3. Video Transcoding at Yahoo
  4. Video Transcoding
     Yahoo Confidential & Proprietary
     - Convert source videos to standard output formats
       - input support
         - > 10 container formats
         - > 40 video codecs
         - > 60 audio codecs
       - output support (at various resolutions and bitrates)
         - mp4/h264/AAC
         - webm/vp8/vorbis
     [Diagram: inputs (AVI, MP4, Mov, 3GP, FLV, WebM, …) converted to outputs (MP4, WebM)]
  5. Related Jobs
     - Post Transcode enrichments
       - watermarking
       - previews
       - thumbnails
       - visual seek
     - Machine learning
  6. Extremely Compute and I/O intensive
     - SLA is measured in multiples of source video length
     - FFmpeg takes between 0.5x and 5x the video duration, depending on
       - hardware / resources available
       - tool configuration, etc.
     - Computation requirements depend on:
       - source and destination parameters
     - Job parallelism
       - some jobs can work on fragmented videos
       - many require the whole video file for optimal results
  7. The Processing Job (DAG)
     [Diagram: timeline of one processing job]
     - t0 start: Download Input Video
     - t1: job split (DAG planning based on source video / requester) into job1 … jobn
     - t2 … tn: partial callbacks, intermediate uploads
     - td done: Merge, Cleanup
     - Example jobs in the DAG:
       - (E) Previews
       - (E) Thumbnails
       - (E) enrichments
       - (T) mp4/h264/AAC/1080p
       - (T) mp4/h264/AAC/720p
       - (T) mp4/h264/AAC/360p
       - (T) webm/vp8/vorbis/1080p
       - (T) webm/vp8/vorbis/720p
  8. Job Characteristics
     - Tens of thousands of input videos / day
     - Source duration ranges from 10 seconds to 2 hours
     - Video sizes vary from a few MBs to a few GBs
     - Variable source / output fan-out
       - 5 to 15 output jobs per source video
       - hundreds of thousands of processing tasks per day
     - Job split and planning at 't1'
       - dependent on source video parameters
     - Static job plan (DAG) based approaches lead to:
       - high resource wastage with reduced concurrency if the DAG is over-provisioned
       - high resource contention with SLA misses when the DAG plan is too strict
     - SLA and predictability are very important
  9. Current Architecture: (Hadoop 0.23.x)
  10. Cascaded Map-Reduce Jobs
      [Diagram: OOZIE orchestrating cascaded MR jobs, with API, Video Store, and HDFS]
      - (M) Download + Split Generation
      - (M) Transcode (several MR jobs in parallel)
      - (R) Cleanup, Notify
  11. Why Hadoop 1/2
      - Extremely reliable as a framework
      - Good Resource Management
        - custom container asks based on source video parameters
        - multiple 2G to 6G MR jobs spawned on demand
        - minimal resource wastage (job plan decided by the parent MR job)
      - Distributed File System (HDFS)
        - used to share video files between various transcode jobs
      - Elasticity
        - scaling achieved by increasing queue capacity
      - Fault Tolerance
        - OOZIE provides job level fault tolerance
        - MR framework provides task level fault tolerance
  12. Why Hadoop 2/2
      - Log analysis and reporting
        - run as MR jobs alongside transcode jobs in the same queue
      - All functions well contained within the Hadoop MR ecosystem
      - Very low maintenance
        - over and above Grid maintenance
      - Lets us focus on the business logic and functions
      - Excellent SLA for big jobs
  13. New Requirements (UGC and near real-time processing)
  14. UGC and the current architecture (shortcomings)
      - Very high variance in User Generated Content
        - duration, size, bitrates, etc.
      - Users want immediate feedback
        - SLA very important here
      - Large number of short videos (< 30 seconds)
      - SLAs on small videos are very high
        - latency in MR containers' allocation and preparation
        - some latency added by OOZIE scheduling
      - OOZIE / MR designed for batch jobs
  15. The Latency
      - Total Δt1 ~ 50 seconds to a minute, Δt2 ~ few seconds
      - Job split decision point important
        - leads to efficient resource utilization
      - Map Reduce framework very good for batch jobs
        - but not suitable for near real-time processing
      - Well known and documented
      - Alternate low latency frameworks available
      [Diagram: OOZIE launches MR1 … MR4 after t1 (job split, DAG planning based on source video / requester); each MR job incurs Δt1 and then Δt2]
      - Δt1: Job Queuing / Scheduling, Container Allocation, Container Localization
      - Δt2: Container warming (ML Models, etc.)
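To make the effect of Δt1 and Δt2 concrete, here is a small arithmetic sketch in Java. It expresses end-to-end time as a multiple of source duration, the SLA unit the deck uses. The 3-second value for Δt2 and the 0.5x transcode factor are assumptions picked from the ranges quoted on the slides, and `LatencyMultiple` is a hypothetical name, not part of the presenters' system.

```java
// Illustrative arithmetic only: the 3-second warm-up and the 0.5x transcode
// factor are assumed values chosen from the ranges quoted on the slides.
final class LatencyMultiple {
    /** End-to-end time divided by source duration (the SLA unit used in the deck). */
    static double slaMultiple(double videoSec, double transcodeFactor,
                              double schedulingSec, double warmupSec) {
        double total = schedulingSec + warmupSec + transcodeFactor * videoSec;
        return total / videoSec;
    }

    public static void main(String[] args) {
        double dt1 = 50.0; // Δt1: queuing, allocation, localization (~50 s to a minute)
        double dt2 = 3.0;  // Δt2: container warming ("few seconds"; 3 s is assumed)
        System.out.printf("30 s clip:    %.2fx%n", slaMultiple(30, 0.5, dt1, dt2));
        System.out.printf("2 hour video: %.2fx%n", slaMultiple(7200, 0.5, dt1, dt2));
    }
}
```

Under these assumptions the fixed overhead pushes a 30-second clip to roughly 2.27x its own length, while a 2-hour video stays near 0.51x, which is why the same pipeline can look excellent for big jobs and poor for short UGC clips.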
  16. New Requirements and options explored
      - Need
        - near real-time scheduling (Δt1)
        - long running re-usable containers (Δt2)
      - Options explored
        - Tez
        - Storm / Spark
        - Slider
  17. Issues with options explored
      - Most (if not all) frameworks optimized for captive data flow
        - (in our case) only job metadata flows through the framework
        - while video blobs are consumed from outer subsystems (HDFS / local storage)
        - metadata is not a clear indicator of job characteristics
      - Video vs Text Processing
        - cannot process line by line
        - no key / value decomposition
        - many jobs require the whole video file to be present locally
  18. The Comparison Sheet

      | Requirement                   | Current   | Tez       | Storm / Spark | Slider    |
      |-------------------------------|-----------|-----------|---------------|-----------|
      | Elasticity                    | High      | High      | High          | High      |
      | Latency                       | High      | Low       | Low           | Low       |
      | Resource Efficiency (usage %) | High      | Low*      | High          | High      |
      | Dynamic DAG                   | Yes       | No        | No            | No        |
      | DAG Fault Tolerance           | Framework | Framework | Framework     | Framework |
      | Resource Management           | Fine      | Fine      | Coarse / None | Fine      |
      | Job / Task Abstraction        | Yes       | Yes       | Yes           | No        |
      | Container Release             | Yes       | Yes       | No            | No        |
      | Container Isolation           | Yes       | Yes       | No            | Yes       |
      | Container PreWarm             | Per Job   | Once      | Once          | Once      |

      * Containers remain idle as DAG cannot be changed post first step
  19. New Architecture: Generic YARN (master / worker)
  20. Generic YARN Master / Worker
      [Diagram: Jobs feed a Master, which talks over RPC to Workers w1 … wn (Type 1…k)]
      - Extremely simple framework
      - Master manages a pool of workers
      - Master reads jobs and distributes to workers over Hadoop RPC
      - Framework has pluggable master and worker tasks
      - Pluggable scheduling strategy to manage workers
      - Heterogeneous worker tasks in same pool
      - Custom resource allocation per worker type
      - Worker resources set up once at bootstrap
      - State management is done by Master using HDFS
      - Security and token management by framework harness
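  21. Master, Worker Interfaces

          public interface Master {
              // Called on behalf of a worker to fetch its next job (reading inferred from the names).
              Job getJobInput(String workerName);

              // Called to report a finished job back to the master.
              void setJobOutput(Job jobOutput);
          }

          public interface Worker {
              // Runs one job and returns its output.
              Job execute(Job jobInput);
          }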
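The two interfaces on this slide can be exercised with a minimal in-memory sketch. Everything other than the `Master` and `Worker` signatures is an assumption made for illustration: the `Job` fields, `QueueMaster`, `EchoWorker`, and the `drain` pull loop are hypothetical, and a plain in-process queue stands in for the Hadoop RPC transport and HDFS-backed state the slides describe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical Job type: the slides never show its fields.
final class Job {
    final String id;
    final String status;
    Job(String id, String status) { this.id = id; this.status = status; }
}

// The two interfaces as they appear on the slide.
interface Master {
    Job getJobInput(String workerName);
    void setJobOutput(Job jobOutput);
}

interface Worker {
    Job execute(Job jobInput);
}

// Assumed in-memory master; the real one uses Hadoop RPC and keeps state in HDFS.
final class QueueMaster implements Master {
    private final Queue<Job> pending = new ArrayDeque<>();
    final List<Job> finished = new ArrayList<>();

    synchronized void submit(Job job) { pending.add(job); }

    @Override public synchronized Job getJobInput(String workerName) {
        return pending.poll(); // null means "no work right now"
    }

    @Override public synchronized void setJobOutput(Job jobOutput) {
        finished.add(jobOutput);
    }
}

// Placeholder worker: a real one would run the transcode here.
final class EchoWorker implements Worker {
    @Override public Job execute(Job jobInput) {
        return new Job(jobInput.id, "DONE");
    }
}

final class MasterWorkerDemo {
    /** Pull loop a long-running worker might run; returns the number of jobs handled. */
    static int drain(Master master, Worker worker, String workerName) {
        int handled = 0;
        for (Job job = master.getJobInput(workerName); job != null;
             job = master.getJobInput(workerName)) {
            master.setJobOutput(worker.execute(job));
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        QueueMaster master = new QueueMaster();
        master.submit(new Job("video-1", "NEW"));
        master.submit(new Job("video-2", "NEW"));
        int n = drain(master, new EchoWorker(), "w1");
        System.out.println(n + " jobs handled, " + master.finished.size() + " outputs recorded");
    }
}
```

The pull model shown here (workers ask for work by name) follows from the `getJobInput(String workerName)` signature; how the actual framework schedules, retries, or times out jobs is not shown on the slides.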
  22. New Architecture for Transcoding
      [Architecture diagram. Components shown: API, Job Queue, Client, Pool (Master; workers of type Worker1 … Workerk, with instances w1 … wm and w1 … wn), HDFS, State Information, Video Storage]
  23. Characteristics of the New Framework
      - Long running workers in YARN containers
        - configurable TTL and timeouts
      - A pool consists of 1 Master and multiple workers
      - Multiple pools are managed by the client
      - Multiple clients across clusters
      - Adaptive container allocation and release
        - scheduling strategy (low-high watermark based)
      - Significant improvements in latency
        - job scheduling and distribution in milliseconds
      - YARN and the Client provide Master fault tolerance
      - Master takes care of fault tolerance for workers
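The slide names a low-high watermark scheduling strategy without giving its rule. One plausible reading, sketched below in Java, keeps the number of idle workers inside a band: request containers when idle workers drop below the low mark, release them when they exceed the high mark. The class name, the choice of idle-worker count as the signal, and the thresholds are all assumptions, not the presenters' implementation.

```java
// Assumed policy: the slides only say "low - high watermark based".
final class WatermarkStrategy {
    private final int lowIdle;   // below this many idle workers, ask YARN for more
    private final int highIdle;  // above this many idle workers, release the surplus

    WatermarkStrategy(int lowIdle, int highIdle) {
        if (lowIdle < 0 || highIdle < lowIdle) {
            throw new IllegalArgumentException("need 0 <= lowIdle <= highIdle");
        }
        this.lowIdle = lowIdle;
        this.highIdle = highIdle;
    }

    /** Positive: containers to request. Negative: containers to release. Zero: hold. */
    int adjustment(int idleWorkers) {
        if (idleWorkers < lowIdle) return lowIdle - idleWorkers;
        if (idleWorkers > highIdle) return highIdle - idleWorkers;
        return 0;
    }

    public static void main(String[] args) {
        WatermarkStrategy s = new WatermarkStrategy(2, 5);
        for (int idle : new int[] {0, 3, 9}) {
            System.out.println("idle=" + idle + " -> adjust by " + s.adjustment(idle));
        }
    }
}
```

Keeping a band rather than a single target avoids requesting and releasing containers on every small fluctuation, which matters when each new container pays the allocation and warm-up cost described earlier in the deck.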
  24. What Next …
      - Hope to release to the community soon
      - Similar in principle to Google containers
        - with a low latency Job abstraction
      - YARN (nice to have):
        - Multi-dimensional scheduling
        - Node Labels
  25. Thank You
      @kishore_angani @smcal75
      We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
