H.264 in CUDA
What is H.264?

• Video compression standard

• Official name: Advanced Video Coding (AVC) for generic
  audiovisual services
   o aka: MPEG-4 Part 10 or MPEG-4 AVC
• It's in your iPod
   o Current generation standardized format
   o Compression efficiency: H.264 >> XviD and DivX
How H.264 Compresses Video

[Figure: Frame 1 to Frame 5 of the Foreman sequence, QCIF @ 25 fps. Labels: Spatial Redundancy, Temporal Redundancy]
    • Three redundancy reduction principles:
       1. Spatial redundancy (Intra-frame prediction)
       2. Temporal redundancy (Inter-frame prediction)
       3. Entropy coding (Mapping more common symbols to shorter codes)
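
A minimal sketch of the third principle, mapping more common symbols to shorter codes. It uses a plain Huffman code only because it is short to write down; H.264 itself uses CAVLC or CABAC. The function name `huffman_code_lengths` and the residual values are made up for illustration.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate case: one distinct symbol
        return {next(iter(counts)): 1}
    # Heap entries: (total count, tie-breaker, {symbol: depth so far})
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)         # the two least frequent subtrees
        n2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical prediction residuals: small values dominate.
residuals = [0] * 8 + [1] * 4 + [-1] * 2 + [5] + [-7]
lengths = huffman_code_lengths(residuals)
total_bits = sum(lengths[s] for s in residuals)   # 30 bits, vs. 48 with 3 bits per symbol
```

The most frequent residual (0) gets a 1-bit code, while the two rare values get 4-bit codes.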
Simple Video Encoder
Intra-frame Prediction
• Prediction block is formed from previously encoded blocks in
  the same frame
• Use spatial similarities to compress each frame
   o Use neighboring pixels to make a prediction on a block
   o Transmit the difference between actual and predicted
   o Tradeoff: prediction accuracy vs. # control bits
• Compression efficiency is relatively low in most areas of a
  typical scene

• Relatively low computation cost
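
The bullets above can be sketched as follows, assuming the simplest kind of predictor: every pixel of the block is predicted as the mean of the already-encoded pixels directly above and to the left (a DC-style prediction). The helper names, the 4x4 block size, and the gradient test frame are assumptions for illustration; real H.264 encoders choose among several prediction modes.

```python
def dc_predict(frame, top, left, size):
    """Predict a size x size block as the mean of the pixels above and to the left of it."""
    above = [frame[top - 1][left + j] for j in range(size)]
    beside = [frame[top + i][left - 1] for i in range(size)]
    dc = round(sum(above + beside) / (2 * size))
    return [[dc] * size for _ in range(size)]

def residual(frame, top, left, size, pred):
    """Difference between actual and predicted pixels; this is what gets transmitted."""
    return [[frame[top + i][left + j] - pred[i][j] for j in range(size)]
            for i in range(size)]

# Hypothetical smooth 8x8 frame (a gentle gradient).
frame = [[100 + 2 * r + c for c in range(8)] for r in range(8)]
pred = dc_predict(frame, 4, 4, 4)
res = residual(frame, 4, 4, 4, pred)
```

On this frame the pixel values are around 112 to 121 but the residuals stay between -1 and 8, so they need far fewer bits than the raw pixels.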




[Figure: frame divided into 16x16 macroblocks (MBs)]
Inter-frame Prediction

• Temporal locality
• Use previous frame as prediction for current frame
• Record movements
   o "motion vectors" (MVs)
Motion Vectors
Motion Estimation Algorithms

• Block Matching
   o 16 pixel x 16 pixel macroblocks
   o Estimate the movement of each macroblock
• Phase Correlation
   o Perform the search in the frequency domain
   o Only works well for translational motion
• Bayesian methods
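
A minimal sketch of block matching by exhaustive search, assuming the sum of absolute differences (SAD) as the matching cost. The function names, the 4x4 block, and the synthetic frames are assumptions chosen to keep the example small; a motion vector here is the (dy, dx) offset from the current block to its best match in the reference frame.

```python
def sad(cur, ref, cy, cx, ry, rx, n):
    """Sum of absolute differences between the n x n block of cur at (cy, cx)
    and the n x n block of ref at (ry, rx)."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(n) for j in range(n))

def full_search(cur, ref, cy, cx, n, r):
    """Try every displacement within +/- r pixels; return (best_sad, (dy, dx))."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            ry, rx = cy + dy, cx + dx
            if 0 <= ry <= h - n and 0 <= rx <= w - n:   # skip candidates outside the frame
                cost = sad(cur, ref, cy, cx, ry, rx, n)
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
    return best

# Hypothetical frames: the content of ref moves down 1 pixel and right 2 pixels.
W = 16
ref = [[y * W + x for x in range(W)] for y in range(W)]
cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(W)]
       for y in range(W)]
best = full_search(cur, ref, 8, 8, 4, 3)
```

Because the content moved down and to the right, the best match for the current block lies up and to the left in the reference frame, so the search returns the vector (-1, -2) with a SAD of 0.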
[Figure: Frame 1 (reference) and Frame 2 (current), with the macroblock to be coded marked. Annotations: the tree moved down and to the right; the people moved farther to the right than the tree.]
Big (Computational) Problem
• HD video: 1080p (1920×1080) = 120 × 68 = 8,160 macroblocks per frame
• Search window: how far we search for the original block
  o   Normally 16 pixels; sometimes 32 pixels
  o   (2*16+1)*(2*16+1) = 1089 positions
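
The arithmetic above, worked through as a short script. It assumes an exhaustive search of every integer position for every 16x16 macroblock against a single reference frame; the function names are illustrative.

```python
import math

def macroblocks(width, height, mb=16):
    """Number of mb x mb macroblocks needed to cover a frame (partial blocks round up)."""
    return math.ceil(width / mb) * math.ceil(height / mb)

def search_positions(r):
    """Number of candidate positions in a +/- r pixel search window."""
    return (2 * r + 1) ** 2

mbs = macroblocks(1920, 1080)               # 120 * 68 = 8160
pos16 = search_positions(16)                # 33 * 33 = 1089
pos32 = search_positions(32)                # 65 * 65 = 4225
abs_diffs_per_frame = mbs * pos16 * 16 * 16 # one absolute difference per pixel per candidate
```

Under these assumptions a single 1080p frame costs about 2.27 billion absolute-difference operations with a 16-pixel window, and roughly four times that with a 32-pixel window.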




[Figure: the ME block in the Current Frame and its Search Space in the Reference Frame]
Profiling Results

• Motion estimation (ME) dominates the encoding time!




  [Figure: profiling results from the JM H.264 reference code]
Amdahl's Law

• Limits the overall speedup
• Eventually, the speedup is limited by the unparallelized portion of
  the code
   o An already optimized ME implementation (like x264) takes a smaller
     share of the runtime, so parallelizing it generally results in a
     lower overall speedup
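
Amdahl's law as a small function. The fractions used below are hypothetical, since the profiling figures are not reproduced in the text; only the speedup factor of 56 is taken from the Results slide.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Hypothetical: ME takes 90% of the encoding time and is accelerated 56x.
heavy_me = amdahl_speedup(0.90, 56)
# Hypothetical: an optimized encoder where ME is only 50% of the time.
light_me = amdahl_speedup(0.50, 56)
# Even an infinitely fast ME cannot beat 1 / (1 - p).
ceiling = 1.0 / (1.0 - 0.90)
```

With these made-up fractions the overall speedup is about 8.6x in the first case and just under 2x in the second, which is the effect the x264 bullet describes.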
Previous Implementations

• x264
   o CPU
   o Open source
   o C and hand-coded assembly
   o VERY optimized
       MMX, SSE2, SSE3, SSE4
   o Considered the fastest implementation of H.264
   o Multithreaded (pthread support)
   o Slow! Slower than last generation encoders.
In CUDA
     • Several published articles implement an H.264
       encoder in CUDA.
     • All of them target ME for parallelization
     • An example*
        o ME = 5 kernels
        o Full-search (i.e., unoptimized ME)
        o Sub-pel MV support
        o Sub-partition support




* Wei-Nien Chen; Hsueh-Ming Hang, "H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)," Multimedia and Expo, 2008
IEEE International Conference on, pp. 697-700, June 23-26, 2008.
Problems with Previous Work

• Do not address inter-block dependencies
  o Sacrifice quality for parallelizability (i.e. speed)




[Figure: MVp dependencies]
Our Project

• H.264 specifies how the decoder will work
   o Flexibility in encoder
       e.g. other CUDA implementations
• Solve motion estimation problem in parallel
   1.Deal with the dependency between blocks
   2.Best guess of MVp
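
To make the dependency concrete: in H.264 the predicted motion vector (MVp) of a block is typically the component-wise median of the motion vectors of its left, top, and top-right neighbors. The sketch below shows only that common case and omits the standard's special cases; the function name and the sample vectors are illustrative.

```python
def median_mv(left, top, top_right):
    """Component-wise median of three neighboring motion vectors given as (dy, dx)."""
    return tuple(sorted(component)[1] for component in zip(left, top, top_right))

# Hypothetical neighbor vectors.
mvp = median_mv((0, 2), (-1, 3), (4, -5))   # (0, 2)
```

Since the MVp needs the final vectors of three neighbors, a block cannot be searched with its true MVp until those neighbors are done, which is the inter-block dependency the previous work sidesteps.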
Direct Approach: Wavefront
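
One common way to build a wavefront schedule, sketched under the assumption that each macroblock depends on its left, top, and top-right neighbors. Macroblocks with the same value of x + 2*y then have no dependencies on each other and can run in parallel. The function name is illustrative.

```python
def wavefronts(mb_cols, mb_rows):
    """Group macroblocks (row, col) into waves; each wave depends only on earlier waves."""
    waves = {}
    for y in range(mb_rows):
        for x in range(mb_cols):
            # left -> index - 1, top-right -> index - 1, top -> index - 2
            waves.setdefault(x + 2 * y, []).append((y, x))
    return [waves[k] for k in sorted(waves)]

small = wavefronts(4, 3)
hd = wavefronts(120, 68)                    # 1080p: 120 x 68 macroblocks
widest = max(len(w) for w in hd)
```

For a 1080p frame this gives 254 sequential waves with at most 60 macroblocks in flight at once, which hints at why the direct approach leaves much of a GPU idle.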
Our Approach: Pyramid ME

• Also known as "Hierarchical" ME
• Perform ME at a number of resolutions, from coarsest to finest
   o Use the MV found at the higher (coarser) level as an estimate of
     the MVp in the lower (finer) level
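
A minimal coarse-to-fine sketch of hierarchical ME. It makes several assumptions not stated in the slides: each level halves the resolution by 2x2 averaging, the cost is SAD, and the doubled MV from the coarser level is used directly as the search center of the next finer level (the slides use it as the MVp estimate). All names and the synthetic frames are illustrative.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block (integer mean)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
             for x in range(0, len(img[0]) - 1, 2)]
            for y in range(0, len(img) - 1, 2)]

def search(cur, ref, cy, cx, n, r, center=(0, 0)):
    """Best (dy, dx) by SAD among displacements within +/- r of `center`."""
    h, w = len(ref), len(ref[0])
    best_cost, best_mv = None, center
    for dy in range(center[0] - r, center[0] + r + 1):
        for dx in range(center[1] - r, center[1] + r + 1):
            ry, rx = cy + dy, cx + dx
            if not (0 <= ry <= h - n and 0 <= rx <= w - n):
                continue                      # candidate falls outside the frame
            cost = sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
                       for i in range(n) for j in range(n))
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv

def pyramid_search(cur, ref, cy, cx, n, levels, r):
    """Search the coarsest level first; each doubled MV seeds the next finer level."""
    pyramid = [(cur, ref)]
    for _ in range(levels - 1):
        c, f = pyramid[-1]
        pyramid.append((downsample(c), downsample(f)))
    mv = (0, 0)
    for level in range(levels - 1, -1, -1):
        c, f = pyramid[level]
        scale = 2 ** level
        mv = search(c, f, cy // scale, cx // scale, max(n // scale, 1), r, center=mv)
        if level > 0:
            mv = (mv[0] * 2, mv[1] * 2)       # rescale for the next finer level
    return mv

# Hypothetical 32x32 frames: the content moves down 4 pixels and right 6 pixels.
ref = [[32 * y + x for x in range(32)] for y in range(32)]
cur = [[32 * (y - 4) + (x - 6) for x in range(32)] for y in range(32)]
mv = pyramid_search(cur, ref, 16, 16, 8, 3, 1)
```

On this input three levels with a +/- 1 window (27 candidates in total) recover the same vector, (-4, -6), that a single-level search needs a +/- 6 window (169 candidates) to find.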
[Figure: Motion Vector; Sub-sampled 16x]
Using Pyramid ME to Solve MVp Problem
Our Prototyping Framework

• Originally MATLAB + nvmex
• Now pyCUDA + matplotlib
• Motivation
  o Simplicity
  o Flexibility (output images, graphs, etc.)
  o pyCUDA == awesome
  o Automatic tuning in the future
Our Prototyping Framework
Our CUDA Implementation

• CUDA + C
• One kernel / level of hierarchy
• One block per macroblock
• One thread per search position
   o With the limit of 512 threads per block, search window size <= 11
   o Can perform argmin reduction to find the best MV
• Texture memory for reference and current frame
   o Allows for sub-pixel interpolation
   o Handles border clamping
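
The argmin reduction mentioned above can be emulated in plain Python to show its shape. This is a sequential stand-in for what the threads of one block would do in shared memory, not the deck's actual kernel; the function names and the row-major mapping from thread index to displacement are assumptions.

```python
def argmin_reduce(costs):
    """Emulate a tree reduction that keeps the minimum (cost, index) pair."""
    n = 1
    while n < len(costs):
        n *= 2
    # Pad to a power of two with +inf so idle "threads" never win.
    pairs = [(c, i) for i, c in enumerate(costs)]
    pairs += [(float("inf"), -1)] * (n - len(costs))
    stride = n // 2
    while stride > 0:
        for tid in range(stride):             # on a GPU these iterations run concurrently
            pairs[tid] = min(pairs[tid], pairs[tid + stride])
        stride //= 2                          # a __syncthreads() barrier would go here
    return pairs[0]

def index_to_mv(i, r):
    """Map a thread index to its (dy, dx), assuming a row-major +/- r window."""
    w = 2 * r + 1
    return (i // w - r, i % w - r)

# Hypothetical SAD values, one per search position.
best_cost, best_index = argmin_reduce([9, 4, 7, 4, 12])
```

Each pass halves the number of active threads, so a 512-thread block finishes in 9 passes; ties go to the lower index because the pairs compare by cost first and index second.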
Results

Gold    203.3 msec
CUDA      3.6 msec      Speedup = 56 (relative to Gold)
x264     11.6 msec

• It is not appropriate to compare the CUDA time to the x264 time.
• x264 performs a more accurate search.
   o The CUDA implementation will be made more accurate in
     the future.
   o We implemented only a small subset of the ME features
Conclusions

• H.264 ME in CUDA is viable, but will not be easy
   o Competing against very well written CPU code
• Full encoding process of H.264 is very complicated
   o Complex control flow and data dependencies
Future Work

• Improve estimate for MVp
• Pipeline data transfers
• Downsample on GPU vs. CPU
   o Data access concerns
• Process multiple frames together
   o Improve occupancy
• More than ME in CUDA
   o More dependency constraints
CUDA as a Development Framework

• Opened up GPU
   o Took less than a month!
• Documentation is sparse
• Right way isn't always known
• Debugging is a pain
• Emulation mode is VERY slow
• CUDA servers can become locked and need rebooting
Acknowledgements

Dark_Shikari (x264 dev)
Various other people in #x264 channel @ Freenode.net
H.264 Encoder Block Diagram

[Figure: H.264 encoder block diagram. Blocks shown: Video Input; subtraction (+ / -) of the block prediction; Transform & Quantization; Entropy Coding; Bitstream Output; Inverse Quantization & Inverse Transform; addition (+ +) of the block prediction; Deblocking Filter; Picture Buffering; Motion Estimation; Motion Compensation; Intra Prediction; Intra/Inter Mode Decision.]
References

Richardson, Iain E. G. (2003). H.264 and MPEG-4 Video Compression: Video Coding for Next-generation
Multimedia. Chichester: John Wiley & Sons Ltd.

Wei-Nien Chen; Hsueh-Ming Hang, "H.264/AVC motion estimation implementation on Compute Unified
Device Architecture (CUDA)," Multimedia and Expo, 2008 IEEE International Conference on, pp. 697-700,
June 23-26, 2008.

S Ryoo, CI Rodrigues, SS Baghsorkhi, SS Stone, DB. "Optimization Principles and Application Performance
Evaluation of a Multithreaded GPU Using CUDA," 2008.

http://www.cs.cf.ac.uk/Dave/Multimedia/node256.html

http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0405/ZAMPOGLU/Hierarchicalestimation.html
