Video encoding uses various techniques to compress video files in a lossy manner. These include representing color information using RGB or YCbCr color spaces, sampling and quantizing signals to convert them to digital form, using the Fourier transform to analyze signal frequencies, windowing to divide signals for transform analysis, inter-frame encoding to remove redundancy between frames, and intra-frame encoding to remove redundancy within frames. Key compression techniques include motion compensation between inter-coded frames and the periodic insertion of intra-coded frames.
Speaker: 정준선 (PhD, University of Oxford; currently at NAVER)
Date: March 2018
The objective of this work is visual recognition of human communications.
Solving this problem opens up a host of applications, such as transcribing archival silent films, or resolving multi-talker simultaneous speech,
but most importantly it helps to advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communications.
Training a deep learning algorithm requires a lot of training data.
We propose a method to automatically collect, process and generate a large-scale audio-visual corpus from television videos temporally aligned with the transcript.
To build such a dataset, it is essential to know 'who' is speaking 'when'.
We develop a ConvNet model that learns a joint embedding of the sound and the mouth images from unlabelled data, and apply this network to the tasks of audio-to-video synchronisation and active speaker detection.
We also show that the methods developed here can be extended to the problem of generating talking faces from audio and still images, and re-dubbing videos with audio samples from different speakers.
We then propose a number of deep learning models that are able to recognise visual speech at sentence level.
The lip reading performance beats a professional lip reader on videos from BBC television.
We demonstrate that if audio is available, then visual information helps to improve speech recognition performance.
We also propose methods to enhance noisy audio and to resolve multi-talker simultaneous speech using visual cues.
Finally, we explore the problem of speaker recognition.
Whereas previous work on speaker identification has been limited to constrained conditions, here we build a new large-scale speaker recognition dataset collected from 'in the wild' videos using an automated pipeline. We propose a number of ConvNet architectures that outperform traditional baselines on this dataset.
Since the birth of RMM in 1995, semiconductor IP activities have grown from a pure R&D initiative into a significant factor in many SoC project budgets.
For the last 25 years, the major growth of the SIP market has been driven by microprocessor, memory and I/O vendors.
In recent years the landscape of SIP has been changing, and new horizons are appearing, brought by new IP vendors representing a variety of business models.
In this panel discussion, each of the speakers will have 5–10 minutes to present their SIP commercialization practice and vision.
The second half of the talk will be dedicated to 3–4 questions to be answered by each of the speakers, allowing the audience to evaluate the different opinions.
This talk was a lecture on social cognition and music at the International Summer School of Systematic, Comparative and Cognitive Musicology, in August 2009, in Jyväskylä, Finland.
Data compression has improved by leaps and bounds over the years due to technical innovation, enabling the proliferation of streamed digital multimedia and voice over IP. For example, a regular cadence of technical advancement in video codecs has led to massive reductions in file size – up to 1000x when comparing a raw video file to a VVC-encoded file. However, with the rise of machine learning techniques and diverse data types to compress, AI may be a compelling tool for next-generation compression, offering a variety of benefits over traditional techniques. In this presentation we discuss:
- Why the demand for improved data compression is growing
- Why AI is a compelling tool for compression in general
- Qualcomm AI Research’s latest AI voice and video codec research
- Our future AI codec research work and challenges
5. Video encoding: basic principles
Color coding
Human eye: colour vs. luminance perception
R (8 bits), G (8 bits), B (8 bits)
Each color is coded separately.
Y (8 bits), Cb (4 bits), Cr (4 bits)
Y: luminance
Cb: blue-difference chroma
Cr: red-difference chroma
Green is represented by the presence of luminance and the absence of blue and red.
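As context for the bit allocation above, here is a minimal Python/NumPy sketch (not from the slides) of a BT.601-style RGB-to-YCbCr conversion followed by 4:2:0 chroma subsampling; the exact coefficients and subsampling scheme are assumptions, since the slides only state the bit split.

import numpy as np

def rgb_to_ycbcr(rgb):
    # Convert an 8-bit RGB image (H, W, 3) to YCbCr using BT.601-style coefficients.
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b               # luminance
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b    # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b    # red-difference chroma
    return np.stack([y, cb, cr], axis=-1).clip(0, 255).astype(np.uint8)

def subsample_420(ycbcr):
    # Keep luma at full resolution; halve the chroma planes in both directions.
    return ycbcr[..., 0], ycbcr[::2, ::2, 1], ycbcr[::2, ::2, 2]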
14. Video encoding: basic principles
Fourier Transform
The transform must consider the complete signal history to recover the exact frequencies in the signal.
To apply the transform we must know the signal behavior from -∞ to +∞.
Is that possible?
And what if the signal behaves like this:
15. Video encoding: basic principles
Windowing
Windowing must be applied to the signal before the Fourier transform, to localize the analysis.
16. Video encoding: basic principles
Windowing
Windowing can be used to divide the signal into small pieces, which are then transformed separately.
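A minimal sketch of this idea, assuming NumPy and a Hann window (neither specified on the slide): the signal is cut into overlapping pieces, each piece is windowed, and each windowed piece is transformed on its own. This is the short-time Fourier transform.

import numpy as np

def stft(signal, win_len=256, hop=128):
    # Window the signal, then transform each windowed piece separately.
    window = np.hanning(win_len)                  # taper to reduce spectral leakage
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        piece = signal[start:start + win_len] * window
        frames.append(np.fft.rfft(piece))         # spectrum of this piece only
    return np.array(frames)                       # shape: (num_pieces, win_len // 2 + 1)

# Example: a tone that jumps from 440 Hz to 880 Hz halfway through.
fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t)])
print(stft(sig).shape)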
18. Video encoding: basic principles
Windowing
The Heisenberg uncertainty principle states that the more precisely a particle's position is known, the less precisely its momentum (energy) can be known.
It is the same as saying that knowledge about time is inversely proportional to knowledge about frequency:
position corresponds to time, and energy corresponds to frequency.
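The trade-off can be illustrated with a small example of my own (not from the slides): a short window pins down when a tone occurs but gives coarse frequency bins, while a long window gives fine bins but smears time.

import numpy as np

fs = 8000
sig = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)    # a 440 Hz tone

for win_len in (64, 1024):
    window = np.hanning(win_len)
    spectrum = np.abs(np.fft.rfft(sig[:win_len] * window))
    bin_hz = fs / win_len                              # frequency resolution of this window
    print(win_len, "samples ->", bin_hz, "Hz per bin, peak near", spectrum.argmax() * bin_hz, "Hz")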
25. Video encoding: basic principles
Fourier transform of an image
This picture is the cover of the book MPEG-2 by John Watkinson, Focal Press.
26. Video encoding: basic principles
Wavelet transform
Wavelets do not use endless sine waves as their basis functions; instead, they use functions that are finite on the time axis.
The window length is variable and inversely proportional to the frequency.
High frequencies are transformed with short basis functions and are therefore accurately located in time. Low frequencies are transformed with long basis functions, which have good frequency resolution.
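As an illustrative sketch (not from the slides), one level of the Haar wavelet, the simplest wavelet basis, splits a signal into a coarse approximation and time-localized detail; recursing on the approximation uses ever longer effective windows at ever lower frequencies, matching the variable window length described above.

import numpy as np

def haar_step(signal):
    # One level of the Haar wavelet: pairwise averages (coarse approximation)
    # and pairwise differences (detail localized in time).
    s = signal.astype(np.float64)
    even, odd = s[0::2], s[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

x = np.array([4, 6, 10, 12, 8, 6, 5, 5], dtype=float)
approx, detail = haar_step(x)
print(approx)   # low-frequency content
print(detail)   # high-frequency content, accurately located in time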
28. Video encoding: basic principles
Frame subdivision
Subdivision of a frame into blocks and super-blocks
Each color plane has its own set of blocks and super-blocks
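A minimal NumPy sketch of this subdivision, assuming a 16x16 block size and frame dimensions that are multiples of it (both assumptions; the slides do not give the block size):

import numpy as np

def to_blocks(plane, block=16):
    # Split one colour plane (H, W) into non-overlapping block x block tiles.
    # H and W are assumed to be multiples of the block size.
    h, w = plane.shape
    return plane.reshape(h // block, block, w // block, block).swapaxes(1, 2)

frame_y = np.random.randint(0, 256, (288, 352), dtype=np.uint8)   # e.g. a CIF luma plane
print(to_blocks(frame_y).shape)                                    # (18, 22, 16, 16)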
29. Video encoding: basic principles
Intra Frame
Intra-coding exploits redundancy within a picture
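One common way to exploit that redundancy, shown here as an assumed sketch rather than the codec's actual method, is to transform each block with a 2-D DCT and quantize the coefficients; smooth blocks then compress to a handful of non-zero values. The sketch uses scipy.fft.dctn/idctn.

import numpy as np
from scipy.fft import dctn, idctn

def intra_code_block(block, qstep=16):
    # Transform an 8x8 block with a 2-D DCT, quantize, then reconstruct.
    # Neighbouring pixels are similar, so most energy lands in a few
    # low-frequency coefficients and the rest quantize to zero.
    coeffs = dctn(block.astype(np.float64) - 128.0, norm='ortho')
    q = np.round(coeffs / qstep)                 # mostly zeros -> cheap to entropy-code
    recon = idctn(q * qstep, norm='ortho') + 128.0
    return q, recon

block = np.tile(np.linspace(50, 100, 8), (8, 1))     # a smooth block, typical of natural images
q, recon = intra_code_block(block)
print(np.count_nonzero(q), "non-zero coefficients out of 64")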
30. Video encoding: basic principles
Inter Frame
Inter-coding exploits redundancy between pictures
31. Video encoding: basic principles
Inter Frame
[Diagram: a golden frame (intra-coded), followed by inter frames, leading to the coded frame]
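A hedged sketch of the block-matching idea behind motion compensation (parameters such as the search radius are assumptions, not taken from the slides): each block of the coded frame is predicted from the best-matching block in a reference ('golden') frame, so only the motion vector and the small residual need to be transmitted.

import numpy as np

def motion_search(ref, cur_block, top, left, radius=8):
    # Full-search block matching: find the displacement (dy, dx) within +/- radius
    # for which the reference frame best predicts cur_block (minimum SAD).
    bh, bw = cur_block.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bh > ref.shape[0] or x + bw > ref.shape[1]:
                continue
            sad = np.abs(ref[y:y + bh, x:x + bw].astype(int) - cur_block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)      # the golden (intra) frame
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))                  # current frame: content shifted by (2, 3)
mv, sad = motion_search(ref, cur[16:32, 16:32], 16, 16)
print(mv, sad)   # recovers (-2, -3) with zero residual: the block came from 2 rows up, 3 columns left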