Hello everyone, I am Ryohei Suzuki from the University of Tokyo, Japan.
Today, I’m going to talk about our new work, AnnoTone.
Nowadays, recording videos, creating video content, and sharing it online have become a casual hobby for everyone, even young children.
We have cheap, high-definition video cameras.
The computer hardware and software needed for video editing are available at reasonable prices.
And we have YouTube, Vimeo, Dailymotion, and other video sharing services serving as broadcasting platforms for everyone.
So, it seems that we have everything needed to enjoy creating and sharing video content.
But we know that, despite all this, video editing is still difficult.
What is the major challenge?
Yes, it takes a lot of time to master video authoring tools, and better user interfaces are in demand.
But in this project, we focused on another problem: context-aware editing requires …
In this talk, we define “context-aware editing” as any type of video editing process that makes intensive use of the context of recording, such as …
The objective of our project is to annotate videos with contextual information during recording, to facilitate video editing for …
First, to automate and speed up existing video editing activities.
Second, to enhance video expression using additional data.
In this talk, we propose …
The core ideas of our work can be summarized as follows.
First, we encode …
Second, we embed the encoded annotations …
Then, finally, we extract the embedded …
First, let me show you how the user can use our system to annotate videos.
The user attaches a smartphone directly to the video camera, as in these pictures.
Then, they launch the annotation-embedding application on the phone.
(Demonstration)
During video recording, the smartphone gathers annotation information from user input or from the sensors installed on it.
The phone converts the annotations into a sequence of inaudible audio signals and transmits them from its loudspeaker,
So the video camera records the scene with the audio watermarks superimposed.
Next, let me describe the editing workflow with embedded annotations.
The overview of the workflow is as follows.
First, the user imports the recorded footage onto their PC and loads it into video authoring software.
The authoring software then uses the watermark extractor to obtain the embedded annotation data, and uses that data to facilitate the editing process.
Finally, the user obtains the edited video and an audio track without annotation signals; combining them completes the video content.
Then, let me go into the editing process in detail.
Generally, video editing involves a pipeline of simple editing processes,
Such as clipping, adding captions, animation, color correction, and so on.
Our annotated audio track can pass through the existing pipeline like an ordinary one,
Because it is merely audio data.
When one of the processes in the pipeline needs annotation data for editing,
It can use the provided watermark-extraction API to extract the annotation data from the annotated track.
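At its core, extraction means detecting which high-frequency tones are present in each short window of audio. Here is a minimal sketch, not the actual AnnoTone API: the function names and the seven candidate frequencies are illustrative, and the Goertzel algorithm is used as the standard way to measure per-frequency power in DTMF-style detection.

```python
import math

def goertzel_power(samples, sample_rate, freq_hz):
    """Power of one target frequency in a sample window
    (Goertzel algorithm), as commonly used for DTMF detection."""
    w = 2 * math.pi * freq_hz / sample_rate
    coeff = 2 * math.cos(w)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def detect_tones(samples, sample_rate, candidates, top=2):
    """Return the `top` candidate frequencies with the most power;
    for a dual-tone symbol these are the two embedded carriers."""
    powers = {f: goertzel_power(samples, sample_rate, f) for f in candidates}
    return sorted(sorted(powers, key=powers.get, reverse=True)[:top])
```

For example, a 10 ms window containing 18.0 kHz and 18.6 kHz tones, checked against seven evenly spaced candidates, yields exactly those two frequencies.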
And when the annotations become needless after a process, their signals can be removed by applying an audio filter,
So the user gets a clean audio track to proceed to the following processes, for example, audio mastering.
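Removal can be as simple as low-pass filtering everything above the audible band. The sketch below uses a single-pole IIR filter purely for illustration; a real pipeline would use a much steeper filter, and the 15 kHz cutoff is an assumption, not the system's actual parameter.

```python
import math

def lowpass(samples, sample_rate, cutoff_hz):
    """Single-pole IIR low-pass filter: attenuates energy above
    cutoff_hz, so near-ultrasonic watermark tones are suppressed
    while speech and music below the cutoff pass almost unchanged."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out
```

Running an 18 kHz watermark tone through this filter at a 15 kHz cutoff roughly halves its amplitude, while a 1 kHz speech-band tone is barely touched; a steeper filter makes the separation far sharper.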
Then, let me show you some applications of AnnoTone.
The first example introduces the concept of “record-time editing”.
In this application, the camera operator annotates a long piece of footage with success/failure information about the actor’s performance, instead of repeatedly stopping and restarting the recording whenever the actor makes a mistake.
After recording, the software automatically extracts only the successful parts of the footage and combines them into a complete video.
This would be useful for recording lecture videos, which may involve many mistakes by the speaker.
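The “keep only the good takes” logic can be sketched in a few lines. The annotation format here is a hypothetical example: the operator is assumed to tap “success” or “failure” at the end of each take, producing (timestamp, label) markers.

```python
def successful_segments(markers):
    """Return (start, end) spans of the takes marked 'success'.

    `markers` is a list of (time_sec, label) pairs; each marker is
    assumed to be tapped at the END of its take, so a take runs from
    the previous marker (or 0) to the current one. (This annotation
    format is an illustrative assumption, not the paper's spec.)
    """
    segments, start = [], 0.0
    for t, label in sorted(markers):
        if label == "success":
            segments.append((start, t))
        start = t
    return segments
```

For instance, markers at 12.0 s (failure), 30.5 s (success), 41.0 s (failure), and 70.2 s (success) yield the spans (12.0, 30.5) and (41.0, 70.2), which the editor then concatenates.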
Using the GPS receiver of a smartphone, we can exploit positional information for video editing.
While recording, the smartphone periodically embeds the camera’s location, so the user can obtain a sequence of locations, or a path, for the footage while editing.
Such a sequence can be used to create an overlaid map, like the image on the left.
In the movie on the right, the footage is mapped onto a geographical map as a path, and the user can clip a portion of the footage by sketching the corresponding path on the map.
This map-based clipping could be very useful for long videos with movement, such as touring videos.
Annotations can also be used to create various kinds of overlaid content.
In this application, the camera operator annotates a video of a chess game with the game record, using the notation interface shown on the left.
Then, the system automatically generates overlaid graphics of the chessboard, as in the movie on the right.
We also prepared a mechanism to integrate AnnoTone with Adobe After Effects.
Our plugin provides annotation data to the After Effects editing workflow to generate various effects and animations.
This enables users to exploit annotations within their established editing practices.
The way of integration is as follows.
First, the AnnoTone software analyzes the footage to extract the annotations.
The system then generates a text layer containing JSON-formatted annotation data at each timeframe.
The user can then associate video effects, parameters, and animations with the annotations using After Effects’ expression function, a lightweight end-user scripting mechanism.
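The idea of the generated text layer can be sketched as follows. The function name and input format are illustrative, not the plugin's real interface: time-stamped annotations are turned into per-frame JSON strings, which an After Effects expression can then parse at the current frame.

```python
import json

def annotation_text_keyframes(annotations, fps=30):
    """Map time-stamped annotations to per-frame JSON strings.

    `annotations` maps a time in seconds to an arbitrary dict. The
    result mimics a text layer whose source text holds JSON-formatted
    annotation data at each timeframe, ready to be read by an
    expression. (Names and format here are illustrative.)
    """
    frames = {}
    for t, data in annotations.items():
        frames[round(t * fps)] = json.dumps(data, sort_keys=True)
    return frames
```

For example, a chess-move annotation at 1.5 s becomes the JSON string `{"move": "e2e4"}` keyed to frame 45 at 30 fps.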
Next, let me talk about the technical details of AnnoTone’s audio watermarking.
It is known that human hearing sensitivity drops as sound frequency increases,
And most people cannot hear high-frequency tones above 17 or 18 kHz.
On the other hand, an ordinary video camera with a microphone can record high-frequency sounds up to 22 kHz.
Therefore, we can hide information in the high-frequency range of the audio track as modulated signals.
This is an example spectrogram of a video’s audio track.
This is the high-frequency region, which is almost inaudible to humans,
And this is where we hide the information.
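In its simplest form, embedding is just mixing a low-amplitude tone above roughly 17 kHz into the track. The sketch below uses an illustrative frequency and amplitude, not the system's actual parameters.

```python
import math

def embed_tone(samples, sample_rate, freq_hz, amplitude=0.05):
    """Superimpose a low-amplitude high-frequency tone on an audio
    track. Above ~17 kHz the tone is inaudible to most listeners,
    but a camera microphone sampling at 44.1 or 48 kHz still records
    it. (18.5 kHz and amplitude 0.05 below are illustrative values.)
    """
    return [x + amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i, x in enumerate(samples)]
```

Applied to silence, the output is a pure tone whose peak never exceeds the chosen amplitude, so the watermark stays far below the audible program material.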
But what is the benefit of using audio watermarking for data hiding?
First of all, it is compatible with almost all video cameras, because capturing the embedded signals requires only a microphone.
Second, synchronization between the annotations and the video’s timestamps is preserved throughout the editing process, because the data is embedded directly in the recorded audio.
Additionally, the watermark signals can easily be removed by applying a low-pass filter.
Our watermarking protocol uses Dual-Tone Multi-Frequency (DTMF) modulation to encode digital data as audio signals.
It represents 4 bits of information per unit signal with a combination of two single tones chosen from 7 frequencies.
Our packet format has a variable-length payload and a gross data rate of 400 bps.
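The modulation can be sketched as follows. The seven carrier frequencies and the 10 ms symbol length are illustrative assumptions, chosen so that 4 bits per 10 ms symbol matches the stated 400 bps gross rate; choosing 2 of 7 tones gives C(7,2) = 21 combinations, of which 16 are enough to encode one 4-bit symbol.

```python
import itertools
import math

# Seven carrier frequencies in the near-ultrasonic band.
# (Illustrative values; the actual protocol's frequencies may differ.)
FREQS = [18000 + 300 * i for i in range(7)]  # 18.0 ... 19.8 kHz

# Two simultaneous tones out of 7 give C(7,2) = 21 combinations;
# the first 16 each encode one 4-bit symbol.
PAIRS = list(itertools.combinations(FREQS, 2))[:16]

def encode_nibble(nibble, sample_rate=48000, duration=0.01):
    """One 4-bit symbol as a dual-tone signal
    (4 bits per 10 ms symbol = 400 bps gross)."""
    f1, f2 = PAIRS[nibble]
    n = int(sample_rate * duration)
    return [0.5 * (math.sin(2 * math.pi * f1 * i / sample_rate) +
                   math.sin(2 * math.pi * f2 * i / sample_rate))
            for i in range(n)]

def encode_bytes(data, **kw):
    """Encode bytes as a sequence of dual-tone symbols, high nibble first."""
    out = []
    for b in data:
        out += encode_nibble(b >> 4, **kw)
        out += encode_nibble(b & 0x0F, **kw)
    return out
```

One byte thus becomes two 10 ms dual-tone symbols, which are then attenuated and mixed into the high-frequency band of the track.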
Next, let me introduce some work related to AnnoTone.
ContextCam is a special camera that records contextual information about home videos, such as location and the people present.
It stores annotations in the video frames using an image watermarking technique.
It does realize context-aware video recording for a specific purpose.
However, because it is a special camera, the technique is not compatible with existing equipment.
Cryptone, or the Ultra Sound Control protocol, is an interaction technique between a loudspeaker and the audience’s smartphones at music venues.
It uses high-frequency modulation to convey simple information.
AnnoTone’s audio watermarking technique is very similar to Cryptone’s and can be seen as an extension of it.
However, its purpose is very different.
Let me show you the results of performance evaluations briefly.
First, we measured the maximum watermarking data rate that achieves sufficient reliability.
The results showed that a nearly 100% correct detection rate can be achieved at a 400 bps annotation data rate in four acoustic environments: a silent room, a public street, and rooms with rock music and with electronic music playing.
Next, we measured how far watermark signals can travel through the air from a smartphone speaker.
The results showed that they can travel up to about 20 cm, which implies that users have some flexibility in the hardware setup when the shapes of the devices make it difficult to attach the smartphone directly to the camera.
Thirdly, we tested the durability of watermark signals against audio format conversion, a common process in video editing.
According to the results, watermarks are preserved after conversion to the Ogg Vorbis, AC-3, and AAC formats, given a sufficient bitrate.
With MP3 as the destination format, however, we could not preserve the watermarks even at very high bitrate settings.
Finally, we tested the transparency, or imperceptibility, of the audio watermark signals to the human ear.
We recruited 6 participants and asked them to click a button whenever they noticed a noise while listening to an annotated audio track.
The results showed that the watermark signals are not completely transparent, and younger participants in particular were able to notice them.
However, they became almost completely transparent after applying a low-pass filter.
We admit that AnnoTone has some limitations.
First of all, it requires one-off development of a smartphone annotation-embedding application for each new type of annotation.
Secondly, some loss of audio quality during watermark removal is inevitable, because removal simply uses low-pass filtering.
Also, the annotation data rate is significantly limited, so annotating a video with a large amount of data requires another approach.
For example, AnnoTone’s annotations can serve as anchors for annotation data recorded separately.
As future work,
We are considering transmitting watermark signals from publicly installed speakers to annotate many videos simultaneously.
This could be used to synchronize or integrate a large number of videos recorded at the same place, such as a stadium, to create new types of video content, like multi-view videos.
Similar techniques could also be used for entertainment at amusement parks.
Now, let me conclude the presentation.
We proposed …
The benefit of the technique is that …
Thank you for listening.