Hello everyone, I am Ryohei Suzuki from the University of Tokyo, Japan. Today, I’m going to talk about our new work, AnnoTone.
Nowadays, recording videos, making video content, and sharing it online have become casual hobbies for everyone, possibly including little children.
We have high-definition, inexpensive video cameras.
The computer hardware and software needed for video editing are available at reasonable prices.
And we also have YouTube, Vimeo, Dailymotion, and other video sharing services working as broadcasting platforms for everyone.
So, it seems that we have everything we need to enjoy creating and sharing video content. But we know that video editing is still difficult in spite of all this. What is the major challenge?
Yes, it takes a lot of time to master video authoring tools, and better user interfaces are needed. But in this project, we focused on another problem: context-aware editing requires …
In this talk, we define “context-aware editing” as any type of video editing that relies heavily on the context of the recording, such as …
The objective of our project is to annotate videos with contextual information during recording to facilitate video editing for…
First, automating and speeding up existing video editing activities; second, enhancing video expression by using additional data.
In this talk, we propose …
The core ideas of our work can be summarized as follows.
Firstly, we encode … Secondly, we embed encoded annotations … Then, finally we extract the embedded …
First, let me show you how the user can use our system to annotate videos.
To begin, the user attaches a smartphone directly to a video camera, as in these pictures, and then launches an annotation-embedding application on the phone. (show it)
During video recording, the smartphone gathers annotation information from either user input or its built-in sensors.
The phone converts the annotation into a sequence of inaudible audio signals and transmits it from its loudspeaker.
Then, the video camera records the scene with the audio watermarks superimposed.
Next, let me explain the editing workflow with embedded annotations.
The overview of the workflow is as follows.
First, the user imports the recorded footage into their PC and loads it into video authoring software.
Then, the authoring software uses the watermark extractor to obtain the embedded annotation data, and uses that data to facilitate the editing process.
Finally, the user gets the edited video and an audio track without annotation signals, and combining them completes the video.
Now, let me go into the editing process in detail.
Generally, video editing involves a pipeline of simple editing processes, starting from clipping, then adding captions, animation, color correction, and so on.
Our annotated audio track can pass through the existing pipeline as an ordinary one, because it is merely audio data.
When one of the processes in the pipeline needs annotation data for editing, it can use the provided watermark extraction API to extract annotation data from the annotated track.
And when the annotations are no longer needed after that process, their signals can be removed by applying an audio filter. The user then gets a clean audio track for the following processes, for example, audio mastering.
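To make the removal step concrete, here is a minimal sketch of low-pass filtering in plain Python. It is an illustration under stated assumptions, not AnnoTone's actual filter: the 44.1 kHz sample rate, the 17 kHz cutoff, the 101-tap windowed-sinc design, and all function names are hypothetical choices made for this example.

```python
import math

def lowpass_kernel(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc FIR low-pass kernel (Hamming window), unity gain at DC."""
    fc = cutoff_hz / sample_rate          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def apply_fir(samples, taps):
    """Centered convolution: output has the same length, edges are zero-padded."""
    half = len(taps) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out

def tone_amplitude(samples, freq_hz, sample_rate):
    """Estimate the amplitude of a sinusoid at freq_hz by correlation."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n
```

Applying `apply_fir(track, lowpass_kernel(17000, 44100))` to a track that mixes an audible 1 kHz tone with a 19 kHz watermark-like tone should leave the audible tone almost unchanged and strongly attenuate the high-frequency one.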
Then, let me show you some applications of AnnoTone.
The first example introduces the concept of “Record-time Editing”.
In this application, the camera operator annotates a long piece of footage with “success / failure” information about the actor’s performance, instead of repeatedly stopping and restarting the recording whenever the actor makes a mistake.
After recording, the software automatically extracts only the successful parts of the footage and combines them into a complete video.
It would be useful for recording lecture videos, which may involve a lot of mistakes by the actor.
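The extraction step can be sketched roughly as follows. This is only an illustration: it assumes that each decoded annotation is a timestamped success/failure mark that judges the take ending at that moment. The talk does not specify the annotation format, so the `Mark` type and `successful_segments` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Mark:
    time: float     # seconds from the start of the footage
    success: bool   # operator's judgement of the take that ends at `time`

def successful_segments(marks, start=0.0):
    """Return (begin, end) spans of successful takes, merging adjacent ones."""
    segments = []
    prev = start
    for m in sorted(marks, key=lambda m: m.time):
        if m.success:
            if segments and segments[-1][1] == prev:
                # The previous take was also a success: extend that span.
                segments[-1] = (segments[-1][0], m.time)
            else:
                segments.append((prev, m.time))
        prev = m.time
    return segments
```

For marks at 10 s (success), 18 s (failure), and 30 s (success), the sketch keeps the spans 0 to 10 s and 18 to 30 s, which an editor could then concatenate.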
Using the GPS receiver of a smartphone, we can exploit positional information for video editing.
While recording, the smartphone periodically embeds the camera's location, so the user can get a sequence of locations, or a path, for the footage while editing.
Such a sequence can be used to create an overlaid map like the one in the left image. In the right movie, the footage is mapped onto a geographical map as a path, and the user can clip a portion of the footage by sketching the corresponding path on the map.
This map-based clipping might be very useful for dealing with long videos that involve movement, such as touring videos.
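One plausible way to turn a sketched path into a clip range is shown below. This is a simplified sketch and not necessarily how AnnoTone matches paths: it assumes the decoded track is a list of `(time, lat, lon)` samples, treats the sketch as a list of `(lat, lon)` points, and keeps the longest run of samples within a hypothetical `radius_m` of any sketch point.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def clip_by_sketch(track, sketch, radius_m=30.0):
    """Return (t_start, t_end) of the longest run of samples near the sketch."""
    best = None
    run_start = None
    for t, lat, lon in track:
        near = any(haversine_m(lat, lon, slat, slon) <= radius_m
                   for slat, slon in sketch)
        if near:
            if run_start is None:
                run_start = t
            if best is None or t - run_start > best[1] - best[0]:
                best = (run_start, t)
        else:
            run_start = None  # the run is broken; start over at the next hit
    return best
```

The function returns `None` when no sample lies near the sketch, so a caller can tell "nothing selected" apart from a zero-length clip.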
And, annotations can be used to create various kinds of overlaid content.
In this application, the camera operator annotates a video of a chess game with the game's chess notation, using the notation interface shown on the left. Then, the system automatically generates overlaid graphics of the chess board, as in the right movie.
We also prepared a mechanism to integrate AnnoTone with Adobe AfterEffects software.
Our plugin provides annotation data to the AfterEffects editing workflow, where it can be used to generate various effects and animations. It lets users exploit annotations within established editing practice.
The way of integration is as follows.
First, the AnnoTone software analyzes the footage to extract annotations.
The system then generates a text layer containing JSON-formatted annotation data at each timeframe.
Finally, the user can associate video effects, parameters, and animations with annotations using AfterEffects expressions, a lightweight end-user scripting mechanism.
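As a rough illustration of the second step, the sketch below converts timestamped annotations into per-frame JSON strings of the kind a text layer could hold. The talk does not describe the actual JSON schema or the plugin's interface, so the `tilt` key, the 30 fps default, and the `text_layer_keyframes` function are all assumptions made for this example.

```python
import json

def text_layer_keyframes(annotations, fps=30.0):
    """Map (time_in_seconds, data) pairs to (frame_index, json_text) keyframes."""
    keyframes = []
    for time_s, data in sorted(annotations, key=lambda a: a[0]):
        frame = round(time_s * fps)
        # Compact, key-sorted JSON keeps the text stable and easy to parse.
        text = json.dumps(data, sort_keys=True, separators=(",", ":"))
        keyframes.append((frame, text))
    return keyframes

# Example with a hypothetical sensor reading named "tilt".
example = text_layer_keyframes([(0.5, {"tilt": 12.5}), (0.0, {"tilt": 0.0})])
```

An expression on another layer could then read the text at the current frame, parse it, and drive a parameter such as rotation from the parsed value.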
Next, let me talk about the technical details of AnnoTone's audio watermarking.
It is known that human hearing sensitivity drops as sound frequency increases, and most people cannot hear high-frequency tones above 17 or 18 kHz.
On the other hand, an ordinary video camera with a microphone can record high-frequency sounds up to 22 kHz. Therefore, we can hide information in the high-frequency range of the audio track as modulated signals.
This is an example spectrogram of the audio track of a video. This is the high-frequency region, which is almost inaudible to humans, and we hide information there, like this.
But what is the benefit of using audio watermarking for data hiding?
First of all, it is compatible with almost all video cameras, because embedding only requires a microphone.
Second, the synchronization between annotations and video timestamps is preserved throughout the editing process, because the annotations are embedded directly in the video sequence.
Additionally, the watermark signals can be easily removed by applying a low-pass filter.
Our watermarking protocol uses Dual-Tone Multi-Frequency (or DTMF) to modulate digital data into audio signals.
It represents 4 bits of information per unit signal as a combination of two single tones chosen from 7 frequencies.
Our packet representation has a variable-length payload and a 400 bps gross data rate.
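The modulation idea can be sketched as follows. This is a hedged illustration, not AnnoTone's actual protocol: the seven carrier frequencies, the mapping from 4-bit values to tone pairs, and the function names are assumptions. Choosing two of seven tones gives 21 possible pairs, of which this sketch uses the first 16. At 4 bits per symbol, a 400 bps gross rate corresponds to 100 symbols per second, or 10 ms per symbol. Detection here uses the Goertzel algorithm, a common choice for DTMF-style decoding.

```python
import math
from itertools import combinations

SAMPLE_RATE = 44100
# Hypothetical carriers in the near-ultrasonic band (not from the talk).
TONES_HZ = [18000, 18300, 18600, 18900, 19200, 19500, 19800]
# First 16 of the 21 two-tone combinations, one per 4-bit value (assumed mapping).
PAIRS = list(combinations(range(7), 2))[:16]
SYMBOL_SECONDS = 4 / 400  # 4 bits per symbol at 400 bps gross -> 10 ms

def encode_nibbles(data):
    """Split each byte into two 4-bit values, high nibble first."""
    out = []
    for b in data:
        out += [b >> 4, b & 0x0F]
    return out

def synth_symbol(nibble):
    """Synthesize one symbol as the sum of its two assigned tones."""
    i, j = PAIRS[nibble]
    n = round(SAMPLE_RATE * SYMBOL_SECONDS)
    return [0.5 * (math.sin(2 * math.pi * TONES_HZ[i] * k / SAMPLE_RATE)
                   + math.sin(2 * math.pi * TONES_HZ[j] * k / SAMPLE_RATE))
            for k in range(n)]

def goertzel_power(samples, freq_hz):
    """Signal power at freq_hz, computed with the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq_hz / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def detect_symbol(samples):
    """Pick the two strongest carriers and map the pair back to a 4-bit value."""
    powers = [goertzel_power(samples, f) for f in TONES_HZ]
    strongest = sorted(range(7), key=lambda i: powers[i], reverse=True)[:2]
    return PAIRS.index(tuple(sorted(strongest)))  # ValueError if pair is unused
```

A real decoder would also need symbol synchronization, packet framing, and error handling for noisy recordings, all of which this sketch omits.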
Next, let me introduce some work related to AnnoTone.
ContextCam is a special camera that can record contextual information for home videos, such as location and the presence of people.
It stores annotations in video frames using an image watermarking technique.
Indeed, it realizes context-aware video recording for a specific purpose.
However, because it is a special camera, this technique is not compatible with existing equipment.
Cryptone, or the Ultra Sound Control protocol, is an interaction technique between a loudspeaker and the audience's smartphones at music venues.
It uses high-frequency modulation to convey simple information.
AnnoTone’s audio watermarking technique is very similar to that of Cryptone, and can be seen as an extension of it. But AnnoTone's purpose is very different.
Let me show you the results of performance evaluations briefly.
First, we measured the maximum watermarking data rate that still achieves sufficient reliability.
The results showed that an almost 100% correct detection rate can be achieved at a 400 bps annotation data rate in four acoustic environments: a silent room, a public street, and with rock music and electronic music playing.
Next, we measured how far watermark signals can travel through air from a smartphone speaker.
The results showed that they can travel up to about 20 cm, which implies that users have some flexibility in the hardware setup when the shapes of the devices make it difficult to attach the smartphone directly to the camera.
Third, we tested the durability of watermark signals against audio format conversion, a common process in video editing.
According to the results, watermarks are preserved after conversion into Ogg Vorbis, AC-3, and AAC formats, if the bitrate is high enough.
On the other hand, with MP3 as the destination format, we could not preserve the watermarks even at a very high bitrate setting.
Finally, we tested the transparency, or imperceptibility, of audio watermark signals to the human ear.
We recruited 6 participants and asked them to click a button whenever they noticed a noise while listening to an annotated audio track.
The results showed that watermark signals are not completely transparent; young participants in particular were able to notice them.
But they became almost completely transparent after applying a low-pass filter.
And, we admit AnnoTone has some limitations.
First of all, each different type of annotation requires one-off development of an annotation-embedding smartphone application.
Secondly, some audio quality loss during watermark removal is inevitable, because removal simply uses low-pass filtering.
And the annotation data rate is significantly limited, so if we want to annotate a video with a large amount of data, we should consider another approach. For example, we can use AnnoTone’s annotations as anchors for body annotation data that is recorded separately.
As future work, we are thinking about transmitting watermark signals from publicly installed speakers to annotate many videos simultaneously.
It could be used to synchronize or integrate a large number of videos recorded at the same place, such as a stadium, to create new types of video content, like multi-view videos.
Also, a similar technique could be used for entertainment at amusement parks.
Then, let me conclude the presentation.
We proposed … The benefit of the technique is that … Thank you for listening.
AnnoTone (CHI 2015)
Record-time Audio Watermarking
for Context-aware Video Editing
THE UNIVERSITY OF TOKYO
CHI 2015 @ Seoul
Session: What do I hear? Communicating with Sound
Video recording and sharing have become
casual hobbies for everyone.
Video Editing is Still Difficult
1. Cost of learning video authoring tools is high
2. Context-aware editing requires much labor
for careful review and trial-and-error
• Adding visual effects
• Clipping scenes
• Adding captions and overlays
• Using additional information (e.g., GPS)
Annotating videos with contextual information
during recording to facilitate video editing
1. Automate & speed-up video editing activity
2. Enhance expressions using additional data
In this talk, we propose
1. A video-annotation technique requiring
no special equipment
2. A video-editing workflow that exploits
contextual information for efficient editing.
• Encoding contextual information as
inaudible sound signals
• Embedding encoded annotations directly
into the audio track of video during recording
• Extracting the embedded information
while editing process on demand
Recording: information of Success/Failure
Editing: Automatic extraction of successful parts
Success Failure Success
Good! Bad! Good!
Automatic extraction & combining
Video-editing with GPS
Recording: GPS positions
Editing: location-aware editing
Clipping movie by
sketching on a map
Automatic map overlay
Recording: chess note of a game
Editing: automatic overlaying of board graphics
Notation UI Synthesized video
Integrating with AfterEffects
AnnoTone plugin provides annotation data for AE
which can be used for generating effects
Exploiting annotations with existing practice
Controlling AE animation
with sensor data
Integrating with AfterEffects
1. Analyzing footage to extract annotations
2. Generating a text layer containing JSON-formatted
annotation data at each timeframe
3. Associating video effects/parameters with
annotations using expressions mechanism
JSON text layer
Human Hearing Characteristics
Humans cannot perceive high-frequency sounds.
Sakamoto, Masayuki, et al. "Average thresholds in the 8 to 20 kHz range as a function of age."
Scandinavian audiology 27.3 (1998): 189-192.
Data-hiding as High-frequency
We can hide information in the audio track
as high-frequency signals (audio watermarks).
Spectrogram of audio track
Benefit of Audio Watermarking
• Compatible with almost all video cameras
• Consistent synchronization between
annotations and video sequence
• Removable by applying low-pass filter
• Dual-Tone Multi-Frequency (DTMF)
– Representing 4 bits of information by a combination of
two single tones from 7 frequencies
• Packet representation
– Variable-length payload
– 400 bps gross data rate
Spectrogram of a watermark packet
[Patel & Abowd, 2004]
Incompatible with existing video cameras.
Using special camera to record contexts of home videos
Storing annotations in frames by image watermarking
Cryptone (Ultra Sound Control)
[Hirabayashi & Shimizu, 2012]
AnnoTone uses similar audio data-hiding method
for video editing support.
Interaction between loudspeaker and smartphones
using high-frequency tones to convey information
[Chart: x-axis: gross bitrate (bps), ticks 667, 571, 500, 444, 400, 364]
Data-rate vs. Reliability
~100% correct detection rate was achieved
with 400 bps annotation data rate.
Watermark signals can travel up to 20 cm
through air from a smartphone speaker
[Chart: x-axis: speaker and microphone (cm), 0 to 35]
Durability against Conversion
Watermarks are preserved after conversion into
Ogg Vorbis, AC-3 and AAC with enough bitrate.
[Chart: x-axis: bit rate (kbps), ticks 128, 192, 256, 320]
Transparency for Human Ear
Measured noticeability of watermarks to humans
• Click a button on noticing noise (6 participants)
[Chart: conditions: silent, public, rock, electronic]
• One-off development of
• Audio quality loss in watermark removal
• Limited data-rate of annotation
Embedding from Public Speaker
• Synchronization & integration of large number
of videos to create multi-view videos, etc.
• Entertainment use at amusement parks, etc.
“Sleeping Beauty Castle at Disneyland” by Lyght
Licensed under CC BY-SA 3.0
“Picture of Stadium” by Jazza5
Licensed under CC BY-SA 3.0
a video annotation technique using audio watermarking,
and a video-editing workflow exploiting annotations.
AnnoTone can facilitate and enhance non-professional
video editing process without special equipment.
Some smartphone camera apps can record
annotations in a metadata format (e.g., Adobe XMP)
– Of course, using such apps is clever for smartphone
What is AnnoTone’s advantage?
• Dedicated video cameras are still superior to
– In resolution, definition, lens quality, etc.
• No need to deal with external metadata
– Because annotations are directly embedded as sound