AnnoTone (CHI 2015)


Ryohei Suzuki, Daisuke Sakamoto and Takeo Igarashi
"AnnoTone: Record-time Audio Watermarking for Context-aware Video Editing"
Presented at CHI 2015, Seoul

Published in: Science
  • Hello everyone, I am Ryohei Suzuki from the University of Tokyo, Japan.
    Today, I’m going to talk about our new work, AnnoTone.
  • Nowadays, recording videos, creating video content, and sharing it online have become casual hobbies for everyone, possibly including small children.
  • We have cheap, high-definition video cameras.

    The computer hardware and software needed for video editing are available at a reasonable price.

    And we also have YouTube, Vimeo, Dailymotion and other video sharing services that work as broadcasting platforms for everyone.
  • So, it seems that we have everything we need to enjoy creating and sharing video content.
    But we know that video editing is still difficult in spite of all this.
    What is the major challenge?

    Yes, it takes a lot of time to master video authoring tools, and their user interfaces need improvement.
    But in this project, we focused on another problem: context-aware editing requires …

    In this talk, we define “context-aware editing” as any type of video editing process that makes intensive use of the recording context, such as

  • The objective of our project is to annotate videos with contextual information during recording to facilitate video editing for…

    First, automating and speeding up existing video editing activities.
    Second, enhancing video expression by using additional data.
  • In this talk, we propose …
  • The core ideas of our work can be summarized as follows.

    Firstly, we encode …
    Secondly, we embed encoded annotations …
    Then, finally we extract the embedded …

  • First, let me show you how the user can use our system to annotate videos.
  • First, the user attaches a smartphone directly to a video camera, as in these pictures.
    Then, the user launches an annotation-embedding application on the phone.
  • During video recording, the smartphone gathers annotation information from either user input or its built-in sensors.

    The phone converts the annotation into a sequence of inaudible audio signals and plays it through its loudspeaker.

    The video camera then records the scene with the audio watermarks superimposed.
  • Next, let me explain the editing workflow with embedded annotations.
  • The overview of the workflow is as follows.

    First, the user imports the recorded footage into their PC and loads it into video authoring software.

    Then, the authoring software uses the watermark extractor to obtain the embedded annotation data and uses that data to facilitate the editing process.

    Finally, the user gets the edited video and an audio track without annotation signals; combining them completes the video.

    Then, let me go into the editing process in detail.
  • Generally, video editing involves a pipeline of simple editing processes,
    starting from clipping, then adding captions, animation, color correction, and so on.
  • And our annotated audio track can pass through the existing pipeline like an ordinary one,
    because it is merely audio data.
  • When one of the processes in the pipeline needs annotation data for editing,
    it can use the provided watermark extraction API to extract annotation data from the annotated track.
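
The talk does not show the extraction API itself. As a rough illustration of what such an extractor has to do for one unit signal, the sketch below measures the energy at each candidate tone with the Goertzel algorithm and reports the two strongest. The sample rate, the 10 ms unit length, the carrier frequencies, and all function names are assumptions made for this example, not values from AnnoTone.

```python
import math

SAMPLE_RATE = 44100      # assumed sample rate of the camera's audio track (Hz)
SYMBOL_SECONDS = 0.010   # assumed length of one unit signal (10 ms)
# Hypothetical carrier set: 7 tones in the near-inaudible band.
CARRIERS_HZ = [18000, 18400, 18800, 19200, 19600, 20000, 20400]


def goertzel_power(samples, sample_rate, freq_hz):
    """Return the squared spectral magnitude of samples at freq_hz."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def detect_tone_pair(samples, sample_rate=SAMPLE_RATE):
    """Return the two carriers with the highest power, in ascending order."""
    ranked = sorted(
        CARRIERS_HZ,
        key=lambda f: goertzel_power(samples, sample_rate, f),
        reverse=True,
    )
    return tuple(sorted(ranked[:2]))


def synthesize_symbol(f1, f2, scene_hz=440.0, sample_rate=SAMPLE_RATE):
    """Test helper: a loud 'scene' tone plus two quiet watermark tones."""
    n = int(round(sample_rate * SYMBOL_SECONDS))
    return [
        math.sin(2.0 * math.pi * scene_hz * t / sample_rate)
        + 0.1 * math.sin(2.0 * math.pi * f1 * t / sample_rate)
        + 0.1 * math.sin(2.0 * math.pi * f2 * t / sample_rate)
        for t in range(n)
    ]


if __name__ == "__main__":
    print(detect_tone_pair(synthesize_symbol(18400, 19600)))
```

A real extractor would also need to find packet boundaries and reject windows that contain no watermark; the sketch only covers tone detection within a window that is already aligned.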
  • And when the annotations are no longer needed after that process, their signals can be removed by applying an audio filter,
    so the user gets a clean audio track for the following processes, for example, audio mastering.
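
To make the removal step concrete, here is a minimal low-pass filter sketch in plain Python. It uses a Hamming-windowed sinc FIR design, which is one common choice and not necessarily the filter AnnoTone uses. The 16 kHz cutoff, the tap count, and the function names are assumptions for illustration.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate, num_taps=101):
    """Design a Hamming-windowed sinc low-pass FIR filter (num_taps odd)."""
    fc = cutoff_hz / sample_rate              # cutoff in cycles per sample
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]          # unity gain at 0 Hz


def fir_filter(samples, taps):
    """Convolve samples with taps; output has the same length as the input."""
    mid = (len(taps) - 1) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, tap in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(samples):
                acc += tap * samples[idx]
        out.append(acc)
    return out


def remove_watermark(samples, sample_rate=44100, cutoff_hz=16000):
    """Assumed removal step: keep content below cutoff_hz, drop the rest."""
    return fir_filter(samples, lowpass_taps(cutoff_hz, sample_rate))


def rms(samples):
    """Root-mean-square level of a sample list."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))
```

Because the filter discards everything above the cutoff, any genuine scene audio in that band is lost together with the watermark, which is the quality trade-off mentioned later in the talk.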
  • Next, let me show you some applications of AnnoTone.
  • The first example introduces the concept of “Record-time Editing”.

    In this application, the camera operator annotates one long recording with “success / failure” information about the actor’s performance, instead of repeatedly stopping and restarting the recording whenever the actor makes a mistake.

    After recording, the software automatically extracts only the successful parts of the footage and combines them into a complete video.

    It would be useful for recording lecture videos, which may involve a lot of mistakes by the actor.
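
The extraction step in this application can be sketched as a small interval computation. The sketch assumes each success / failure mark closes the take that began at the previous mark; the talk does not specify this, and the data layout and function names are hypothetical.

```python
def successful_segments(marks, start_time=0.0):
    """Return (start, end) intervals of takes marked as successful.

    marks is a time-ordered list of (timestamp_seconds, label) pairs,
    where label is "success" or "failure". Each mark is assumed to close
    the take that started at the previous mark (or at start_time).
    """
    segments = []
    take_start = start_time
    for timestamp, label in marks:
        if label == "success":
            segments.append((take_start, timestamp))
        take_start = timestamp
    return segments


def merge_adjacent(segments):
    """Join segments that touch, so consecutive successes become one clip."""
    merged = []
    for start, end in segments:
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    marks = [(12.0, "failure"), (30.5, "success"), (41.0, "failure"),
             (60.0, "success"), (75.0, "success")]
    print(merge_adjacent(successful_segments(marks)))
```

The resulting intervals could then be handed to whatever clipping function the authoring software offers.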
  • Using the GPS receiver of a smartphone, we can exploit positional information for video editing.

    While recording, the smartphone periodically embeds the location of the camera, so the user can obtain the sequence of locations, or path, of the footage while editing.

    Such a sequence can be used to create an overlaid map like the image on the left.
    In the movie on the right, the footage is mapped onto a geographical map as a path, and the user can clip a portion of the footage by sketching the corresponding path on the map.

    This map-based clipping might be very useful for dealing with a long video that involves movement, such as a touring video.
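
One simple way to turn a sketched path into a clip is to snap the first and last sketch points to the nearest recorded GPS samples and use their timestamps. The sketch below does only that; the actual matching used in the demo is not described in the talk, and the data layout, coordinates, and function names are hypothetical.

```python
import math


def nearest_index(track, lat, lon):
    """Index of the track sample closest to (lat, lon).

    track is a list of (time_seconds, lat, lon) tuples. Distances use a
    flat-earth (equirectangular) approximation, adequate for short paths.
    """
    def dist2(sample):
        _, s_lat, s_lon = sample
        dx = (s_lon - lon) * math.cos(math.radians(lat))
        dy = s_lat - lat
        return dx * dx + dy * dy

    return min(range(len(track)), key=lambda i: dist2(track[i]))


def clip_range_from_sketch(track, sketch):
    """Return (start_time, end_time) covered by a sketched path.

    sketch is a list of (lat, lon) points; only its endpoints are used.
    """
    i = nearest_index(track, *sketch[0])
    j = nearest_index(track, *sketch[-1])
    lo, hi = min(i, j), max(i, j)
    return track[lo][0], track[hi][0]


if __name__ == "__main__":
    track = [(0.0, 35.000, 139.000), (10.0, 35.001, 139.001),
             (20.0, 35.002, 139.002), (30.0, 35.003, 139.003),
             (40.0, 35.004, 139.004)]
    sketch = [(35.0011, 139.0009), (35.0029, 139.0031)]
    print(clip_range_from_sketch(track, sketch))
```

A path that revisits the same place would need a matcher that follows the whole sketch rather than just its endpoints.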
  • And annotations can be used to create various kinds of overlaid content.

    In this application, the camera operator annotates a video of a chess game with the record of its moves, using the notation interface shown on the left.
    Then, the system automatically generates overlaid graphics of the chess board, as in the movie on the right.
  • We also prepared a mechanism to integrate AnnoTone with Adobe AfterEffects software.

    Our plugin provides annotation data to the editing workflow of AfterEffects, where it can be used to generate various effects and animations.
    It enables users to exploit annotations within established editing practice.
  • The integration works as follows.

    First, the AnnoTone software analyzes the footage to extract annotations.

    Next, the system generates a text layer containing JSON-formatted annotation data at each timeframe.

    Then, the user can associate video effects, parameters, and animations with annotations using the expression function of AfterEffects, which is a lightweight end-user scripting mechanism.
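
The second step, producing JSON text per timeframe, can be sketched independently of AfterEffects. The example below is not the plugin's code and does not use any AfterEffects API: it only shows, under an assumed frame rate and data layout, how timestamped annotations could become per-frame JSON strings and how a consumer could look up the value active at a given frame. All names are hypothetical.

```python
import json


def json_keyframes(annotations, fps):
    """Convert (time_seconds, dict) annotations to (frame, json_text) pairs."""
    ordered = sorted(annotations, key=lambda item: item[0])
    return [
        (int(round(time_s * fps)), json.dumps(data, sort_keys=True))
        for time_s, data in ordered
    ]


def annotation_at(keyframes, frame):
    """Return the parsed annotation active at frame, or None before the first."""
    active = None
    for key_frame, text in keyframes:
        if key_frame > frame:
            break
        active = text
    return json.loads(active) if active is not None else None


if __name__ == "__main__":
    keyframes = json_keyframes(
        [(0.0, {"x": 138.0019, "y": 38.1384}),
         (1.0, {"x": 139.0133, "y": 38.43405})],
        fps=30,
    )
    print(keyframes)
    print(annotation_at(keyframes, 45))
```

In the talk's workflow, the lookup side would be written as an AfterEffects expression that reads the text layer; that part is not shown here.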
  • Next, let me talk about the technical details of AnnoTone's audio watermarking.
  • It is known that human hearing sensitivity drops as sound frequency increases,
    and most people cannot hear high-frequency tones above 17 or 18 kHz.
  • On the other hand, an ordinary video camera with a microphone can record high-frequency sounds up to 22 kHz.
    Therefore, we can hide information in the high-frequency range of the audio track as modulated signals.
  • This is an example spectrogram of the audio track of a video.
    This is the high-frequency region, which is almost inaudible to humans,
    and we hide information in it like this.
  • But what is the benefit of using audio watermarking for data hiding?

    First of all, it is compatible with almost all video cameras, because embedding only requires a microphone.

    Second, the synchronization between annotations and video timestamps is preserved throughout the editing process, because the annotations are embedded directly in the video sequence.

    Additionally, the watermark signals can be easily removed by applying a low-pass filter.
  • Our watermarking protocol uses Dual-Tone Multi-Frequency (DTMF) signaling to modulate digital data into audio signals.

    Each unit signal represents 4 bits of information as a combination of two single tones chosen from 7 frequencies.

    Our packet representation has a variable-length payload and a gross data rate of 400 bps.
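
The talk does not say how the two tones are picked from the 7 frequencies (standard telephone DTMF uses a 4×4 grid of 8 frequencies). One reading that yields 4 bits is to allow any unordered pair: 7 frequencies give 21 pairs, of which 16 are enough. The sketch below assumes that reading; the carrier values and the pair-to-value mapping are hypothetical. At 4 bits per unit signal, 400 bps gross corresponds to 100 unit signals per second.

```python
from itertools import combinations

# Hypothetical carrier set: 7 tones in the near-inaudible band.
CARRIERS_HZ = [18000, 18400, 18800, 19200, 19600, 20000, 20400]

ALL_PAIRS = list(combinations(CARRIERS_HZ, 2))   # C(7, 2) = 21 unordered pairs
NIBBLE_TO_PAIR = ALL_PAIRS[:16]                  # assumed mapping: value -> pair
PAIR_TO_NIBBLE = {pair: value for value, pair in enumerate(NIBBLE_TO_PAIR)}

BITS_PER_SYMBOL = 4
GROSS_BPS = 400                                  # from the talk
SYMBOLS_PER_SECOND = GROSS_BPS // BITS_PER_SYMBOL


def encode_bytes(payload):
    """Split each byte into two 4-bit values and map each to a tone pair."""
    symbols = []
    for byte in payload:
        symbols.append(NIBBLE_TO_PAIR[byte >> 4])
        symbols.append(NIBBLE_TO_PAIR[byte & 0x0F])
    return symbols


def decode_symbols(symbols):
    """Inverse of encode_bytes: tone pairs back to bytes."""
    nibbles = [PAIR_TO_NIBBLE[pair] for pair in symbols]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


if __name__ == "__main__":
    symbols = encode_bytes(b"ok")
    print(symbols)
    print(decode_symbols(symbols), SYMBOLS_PER_SECOND)
```

Packet framing (headers, length fields, error detection) is left out because the talk gives no details about it.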
  • Next, let me introduce some work related to AnnoTone.
  • ContextCam is a special camera that can record contextual information for home videos, such as location and the presence of people.

    It stores annotations in the frames of the video using an image watermarking technique.

    Indeed, it realizes context-aware video recording for a specific purpose.

    However, because it is a special camera, this technique is not compatible with existing equipment.
  • Cryptone, or the Ultra Sound Control protocol, is an interaction technique between a loudspeaker and the smartphones of the audience at music venues.

    It uses high-frequency modulation to convey simple information.

    AnnoTone’s audio watermarking technique is very similar to that of Cryptone, and can be seen as an extension of it.
    But its purpose is very different.
  • Let me briefly show you the results of our performance evaluation.
  • First, we measured the maximum watermarking data rate that achieves sufficient reliability.

    The results showed that a correct detection rate of almost 100% can be achieved at an annotation data rate of 400 bps in four acoustic environments: a silent room, a public street, and with rock music or electronic music playing.
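
The gross bitrates on the evaluation chart (667, 571, 500, 444, 400, 364 bps) are consistent with 4 bits per unit signal at unit lengths of 6 to 11 ms. That relationship is an inference from the numbers, not something stated in the talk; the short calculation below reproduces it.

```python
BITS_PER_SYMBOL = 4  # from the talk: each unit signal carries 4 bits


def gross_bitrate_bps(symbol_ms):
    """Gross bitrate, rounded to whole bps, for a given unit-signal length."""
    return round(BITS_PER_SYMBOL * 1000 / symbol_ms)


# Assumed unit-signal lengths of 6..11 ms (inferred, not from the talk).
CHART_BITRATES = [gross_bitrate_bps(ms) for ms in range(6, 12)]

if __name__ == "__main__":
    print(CHART_BITRATES)
```

Under this reading, the 400 bps operating point corresponds to a 10 ms unit signal.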
  • Next, we measured how far watermark signals can travel through the air from a smartphone speaker.

    The results showed that they can travel up to about 20 cm, which implies that users have some flexibility in the hardware setup when the shapes of the devices make it difficult to attach the smartphone directly to the camera.
  • Third, we tested the durability of watermark signals against audio format conversion, a common process in video editing.

    According to the results, watermarks are preserved after conversion into the Ogg Vorbis, AC-3 and AAC formats if the bitrate is high enough.

    On the other hand, with MP3 as the destination format, the watermarks could not be preserved even at a very high bitrate setting.
  • Finally, we tested the transparency, or imperceptibility, of the audio watermark signals to the human ear.

    We hired 6 participants and asked them to click a button whenever they noticed noise while listening to an annotated audio track.

    The results showed that the watermark signals are not completely transparent, and young participants in particular were able to notice them.

    But they became almost completely transparent after applying a low-pass filter.
  • And we admit that AnnoTone has some limitations.

    First of all, each different type of annotation requires one-off development of a smartphone annotation-embedding application.

    Second, some loss of audio quality during watermark removal is inevitable, because removal simply uses low-pass filtering.

    And the annotation data rate is significantly limited, so if we want to annotate a video with a large amount of data, we should consider another approach.
    For example, we can use AnnoTone’s annotations as anchors that point to the body of the annotation data, recorded separately.
  • As future work,
  • We are thinking about transmitting watermark signals from publicly installed speakers to annotate many videos simultaneously.

    This could be used to synchronize or integrate a large number of videos recorded at the same place, such as a stadium, to create new types of video content, like multi-view videos.

    Also, a similar technique could be used for entertainment at amusement parks.
  • Then, let me conclude the presentation.
  • We proposed …
    The benefit of the technique is that …
    Thank you for listening.
  • AnnoTone (CHI 2015)

    1. AnnoTone: Record-time Audio Watermarking for Context-aware Video Editing. Ryohei Suzuki, Daisuke Sakamoto, Takeo Igarashi, The University of Tokyo. CHI 2015 @ Seoul. Session: What do I hear? Communicating with Sound
    2. Video recording and sharing have become casual hobbies for everyone.
    3. Camera, Computer, Software, Broadcasting
    4. Video Editing is Still Difficult. Why? 1. Cost of learning video authoring tools is high 2. Context-aware editing requires much labor for careful review and trial-and-error • Adding visual effects • Clipping scenes • Adding captions and overlays • Using additional information (e.g., GPS)
    5. Our Objective: Annotating videos with contextual information during recording to facilitate video editing 1. Automate & speed-up video editing activity 2. Enhance expressions using additional data
    6. In this talk, we propose 1. A video-annotation technique requiring no special equipment 2. A video-editing workflow that exploits contextual information for efficient editing.
    7. Core Ideas • Encoding contextual information as inaudible sound signals • Embedding encoded annotations directly into the audio track of video during recording • Extracting the embedded information while editing process on demand
    8. Annotation Embedding with Smartphone
    9. 1. Hardware Setup • Attach smartphone to video camera • Launch annotation-embedding application [Figure labels: Attaching; Launching application]
    10. 2. Video Recording • Gathering annotation from user input or sensors • Converting them into inaudible audio signals [Figure labels: User Input; Sensors; Scene; Annotation Signals]
    11. Editing Workflow with Embedded Annotations
    12. Workflow Overview • Extract embedded annotation from audio track • Remove annotation signals after editing
    13. Editing Pipeline: Generally, video-editing involves a line of pipelined processes. [Figure labels: Clipping; Adding Captions & Effects; Color Correction; …]
    14. Editing Pipeline: Annotated audio track can pass through the existing pipeline as ordinary one. [Figure labels: Clipping; Adding Captions & Effects; Color Correction; …]
    15. Annotation Extraction: Annotation data is extracted on demand using our Watermark Extraction API. [Figure labels: Clipping; Adding Captions & Effects; Color Correction; …; Watermark Extractor; Annotation Data]
    16. Annotation Removal: After the process, annotation signals can be removed by applying an audio filter. [Figure labels: Clipping; Adding Captions & Effects; Audio Mastering; …; Audio Filter]
    17. Applications
    18. Record-time Editing. Recording: information of Success/Failure. Editing: Automatic extraction of successful parts. [Figure: recording timeline marked Success (Good!), Failure (Bad!), Success (Good!); automatic extraction & combining keeps the Success parts]
    19. Video-editing with GPS. Recording: GPS positions. Editing: location-aware editing. [Figure labels: Clipping movie by sketching on a map; Automatic map overlay]
    20. Automatic Overlaying. Recording: chess note of a game. Editing: automatic overlaying of board graphics. [Figure labels: Notation UI; Synthesized video]
    21. Integrating with AfterEffects: AnnoTone plugin provides annotation data for AE which can be used for generating effects. Exploiting annotations with existing practice. [Figure: Controlling AE animation with sensor data]
    22. Integrating with AfterEffects 1. Analyzing footage to extract annotations 2. Generating a text layer containing JSON-formatted annotation data at timeframe 3. Associating video effects/parameters with annotations using expressions mechanism [Figure labels: Footage; JSON text layer [{x: 138.0019, y: 38.13840}, {x: 139.0133, y: 38.43405}]…; Effect control (Javascript)]
    23. Annotation by Audio Watermarking
    24. Human’s Hearing Characteristics: Human cannot perceive high-frequency sounds. Sakamoto, Masayuki, et al. “Average thresholds in the 8 to 20 kHz range as a function of age.” Scandinavian Audiology 27.3 (1998): 189-192.
    25. Data-hiding as High-frequency Audio Signals: We can hide information in the audio track as high-frequency signals (audio watermarks). [Figure: frequency axis (Hz) marked at 20, 18k, 20k, 22k; Audible Range (Human); Recordable Range (Microphone); High-frequency Range]
    26. Data-hiding as High-frequency Audio Signals [Figure: Spectrogram of audio track; High-frequency region (almost inaudible); Hidden information]
    27. Benefit of Audio Watermarking • Compatible with almost all video cameras • Consistent synchronization between annotations and video sequence • Removable by applying low-pass filter
    28. Watermarking Protocol • Dual-Tone Multi-Frequency (DTMF) – Representing 4-bits information by combination of two single tones from 7 frequencies • Packet representation – Variable-length payload – 400 bps gross data rate [Figure: Spectrogram of a watermark packet]
    29. Related Work
    30. ContextCam [Patel & Abowd, 2004]: Using special camera to record contexts of home videos. Storing annotations in frames by image watermarking. Incompatible with existing video cameras.
    31. Cryptone (Ultra Sound Control) [Hirabayashi & Shimizu, 2012]: Interaction between loudspeaker and smartphones using high-frequency tones to convey information. AnnoTone uses similar audio data-hiding method for video editing support.
    32. Performance Evaluation
    33. Data-rate vs. Reliability: ~100% correct detection rate was achieved with 400 bps annotation data rate. [Chart: correct detection rate (%) vs. gross bitrate (667, 571, 500, 444, 400, 364 bps) for silent, public, rock, electronic]
    34. Travel Distance: Watermark signal can travel up to 20 cm through air from a smartphone speaker. [Chart: correct detection rate (%) vs. distance between speaker and microphone (0 to 30 cm) for silent, public, rock, electronic]
    35. Durability against Conversion: Watermarks are preserved after conversion into Ogg Vorbis, AC-3 and AAC with enough bitrate. [Chart: correct detection rate (%) vs. bit rate (128, 192, 256, 320 kbps) for MP3, Ogg Vorbis, AC-3, AAC]
    36. Transparency for Human Ear: Measured noticeability of watermarks for human • Click a button after notice of noise (6 participants) [Chart: noticed watermark rate (%), before erasure and after erasure, for silent, public, rock, electronic]
    37. Limitations • One-off development of annotation-embedding applications • Audio quality loss in watermark removal • Limited data-rate of annotation
    38. Future Work
    39. Embedding from Public Speaker • Synchronization & integration of large number of videos to create multi-view videos, etc. • Entertainment use at amusement parks, etc. [Image credits: “Sleeping Beauty Castle at Disneyland” by Lyght, licensed under CC BY-SA 3.0; “Picture of Stadium” by Jazza5, licensed under CC BY-SA 3.0]
    40. Conclusion
    41. We proposed a video annotation technique using audio watermarking, and a video-editing workflow exploiting annotations. Benefit: AnnoTone can facilitate and enhance non-professional video editing process without special equipment.
    43. Compared with Smartphone Recording: Some smartphone camera apps can record annotation as metadata format (e.g., Adobe XMP) – Of course, using such apps is clever for smartphone recording occasions. What’s AnnoTone’s superiority? • Dedicated video cameras are still superior to smartphone camera – In resolution, definition, lens quality, etc. • No need of dealing with external metadata – Because annotations are directly embedded as sound