Example-Based Remixing of Multimedia Contents



  1. 1. Noboru Babaguchi, Osaka University. Joint work with Prof. N. Nitta. ICME2013 co-located workshop MMIX13, keynote. San Jose, July 18, 2013
  2. 2. • Definitions, Background, Related Work • Multimedia Remixing Support System • Video Clip Sequence Creation • Music Clip Selection • Shot Extraction • Conclusion and Future Work
  3. 3. • From Wikipedia… A remix is a song that has been edited to sound different from the original version. The person who remixed it might have changed the pitch of the singers' voice, changed the tempo and speed and has made the song shorter or longer, or instead of hearing just one person singing they might have duplicated the voice to make it sound like two people are singing, or make the voice echo. Remixes should not be confused with edits, which usually involve shortening a final stereo master for marketing or broadcasting purposes. … A remix song recombines audio pieces from a recording to create an altered version of the song. In recent years the concept of the remix has been applied analogously to other media. … The Scary Movie series is famous for its comic remix of various well-known horror movies such as Ring, Scream, and Saw.
  4. 4. Video Remix: a video clip made by recombining various media components to create an altered version of the original videos. Components: original video clips (video clip selection & arrangement); video transition effects (cut, fade-in/out, dissolve, etc.); audio clips (music, sound effects, voices, etc.). Combining them into a multimedia stream yields video remixes (e.g. movie trailers). How can we create video remixes of good quality? [Example frames from “The School of Rock” (2003)]
  5. 5. Two aspects in video remixing. • Semantic aspect: What should we present? (Semantic content.) Examples: highlights of sports games, etc. (video summarization). • Affective aspect: How should we present the video content? (Aesthetic compatibility, film syntax.) Examples: commercial films, movie trailers, etc. The question is how to arrange video clips, and what music clip to add, to enhance the expressive quality.
  6. 6. Problem of Video Remixing. A video remix = a sequence of D video scenes plus a sequence of D music clips (scene-music relation). A video scene = a sequence of L video shots (shot-scene relation). A video shot = an excerpt from a video clip. A music clip accompanies each scene to maintain the feeling of continuity in a scene.
  7. 7. • Hitchcock [Girgensohn2001] • Template-based Editing [Davis2003] • Lazycut [Hua2005] • Emotion-based [Canini2010] • Video-Music Mixing [Mulhem2003][Hua2004][Wang2005][Yoon2009][Cristani2010]
  8. 8. Video clip selection and arrangement. Focused on how various types of video clips are arranged in sequence. For example… Film Syntax: • A scene has to have at least three video clips [Sundaram01]. • Two video shots of extremely different shot sizes should not be connected [Kumano02]. • The duration of a shot recorded with the camera fixed is up to 15 seconds [Kumano02]. Aesthetic Compatibility: • Shots with similar emotional impact should be connected [Canini10]. References: [Sundaram01] H. Sundaram, et al., “Condensing computable scenes using visual complexity and film syntax analysis,” Proc. ICME, pp.389-392, 2001. [Kumano02] M. Kumano, et al., “Video editing support system based on video content analysis,” Proc. ACCV, pp.628-633, 2002. [Canini10] L. Canini, et al., “Interactive video mashup based on emotional identity,” Proc. European Signal Processing Conf., pp.1499-1503, 2010.
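The film syntax rules listed on this slide can be read as simple checks on a candidate scene. The following is a minimal sketch under stated assumptions: the `Shot` type, the ordinal shot-size scale, and the threshold `MAX_SIZE_GAP` for "extremely different" are hypothetical and do not come from the cited papers; only the 3-clip minimum and the 15-second limit are taken from the slide.

```python
from dataclasses import dataclass

# Ordinal scale for shot sizes; the scale itself is an assumption.
SHOT_SIZES = {"close-up": 0, "medium": 1, "long": 2, "extreme-long": 3}
MIN_CLIPS_PER_SCENE = 3      # from the slide, citing [Sundaram01]
MAX_FIXED_SHOT_SEC = 15.0    # from the slide, citing [Kumano02]
MAX_SIZE_GAP = 1             # hypothetical reading of "extremely different"


@dataclass
class Shot:
    duration_sec: float
    size: str            # one of the keys of SHOT_SIZES
    camera_fixed: bool


def check_scene(shots):
    """Return a list of rule violations for one scene (empty if none)."""
    problems = []
    if len(shots) < MIN_CLIPS_PER_SCENE:
        problems.append("scene has fewer than 3 video clips")
    for i, shot in enumerate(shots):
        if shot.camera_fixed and shot.duration_sec > MAX_FIXED_SHOT_SEC:
            problems.append(f"shot {i}: fixed-camera shot longer than 15 s")
    for i in range(len(shots) - 1):
        gap = abs(SHOT_SIZES[shots[i].size] - SHOT_SIZES[shots[i + 1].size])
        if gap > MAX_SIZE_GAP:
            problems.append(f"shots {i}-{i + 1}: shot sizes differ too much")
    return problems
```

Hand-written rules like these are what the example-based approach in the rest of the talk tries to avoid having to specify explicitly.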
  9. 9. Music clip selection. Focused on which types of music clips are mixed with video shots. For example… Aesthetic Compatibility, determined heuristically: • dynamic, motion, and pitch of image and audio streams coincide with each other [Mulhem03]. • novelty, velocity, and brightness of image and audio streams coincide with each other [Yoon09]. Determined statistically: • brightness of image and audio streams, and rhythm of audio stream and optical flow in image stream, coincide with each other [Cristani10]. References: [Mulhem03] P. Mulhem, et al., “Pivot vector space approach for audio-video mixing,” IEEE Multimedia, 10(2), pp.28-40, 2003. [Yoon09] J.-C. Yoon, et al., “Automated music video generation using multi-level feature-based segmentation,” MTAP, 41(2), pp.197-214, 2009. [Cristani10] M. Cristani, et al., “Toward an automatically generated soundtrack from low-level cross-modal correlations for automotive scenarios,” Proc. ACM Multimedia, pp.551-559, 2010.
  10. 10. Multimedia Remixing Support System
  11. 11. • It is difficult to explicitly define the rules and know-how about how the video and music clips should be arranged, considering the aesthetic compatibility. • The rules and structures commonly used in professionally created examples can be modeled by standard machine learning techniques. • Non-professional users can be supported through an interface based on the models, which implicitly describe shot-scene and scene-music relations considering aesthetic compatibility.
  12. 12. A Set of Video Remix Examples Professionally Created Video Remixes
  13. 13. Target: remixing original video clips based on examples. Inputs: a set of video remix examples, a set of original video clips, a set of music clips. Output: video remix.
  14. 14. Through the interface, the user follows three steps: I) Video Clip Sequence Creation, II) Music Clip Selection, III) Shot Extraction (Video and Music Synchronization). Inputs: a set of video clips, a set of music clips, a set of video remix examples. The examples provide a video remix template (shots and scenes) and video clip suggestions. Output: video remix.
  15. 15. N. Nitta and N. Babaguchi, “Example-based video remixing,” Multimedia Tools and Applications, 51(2), pp.649-673, 2011 N. Nitta and N. Babaguchi, “Example-based home video remixing,” Proc. ICME, 2011
  16. 16. Overview of Procedure I). Video remix examples → template generation → symbol sequence (template, e.g. B A B C G E). Home (personal) videos → segmentation → video clips, each scored by suitability to template [Nitta2011] and perceived quality [Tao2007]. Template and scored video clips are presented in the interface. [Tao2007] T. Mei, et al., "Home Video Visual Quality Assessment With Spatiotemporal Factors," IEEE Trans. Circuits and Systems for Video Technology, vol.17, no.6, pp.699-706, 2007.
  17. 17. Example-based Template Generation. Video remix examples → sequences of video shots → feature extraction (low-level features: shot length, brightness, motion intensity, with/without camera work, with/without human objects) → symbolization (each shot becomes a symbol a, b, c, …, i) → symbol sequence → HMM (states such as slow scene, active scene) → GA → video remix template (new symbol sequence & state sequence): a sequence of L shots and a sequence of D scenes.
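The symbolization step in this pipeline turns each shot's low-level features into one symbol. The slide does not say how symbols are assigned, so the following is only a sketch that assumes simple threshold quantization of the five listed features; the function name `symbolize`, the dictionary keys, the thresholds, and the 32-symbol alphabet are all hypothetical.

```python
import string

# 32 symbols, one per combination of five binary features (assumed alphabet).
ALPHABET = string.ascii_lowercase + "012345"


def symbolize(shots, length_split=2.0, bright_split=0.5, motion_split=0.5):
    """Map each shot (a dict of features) to one symbol by thresholding."""
    symbols = []
    for shot in shots:
        bits = (
            shot["length_sec"] >= length_split,
            shot["brightness"] >= bright_split,
            shot["motion"] >= motion_split,
            bool(shot["camera_work"]),
            bool(shot["human"]),
        )
        # Pack the five binary features into an index in the range 0..31.
        index = sum(int(bit) << position for position, bit in enumerate(bits))
        symbols.append(ALPHABET[index])
    return "".join(symbols)
```

The resulting string is the kind of symbol sequence that a discrete HMM could then be trained on; the HMM and GA stages are not sketched here.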
  18. 18. From Shot to Video Clip. Shots in the target video are divided into video clips based on the camerawork. Example: a home video divided into Video Clip 1, Video Clip 2, and Video Clip 3, annotated with suitability to template (0.3, 0.2, 0.7) and perceived quality (0.7, 0.5, 0.6).
  19. 19. Interface: video clip selection. Timeline presentation of the video remix template; 3D book-style video clip presentation, in which the spine and fore edge of each book show suitability to template and perceived quality (marks: ◎ × △ ▲).
  20. 20. Interface
  21. 21. • Video remix examples: 61 action movie trailers • Video clips: 265 home (personal) video clips recording a sports field day held by a kindergarten • Subjective evaluation by 8 subjects • Compared with a video clip sequence created by considering only the perceived quality of video clips
  22. 22. Created Video Clip Sequence. Without template: subjective score 3.5. With template*: subjective score 3. Using action movie trailers as examples resulted in a sequence of many short video clips. * Selected video clips are shortened according to the template.
  23. 23. N. Nitta and N. Babaguchi, “Example-based video remixing support system,” Proc. ACM Multimedia, pp.563-572, 2011
  24. 24. Overview of Procedure II). For a video clip sequence (scene), find visually similar video remix examples (scenes) in the set of video remix examples, then find similar music clips in the set of music clips.
  25. 25. • Evaluate the compatibility among video scenes and music clips by their distances in the video scene and music feature spaces. • Learn a non-linear mapping of the music feature space so that the distances among video scenes and the distances among the music clips mixed with them are correlated [Suzuki07]. Figure panels: video scene feature space (example video scenes); music clip feature space (music clips mixed to example video scenes); expected music clip feature space (music clips mixed to example video scenes). [Suzuki07] K. Suzuki, et al., “A similarity-based neural network for facial expression analysis,” Pattern Recognition Letters, 28(9), pp.1104-1111, 2007
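The distance-based selection described above can be sketched as two nearest-neighbor lookups: first find the example scenes closest to the target scene, then pick the library clips closest to the music mixed with those examples. This is an illustrative sketch, not the authors' implementation; `select_music`, the feature vectors, and the use of Euclidean distance are assumptions, and the learned non-linear mapping of the music features is taken as already applied.

```python
import math


def euclid(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_music(scene_feat, examples, music_library, n_examples=3):
    """Suggest music clip names for one video scene.

    examples: list of (scene_features, music_features) pairs from remix examples.
    music_library: dict mapping clip name -> music_features.
    """
    # Step 1: the example scenes that look most like the target scene.
    nearest = sorted(examples, key=lambda ex: euclid(scene_feat, ex[0]))[:n_examples]
    # Step 2: for each of them, the library clip closest to its music.
    picks = []
    for _, example_music in nearest:
        name = min(music_library, key=lambda m: euclid(music_library[m], example_music))
        if name not in picks:
            picks.append(name)
    return picks
```

Because duplicates are dropped, the function may return fewer than `n_examples` names when several examples point to the same library clip.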
  26. 26. • Music Clip Selection • Video Scenes: Visual Features • Music Clips: Audio Features • [Zettl99] • Emotion-based Music Classification. [Zettl99] H. Zettl, “Sight Sound Motion: Applied Media Aesthetics,” Wadsworth Publishing, 1999
  27. 27. • Consists of 2 neural networks. • Input: audio features x^A_i and x^B_i of music clips A and B. • Output: transformed audio features y^A_j and y^B_j of music clips A and B. • Learn the weights w_{l,m} of the neural networks so that the difference between d_AB (the distance between y^A_j and y^B_j) and the teacher signal T_AB (the distance between the video scenes mixed with music clips A and B) is minimized. • w_{l,m}: weight for the edge between nodes l and m. Diagram: Input A (x^A_i) → Neural Network A → y^A_j; Input B (x^B_i) → Neural Network B → y^B_j; distance calculation gives d_AB, which is compared with the teacher T_AB.
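One way to write the training objective on this slide is given below. It is a sketch under assumptions: the slide only says the differences between d_AB and T_AB are minimized, so the Euclidean norm, the squared-error form summed over training pairs, and the gradient-descent update with learning rate \eta are illustrative choices rather than details confirmed by the source.

```latex
\begin{align}
d_{AB} &= \lVert y^{A} - y^{B} \rVert_2 \\
E(w) &= \sum_{(A,B)} \left( d_{AB} - T_{AB} \right)^2 \\
w_{l,m} &\leftarrow w_{l,m} - \eta \, \frac{\partial E}{\partial w_{l,m}}
\end{align}
```

Here y^A and y^B are the transformed audio features of music clips A and B, and T_AB is the distance between the video scenes mixed with them.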
  28. 28. Interface
  29. 29. • Video remix examples: 61 action movie trailers • Video scene examples: 45 scenes • Music clips: 180 music clips of various genres (movie soundtracks, classical music, Japanese pop, Western pop, etc.) • Video clips: shots extracted from original movies; 265 home video clips recording a sports field day held by a kindergarten • Video clip sequence: made by Procedure I)
  30. 30. • Input: 10 video scenes randomly extracted from movie trailers (without audio stream). • 10 subjects rated (1: very bad to 10: very good) the 10 video scenes mixed with: Video1) 3 music clips selected by the proposed approach; Video2) music clips most similar to the music excerpts mixed with the 3 least similar video scenes; Video3) the music clip mixed with the video scene in the movie trailer (baseline: professional); Video4) 3 music clips selected in the same way as for Video1) without music feature space transformation; Video5) 3 music clips selected in the same way as for Video2) without music feature space transformation. Average subjective scores: Video1 6.1, Video2 4.4, Video3 7.2, Video4 5.0, Video5 4.5. Video1 - Video2 = 1.72±0.34 (95% confidence interval) ⇒ indicates the effectiveness of similarity-based music clip selection. Video1 - Video4 = 1.11±0.35 ⇒ indicates the effectiveness of music feature space transformation. Video1 is closest to Video3 ⇒ selected music clips are subjectively closest to professionally selected ones.
  31. 31. Score: Video1 6.8, Video2 2.7, Video3 8.3, Video4 6.4, Video5 3.0
  32. 32. Score: Video1 8.5, Video2 2.1, Video3 5.5, Video4 5.5, Video5 2.3
  33. 33. Video Clip Sequence after Music Mixing. Without template: subjective score 3.8. With template*: subjective score 5.3. The subjective score improved substantially after music mixing: the created video clip sequence and the selected music clips are synergetic in improving the expressive quality. * Selected video clips are shortened according to the template.
  34. 34. Y. Kurihara, N. Nitta, and N. Babaguchi, “Automatic appropriate segment extraction from shots based on learning from example videos,” Proc. PSIVT, pp.1082-1093, 2009 Y. Kurihara, N. Nitta, and N. Babaguchi, “Appropriate segment extraction from shots based on temporal patterns of example videos,” Proc. MMM, pp.253-264, 2008
  35. 35. Shot Extraction from Selected Video Clip. Video clip sequence (Video Clip 1, Video Clip 2, Video Clip 3) → video remix (Shot 1, Shot 2, Shot 3). • A video clip needs to be shortened. • A video clip contains redundant parts. Which part of a video clip should be extracted as a shot?
  36. 36. Overview of Procedure III). Training: in each example video clip, the selected part (shot) and the discarded part (non-shot) go through feature extraction and symbolization into symbol sequences, which are used to train a shot HMM and a non-shot HMM. Shot extraction: features are extracted from a video clip, and a pattern scan finds the k frames which best match the shot HMM.
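The pattern scan can be sketched as a sliding window over the clip's symbol sequence. This is a minimal sketch with assumptions: `best_window` is a hypothetical name, the two scoring functions stand in for log-likelihoods under trained shot and non-shot HMMs (not implemented here), and scoring by their difference is an assumption rather than a detail stated on this slide.

```python
def best_window(symbols, k, shot_loglik, nonshot_loglik):
    """Return (start, score) of the k-frame window that looks most shot-like.

    shot_loglik / nonshot_loglik: callables mapping a window of symbols to a
    log-likelihood; placeholders for trained shot and non-shot HMMs.
    """
    if k <= 0 or k > len(symbols):
        raise ValueError("k must be between 1 and the clip length")
    best_start, best_score = 0, float("-inf")
    for start in range(len(symbols) - k + 1):
        window = symbols[start:start + k]
        score = shot_loglik(window) - nonshot_loglik(window)
        if score > best_score:   # strict '>' keeps the earliest window on ties
            best_start, best_score = start, score
    return best_start, best_score
```

With real models, each callable would evaluate the window with the forward algorithm of the corresponding HMM.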
  37. 37. • Shot classification: action and conversation (shot types: action, conversation, scenery, …). • Feature extraction: each type of shot is characterized by different features. ※ VSTD: Volume Standard Deviation, LVFR: Low Volume Signal Ratio, ERSB: Energy Ratio of Frequency SubBand, ZCR: Zero Crossing Ratio
  38. 38. Experiments. • Examples: movies + trailers. Video clips: shots in movies. Shots: shots in trailers. • Shot extraction from 69 video clips (shots in movies). • Shot length (k) = length of the corresponding shot in the trailer (32.3% of the video clip length on average). Data: training: 10 action, 12 conversation; test: 47 action, 22 conversation.
  39. 39. Objective Evaluation. • Compare the extracted shot with the ground truth (shot in trailer). • 1 frame = 0.1 sec. Example, video clip (action): 82 frames, k = 9 frames, difference: 3 frames (0.3 sec).
  40. 40. Example, video clip (conversation): 107 frames, k = 17 frames, difference between extracted shot and ground truth: 3 frames (0.3 sec).
  41. 41. Correctly Extracted Shot. Video clip (action): 47 frames, k = 5 frames, difference: 3 frames. [Plot: log P versus frame number for the shot HMM f(n), the non-shot HMM g(n), and their difference f(n)-g(n), labeled hk(f)-gk(f), with the extracted shot and the ground truth marked.]
  42. 42. Incorrectly Extracted Shot. Video clip: 35 frames, k = 7 frames, difference: 26 frames. [Plot: log P versus frame number for the shot HMM f(n), the non-shot HMM g(n), and their difference f(n)-g(n), labeled hk(f)-gk(f), with the extracted shot and the ground truth marked.]
  43. 43. Objective Evaluation. accuracy = (# of correct extractions) / (# of video clips). ※ Correct extraction: the shot was extracted within a T-frame difference (1 frame = 0.1 sec). Results: T=2: action 53% (25/47), conversation 50% (11/22); T=3: action 60% (28/47), conversation 64% (14/22); T=5: action 72% (34/47), conversation 73% (16/22). Correct shots were extracted from 72.5% (50/69) of video clips when T=5.
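The accuracy measure on this slide is straightforward to compute once the extracted and ground-truth shot positions are known. The sketch below assumes that positions are compared by their start frames and that a difference of exactly T frames still counts as correct; both are assumptions, as is the function name.

```python
def extraction_accuracy(extracted_starts, truth_starts, tolerance_frames):
    """Fraction of clips whose extracted shot starts within T frames of the truth."""
    if len(extracted_starts) != len(truth_starts) or not truth_starts:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        abs(extracted - truth) <= tolerance_frames
        for extracted, truth in zip(extracted_starts, truth_starts)
    )
    return correct / len(truth_starts)
```

As a check against the reported figures, 34 + 16 = 50 correct extractions out of 47 + 22 = 69 test clips gives 50/69, which rounds to the 72.5% quoted for T=5.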
  44. 44. • 14 subjects watched the original long video clips and then three kinds of short extracted shots, ① ground truth, ② extracted shot, ③ random shot, in random order, and ranked them (ties allowed).
  45. 45. Example: video clip (36 frames), k = 15 frames. Three shots ①, ②, ③ are shown: ① random shot, ② extracted shot, ③ ground truth.
  46. 46. Subjective Evaluation. Action: 18 video clips; conversation: 13 video clips. Rank distribution: ground truth: rank 1 69.1%, rank 2 26.9%, rank 3 4.0%; extracted shot: rank 1 53.9%, rank 2 38.9%, rank 3 7.2%; random shot: rank 1 7.1%, rank 2 12.9%, rank 3 80.0%. Extracted Shot ≒ Ground Truth >> Random Shot
  47. 47. Created Video Remix. With template (proposed): subjective score 6.2. Without template (comparative): subjective score 3.9. Scores after each procedure: proposed: I 3, II 5.3, III 6.2; comparative: I 3.5, II 3.8, III 3.9. Length (min:sec), as listed on the slide: 0:36, 0:43, 10:56, 10:59.
  48. 48. Conclusion. • Introduced an example-based approach for video remixing: video clip sequence creation, music clip selection, shot extraction, interface. • Experiments using movie trailers as remix examples, and movies and home videos as video clips. • Verified the effectiveness of using remix examples: with support (6.2), without support (3.9).
  49. 49. • Improvement of the interface • More investigations using various types/genres of video remix examples • How many examples do we need? • Good examples can reduce the number of examples needed.