TAAI 2016 Keynote Talk: It is all about AI

  1. 1. 1 It is all about AI Mark Liao Institute of Information Science Academia Sinica, Taiwan (TAAI 2016)
  2. 2. Contents of this talk • Automatic Concert Video Mashup • Spatio-Temporal Learning of Basketball Offensive Strategies 2
  3. 3. 1 Automatic Concert Video Mashup Mark Liao Institute of Information Science Academia Sinica, Taiwan
  4. 4. What is concert video mashup ? • A concert video mashup process takes all the videos captured from different locations in a concert hall and converts them into a single complete, non-overlapping, seamless, high-quality video. 4
  5. 5. Why concert video mashup ? • To provide people who could not attend the live concert a second chance to enjoy the performance with similar quality. 5
  6. 6. Many problems to be solved ! • Because the videos were captured without coordination, incompleteness and redundancy are common. • The order in which to watch these videos is often confusing. • Since the videos were captured with handheld devices, their visual/audio quality cannot be guaranteed. 6
  7. 7. Issues that need to be addressed • The order to watch • Visual quality optimization • Seamless sound track connection • No redundancy • No missing video segments • Mashup results follow the rules defined by the language of film 7
  8. 8. Potential Issues: The order to watch (1/5) • Three video clips captured from 3 different angles and distances; clips 1 and 2 partially overlap, clip 3 is independent 8
  9. 9. Potential Issues: Multiple audio sequence alignment (2/5) Case 1: partially overlapped Case 2: no overlap 9
  10. 10. Potential Issues (3/5) • Among three videos that are coherent in time, which one should be chosen ? (3 different locations) -- follow the rules of the language of film ! 10 Medium Shot Long Shot Extreme Long Shot
  11. 11. Potential Issues (4/5) • Among several qualified video clips, which one should be chosen ? Same distance ! -- visual quality ? audio quality ? 11 Extreme Long Shot Extreme Long Shot
  12. 12. Potential Issues (5/5) • How to incorporate the emotion, ideas, and art of a music director into a concert video mashup process ? Can a CNN learn facial emotion ? 12
  13. 13. Previous Effort • The research area closest to ``automatic video mashup’’ is ``summarization of multi-view videos’’. • The objective of the latter is to produce a reduced set of abstracted videos or a key-frame sequence that represents the most prominent parts of the input videos. 13
  14. 14. Literature related to video mashup (1/3) • [Shrestha et al.] formulate video mashup as an optimization problem - pros – optimizes visual quality and diversity constraints - cons – does not take into account the professional view of a visual storytelling director P. Shrestha et al., Automatic mashup generation from multiple-camera concert recordings, ACM MM, 2010. 14
  15. 15. Literature related to video mashup (2/3) • [Wu et al.] apply pre-defined rules to solve the frequent-shot-change problem - pros – can solve part of the shot-change problem - cons – does not involve a visual storytelling director to instruct the video mashup process Wu et al., MoVieUp: Automatic mobile video mashup, IEEE TCSVT, 2015. 15
  16. 16. Literature related to video mashup (3/3) • [Saini et al.] introduce visual storytelling rules by dividing the audience seats into six shooting locations and then calculating statistics of shot transitions and lengths from professionally edited videos - pros – a good start that introduces the views of professional experts - cons – the shot types are defined by the authors themselves, not by the rules of the language of film Saini et al., MoViMash: Online mobile video mashup, ACM MM, 2012. 16
  17. 17. Introduction • An experienced movie director frequently uses camera-work practices in visual storytelling. Intro Verse Verse Chorus Chorus Bridge Bridge . . . 16
  18. 18. Introduction • Applications – Mashup – Emotion (music video) 18
  19. 19. Introduction • According to the language of film [3], shot size is one of the basics of filmmaking. 19 Long Shot Close-Up
  20. 20. Introduction 20 • The definition of six types of shots [3].
  21. 21. Introduction • According to the definitions in the language of film [3], a concert video contains eight types of camera shots. 20 Musical Instrument Shot (MIS), Audience Shot (ADS)
  22. 22. INTRODUCTION • Two images from an official concert video of the song “93 Million Miles” by Jason Mraz, live in Hong Kong, 2012. 22
  23. 23. System Framework for Video Mashup 23
  24. 24. Shot Classification based on EW-Deep-CCM • Error-Weighted Deep Cross-Correlation Model 24
  25. 25. Object Representation (VGG-Net) • Object representation using a 16-layer VGG-Net • We extract features from the output layer and the two fully-connected layers as the object representations; the feature dimensions are 1000-D, 4096-D, and 4096-D, respectively. 25
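For illustration, a minimal sketch of extracting these three representations with torchvision's VGG-16; the exact framework and preprocessing used in the talk are not specified, so the details below are assumptions:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def multilayer_representations(image_path):
    """Return the 4096-D fc6, 4096-D fc7, and 1000-D output-layer features."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        flat = torch.flatten(vgg.avgpool(vgg.features(x)), 1)    # conv features, flattened
        fc6 = vgg.classifier[0](flat)                            # first FC layer, 4096-D
        fc7 = vgg.classifier[3](torch.relu(fc6))                 # second FC layer, 4096-D
        out = vgg.classifier[6](torch.relu(fc7)).softmax(dim=1)  # 1000-D ImageNet posterior
    return fc6.squeeze(0), fc7.squeeze(0), out.squeeze(0)
```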
  26. 26. Object Representations (1/2) ImageNet1000 object representation 26
  27. 27. Object Representations (2/2) 27
  28. 28. Literature related to fusion strategies • Early fusion – Pros: Takes advantage of combining various feature cues – Cons: A high-dimensional feature set may suffer from data sparseness and strain computational resources. 28
  29. 29. Literature related to fusion strategies • Late fusion – Pros: Does not increase dimensionality; allows interpreting the performance of different classifiers and gaining insight into the role of multiple modalities during emotional expression – Cons: The assumption of conditional independence among the multiple modalities is inappropriate. 29
  30. 30. Shot Classification based on EW-Deep-CCM • A novel fusion strategy named Error Weighted Deep Cross-Correlation Model (EW-Deep-CCM) is proposed to effectively combine the extracted multilayer object representations. 30
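The full EW-Deep-CCM additionally models the cross-correlation between the decisions of the layer-specific classifiers; as a much simpler, hedged illustration of the error-weighted late-fusion idea only, one could weight each classifier's posteriors by its validation accuracy:

```python
import numpy as np

def error_weighted_fusion(posteriors, val_accuracies):
    """Fuse per-representation classifier posteriors.

    posteriors: list of (N, n_shot_types) arrays, one per object representation.
    val_accuracies: per-classifier validation accuracy, used as a crude empirical weight.
    """
    w = np.asarray(val_accuracies, dtype=float)
    w = w / w.sum()                                   # normalize the empirical weights
    fused = sum(wi * p for wi, p in zip(w, posteriors))
    return fused.argmax(axis=1)                       # fused shot-type decision per sample
```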
  31. 31. Experimental Results • Comparison of shot-type classification with other methods 31
  32. 32. • EW-Deep-CCM only achieves an 83% detection rate • 17% of errors remain, i.e., roughly a 1/6 error rate, which causes frequent shot changes 32
  33. 33. 17% error rate causes too many shot changes 31
  34. 34. Conditional Random Field (CRF)-based Approach • 1st trial: 30-frame fixed window size (not a systematic way to smooth the results) • 2nd trial: Recurrent Neural Network (RNN) -- Problem: an RNN needs pre-segmented data to derive its best results, but the shot-type classification results are not well segmented • 3rd trial: Conditional Random Field (CRF) 34
  35. 35. OUR METHOD – Coherent-Net Shot Type Refinement (CRF) 35
  36. 36. OUR METHOD – Coherent-Net Framework • Shot Type Refinement (CRF): P(w | O) = Σ_{w'} P(w, w' | O) ≈ P(w | w') · P(w' | O) ≈ P(w | w') · ∏_{n=1..N} P(w'_n | o_n), where P(w | w') is modeled by the CRF and P(w' | O) by EW-Deep-CCM. 36
  37. 37. (EW-Deep-CCM) Likelihood (DNN posterior probability), cross-correlation model, and empirical weights: P(w' | o) ≈ Σ_{i=1..C} Σ_{j=1..D} Σ_{k=1..K} α_ij β_k · P(w' | w_i^out, w_j^fc) · P(w_i^out | o, Λ_k^out) · P(w_j^fc | o, Λ_k^fc), combining the posteriors of the output-layer and fully-connected-layer classifiers through a cross-correlation term weighted by the empirical error weights α_ij and β_k; the resulting P(w' | O) is passed to the Shot Type Refinement (CRF) stage. 37
  38. 38. (CRF) Let w = (w_1, w_2, ..., w_t) be the refined shot-type sequence and w' = (w'_1, ..., w'_t) the observation sequence from EW-Deep-CCM. Then
P(w | w') = (1/Z(w')) exp[ Σ_j F_j(w, w') ], with Z(w') = Σ_w exp[ Σ_j F_j(w, w') ]
∝ (1/Z(w')) exp[ Σ_j Σ_t λ_j t_j(w_{t-1}, w_t, w') + Σ_j Σ_t μ_j s_j(w_t, w') ]
∝ (1/Z(w')) exp[ Σ_{m,n ∈ S} Σ_t λ_mn 1{w_t = m} 1{w_{t-1} = n} + Σ_{m ∈ S} Σ_{o ∈ O} Σ_t μ_om 1{w_t = m} 1{w'_t = o} ],
where the state-transition feature (pairwise potential) t_j(w_{t-1}, w_t, w') = 1 when w_t = m and w_{t-1} = n, 0 otherwise, and the state-observation feature (unary potential) s_j(w_t, w') = 1 when w_t = m and w'_t = o, 0 otherwise. 38
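The decoding step itself is not spelled out in the slides; as a hedged stand-in for the CRF above, the sketch below smooths the per-frame EW-Deep-CCM posteriors with first-order Viterbi decoding over learned initial and transition probabilities (an HMM-style approximation, not the exact Coherent-Net CRF):

```python
import numpy as np

def viterbi_smooth(frame_posteriors, log_trans, log_init):
    """Smooth a shot-type sequence.

    frame_posteriors: (T, S) per-frame shot-type posteriors (e.g. from EW-Deep-CCM).
    log_trans: (S, S) log transition probabilities; log_init: (S,) log initial probabilities.
    Returns the most probable smoothed shot-type sequence.
    """
    T, S = frame_posteriors.shape
    log_obs = np.log(frame_posteriors + 1e-12)
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # (prev state, current state)
        back[t] = scores.argmax(axis=0)           # best previous state per current state
        delta = scores.max(axis=0) + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):                # backtrace the best path
        path[t] = back[t + 1, path[t + 1]]
    return path
```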
  39. 39. EXPERIMENTS – Official Demo 1 • the song “Skyfall” by Adele, performed at the 2013 Oscars 39
  40. 40. EXPERIMENTS – Official Demo 2 • the song “When I Was Your Man” by Bruno Mars, performed at BBC Radio 1's Big Weekend 2013 40
  41. 41. System Framework for Video Mashup 41
  42. 42. Problem & Goal • A concert video mashup process needs to align the videos taken by variant audiences into a common timeline. 42
  43. 43. Literature Review • Audio fingerprinting • Problems – Originally designed for the problem of audio identification rather than time alignment. – Easily affected by audio signal distortion • Zhu et al. treat audio identification as an image-matching problem (significant performance improvement). B. Zhu et al., “A novel audio fingerprinting method robust to time scale modification and pitch shifting,” ACM MM, 2010. 43
  44. 44. Our Method • We modified Zhu's method to address the multiple-audio-sequence alignment problem. – Auditory image (spectrogram) construction: the 1-D audio signal (waveform) is converted by a short-time Fourier transform into a 2-D auditory image, i.e., a time-frequency representation (spectrogram). 44
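A minimal sketch of this spectrogram construction with SciPy; the window length and overlap are illustrative assumptions, not the talk's parameters:

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile

def auditory_image(wav_path, nperseg=1024, noverlap=512):
    """Convert a 1-D waveform into a 2-D log-power spectrogram image."""
    sr, audio = wavfile.read(wav_path)
    if audio.ndim > 1:                                # mix stereo down to mono
        audio = audio.mean(axis=1)
    f, t, Sxx = signal.spectrogram(audio.astype(float), fs=sr,
                                   nperseg=nperseg, noverlap=noverlap)
    img = 10 * np.log10(Sxx + 1e-10)                  # log-power scale
    # normalize to 8-bit so it can be treated as a grayscale image
    img = (255 * (img - img.min()) / (np.ptp(img) + 1e-10)).astype(np.uint8)
    return img
```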
  45. 45. Our Method – Audio Sequence Alignment (1) Boundary candidate selection (based on SIFT alignment): BC = Yes if D(a, b) < c · D(a, b'), No otherwise, where a is a SIFT feature in audio sequence A, b is the closest feature to a in B, b' is the second-closest feature to a in B, D(·) is the Euclidean distance, and c is a constant (c = 0.7). Yellow lines are boundary candidates. 45
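A hedged sketch of this boundary-candidate selection with OpenCV SIFT and the c = 0.7 ratio test above, treating the two auditory images as grayscale images:

```python
import cv2

def boundary_candidates(img_a, img_b, c=0.7):
    """Keep a match only if D(a, b) < c * D(a, b') (Lowe-style ratio test)."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)              # Euclidean distance D(.)
    candidates = []
    for pair in matcher.knnMatch(des_a, des_b, k=2):
        if len(pair) < 2:
            continue
        m, n = pair                                   # closest and second-closest features in B
        if m.distance < c * n.distance:
            candidates.append((kp_a[m.queryIdx].pt, kp_b[m.trainIdx].pt))
    return candidates
```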
  46. 46. Our Method – Audio Sequence Alignment (2) Boundary candidate refinement. - A window distortion measure (WDM) is defined for each boundary candidate. 46
  47. 47. Our Method – Audio Sequence Alignment (3) Final boundary decision. - The alignment result is determined by the refined boundary candidate with the minimum window distortion. 47
  48. 48. DEMO 1 • “I’m Yours” by Jason Mraz, live in Singapore, 2012 – with context search (aligned in 49.8001 s) 48 Timeline: 00:00:00 – 00:00:49.8001; Recording #4 / Recording #5, offset +0.4334 s
  49. 49. DEMO 2 • “All I Ask” by Adele, live at Birmingham Genting Arena, 2016 – with context search (aligned in 53.2169 s) 49 Timeline: 00:00:00 – 00:00:53.2169; Recording #1 / Recording #2, offset +0.5502 s
  50. 50. Demo - Multiple Audio Sequence Alignment Result • Audience #1, Audience #2, and Audience #3 recordings aligned onto a common timeline (timeline marks: 00:00:00, 00:00:52.4893, 04:00:2277, 00:00:52.7667, 03:58:8667) 50
  51. 51. Learning Professional Recording Skill • Initial prob., duration (frames/shot), and shot-transition prob. statistics are fed into the Shot Type Refinement (CRF) stage of Coherent-Net 51
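These statistics can be estimated from per-frame shot-type labels of professionally edited concert videos; a minimal sketch, assuming such labeled sequences are available:

```python
import numpy as np

def learn_recording_style(shot_sequences, n_types):
    """Estimate initial probabilities, shot-transition probabilities, and
    per-shot durations (in frames) from per-frame shot-type label sequences."""
    init = np.zeros(n_types)
    trans = np.zeros((n_types, n_types))
    durations = []                                # (shot type, length in frames) pairs
    for seq in shot_sequences:
        init[seq[0]] += 1
        run = 1
        for prev, cur in zip(seq, seq[1:]):
            trans[prev, cur] += 1
            if cur == prev:
                run += 1
            else:
                durations.append((prev, run))     # shot boundary: close the current run
                run = 1
        durations.append((seq[-1], run))          # close the last run of the sequence
    init /= init.sum()
    trans /= trans.sum(axis=1, keepdims=True) + 1e-12
    return init, trans, durations
```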
  52. 52. System Framework for Video Mashup 52
  53. 53. Demo - Mashup Result 53 mr#1 mr#2 mr#3
  54. 54. 1 Spatio-Temporal Learning of Basketball Offensive Strategies
  55. 55. Motivations • To develop an automatic tactics-analysis tool for coaches, players, and the general public. • To develop a new technique that can compete with existing tools, such as SportVU, but at a much lower price 55
  56. 56. Methodology Adopted • To analyze group behavior directly from the court view of an NBA broadcast video • Detect and track each offensive player, compute their trajectories, and map these trajectories from the court view to a tactic board for analysis 56
  57. 57. Motivation (1) 57
  58. 58. Motivation (2) 58
  59. 59. Motivation (3) • Unknown Offense Video Clip 90% → Screen Cut 10% → Princeton
  60. 60. 60 • 6 cameras above the court • No close-up view → Unable to see the details of plays
  61. 61. SportVU videos → SportVU system → tracked data; broadcast videos → our tracking system → tracked data ?
  62. 62. Extracting features from an offense video clip ? • Automatic player detection • Automatic player tracking • Map extracted trajectories from basketball court to tactic board 62
  63. 63. Step 2: Derive correct player trajectories on the panorama court (3/3) 63
  64. 64. Step 3: Map trajectories from the panorama court to the tactic board 64
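The mapping from the panorama court to the tactic board can be expressed as a planar homography; a hedged sketch with OpenCV, using hypothetical landmark correspondences (real values would come from court-line detection or manual annotation):

```python
import numpy as np
import cv2

# Hypothetical correspondences between court landmarks in the panorama image
# (pixel coordinates) and the tactic board (board coordinates).
court_pts = np.float32([[120, 80], [860, 80], [860, 520], [120, 520]])
board_pts = np.float32([[0, 0], [470, 0], [470, 500], [0, 500]])

H, _ = cv2.findHomography(court_pts, board_pts)   # planar homography court -> board

def map_trajectory_to_board(traj_xy):
    """Project an (L, 2) array of player positions onto the tactic board."""
    pts = np.float32(traj_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```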
  65. 65. What’s next ? – Tactics analysis based on the spatiotemporal trajectories of the 5 offensive players 65
  66. 66. A Two-Stage Unsupervised Clustering for Tactic Analysis • Stage 1: Unsupervised clustering of all available tactics based on their mutual distances • Stage 2: Unsupervised clustering of the tactics assigned to the same cluster in Stage 1 (to separate the role of each offensive player) 66
  67. 67. What techniques are needed ? • A spatiotemporal model that can describe the group behavior of the 5 offensive players • Automatic clustering of group behaviors (screen-cut, Princeton, wing-wheel, etc.) • A representation of each group behavior • An appropriate metric to calculate the distance between two arbitrary tactics. 67
  68. 68. Trajectory Set Representation • S: the spatiotemporal matrix • P_ij = (x_ij, y_ij): 2-D coordinate of the j-th player in the i-th frame • V_j = [P_1j P_2j ... P_Lj]^T • S = [V_1 V_2 V_3 V_4 V_5 (V_6)]
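A minimal sketch of assembling S from per-player position lists; the exact array layout used in the talk is an assumption here:

```python
import numpy as np

def build_trajectory_set(player_positions):
    """player_positions: {player_id: [(x, y), ...]} over the same L frames.
    Returns an (L, J, 2) array whose j-th column V_j stacks P_1j ... P_Lj."""
    cols = [np.asarray(player_positions[pid], dtype=float)[:, None, :]
            for pid in sorted(player_positions)]
    return np.concatenate(cols, axis=1)
```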
  69. 69. Distance Measure of Trajectory Set • Problems • Different time durations between 2 clips • Ordering of column vectors
  70. 70. Trajectory Set Distance Matrix S1=[V1 V2 V3 V4 V5] S2=[U1 U2 U3 U4 U5]
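One hedged way to address both problems is to combine DTW (for the differing clip durations) with an optimal player assignment (for the ordering of the column vectors); this is an illustrative distance, not necessarily the exact measure used in the talk:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def dtw_distance(a, b):
    """DTW distance between two (L, 2) trajectories of possibly different lengths."""
    cost = cdist(a, b)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1] / (len(a) + len(b))          # length-normalized alignment cost

def trajectory_set_distance(S1, S2):
    """S1, S2: lists of (L, 2) trajectories. DTW absorbs the different durations;
    an optimal player assignment absorbs the column ordering."""
    pairwise = np.array([[dtw_distance(v, u) for u in S2] for v in S1])
    rows, cols = linear_sum_assignment(pairwise)
    return pairwise[rows, cols].mean()
```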
  71. 71. Clustering by Dominant Sets • M. Pavan and M. Pelillo, “Dominant Sets and Pairwise Clustering,” IEEE TPAMI, 2007. Tactic 1 Tactic 2 Tactic 3
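Dominant sets can be extracted from a pairwise affinity matrix with replicator dynamics; a minimal sketch, where the affinity construction (e.g. A = exp(-distance / sigma) with a zero diagonal) is an assumption:

```python
import numpy as np

def dominant_set_weights(A, max_iter=1000, tol=1e-6):
    """Replicator dynamics on a symmetric affinity matrix A.
    Items with large weights in the returned vector form one tactic cluster."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)                       # start from the uniform distribution
    for _ in range(max_iter):
        x_new = x * (A @ x)                       # replicator update
        s = x_new.sum()
        if s <= 0:
            break
        x_new /= s
        if np.abs(x_new - x).sum() < tol:         # converged
            x = x_new
            break
        x = x_new
    return x
```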
  72. 72. Second stage: how to model an offense strategy ? • 8 different trajectory sets of the right hawk play, each consisting of 5 trajectories generated by the 5 offensive players
  73. 73. Clustering by Trajectory Distance • Based on the distance between trajectories, one can separate each group of tactics into five groups of trajectories, each corresponding to a role (an offensive player) Hawk Wing Wheel Princeton
  74. 74. Temporal Alignment • For each role, we use the velocities along the x- and y-directions, respectively, to model it (DTW is used to solve the alignment problem)
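A small sketch of this per-role velocity representation; two such profiles can then be aligned with DTW as in the earlier trajectory-distance sketch:

```python
import numpy as np

def role_velocity_profile(traj):
    """Represent a role by its frame-to-frame velocities along x and y."""
    traj = np.asarray(traj, dtype=float)
    return np.diff(traj, axis=0)        # (L - 1, 2): [vx, vy] per frame
```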
  75. 75. The Built Model
  76. 76. Demo _ Classification Hawk template
  77. 77. Demo _ Classification Princeton template
  78. 78. Demo _ Classification Wing wheel template
  79. 79. Thank you very much for listening 79
