
Multimodal pattern matching algorithms and applications


In this presentation I focus on three projects I have been working on in the last year. The first is a novel pattern matching algorithm based on the well-known Dynamic Time Warping (DTW). The algorithm can find real-valued subsequences within a longer sequence without prior knowledge of their start and end points. I have applied it to the task of acoustic matching, for which I will show some preliminary results. I will then explain a second DTW-based algorithm, this one able to perform an online alignment of two musical pieces. One of the pieces can be input live or retrieved from an audio file, while the second is extracted from an online music video. The online alignment allows the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding duplicate video segments within a large database. I will explain our multimodal approach, based on audio-visual change-based features.

Published in: Technology


  1. Multimodal pattern matching algorithms and applications
     Xavier Anguera, Telefonica Research
  2. Outline
     • Introduction
     • Partial sequence matching
       – U-DTW algorithm
     • Music/video online synchronization
       – MuViSync prototype
     • Video copy detection
  3. Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm
     Xavier Anguera, Robert Macrae and Nuria Oliver
     Telefonica Research, Barcelona, Spain
  4. Proposed challenge
     • Given one or several audio signals, we want to find and align recurring acoustic patterns.
  5. Proposed challenge
     • We could use the ASR/phonetic output and search for symbol repetitions
       PROS:
       – It is easy to apply; the ASR takes care of any time warping
       CONS:
       – ASR is language dependent and requires training
       – We introduce additional sources of error (acoustic conditions, OOVs)
       – It can be very slow and not embeddable
     • Automatic motif discovery directly in the speech signal
       – Training-free, language independent and resilient to some noises
     [Diagram: symbolic alignment via ASR/phonetization vs. direct acoustic alignment, both producing alignment locations and scores]
  6. Areas of application
     • Improve ASR by disambiguation over several repetitions (Park and Glass, 2005)
     • Pattern-based speech recognition – flat modelling (Zweig and Nguyen, 2010)
     • Acoustic summarization (Muscariello, 2009)
     • Musical structure analysis (Müller, 2007)
     • Server-less mobile voice search (Anguera, 2010)
  7. Automatic motif discovery
     • The goal is to avoid going to text and therefore be more robust to errors
     • Good deal of applicable work in this area:
       – Biomedicine, in matching DNA sequences (converting the speech signals into symbol strings)
       – Directly from real-valued multidimensional samples using DTW-like algorithms
         • Müller'07, Muscariello'09, Park'05, Zweig'10
         • Most need to compute the full cost matrix a priori
  8. Dynamic Time Warping (DTW)
     • The DTW algorithm allows the computation of the optimal alignment between two time series X_U, X_V ∈ Φ^D:
       X_U = (u_1, ..., u_m, ..., u_M)
       X_V = (v_1, ..., v_n, ..., v_N)
     Image by Daniel Lemire
  9. Dynamic Time Warping (II)
     • The optimal alignment can be found in O(MN) complexity using dynamic programming.
     • We need to define a cost function between any two elements in the series and build a distance matrix:
       d : Φ^D × Φ^D → ℝ≥0
       where usually d(m,n) = ||u_m − v_n|| (Euclidean distance)
     • Warping function: F = c(1), ..., c(K), where c(k) = (i(k), j(k))
     Image by Tsanko Dyustabanov
  10. Warping constraints
     For speech signals some constraints are usually applied to the warping function F:
     – Monotonicity: i(k−1) ≤ i(k), j(k−1) ≤ j(k)
     – Continuity (i.e. local constraints): i(k) − i(k−1) ≤ 1, j(k) − j(k−1) ≤ 1
     The resulting recurrence, with predecessors (m−1, n), (m, n−1) and (m−1, n−1), is:
       D(m,n) = min{ D(m−1,n), D(m,n−1), D(m−1,n−1) } + d(u_m, v_n)
     Sakoe, H. and Chiba, S. (1978), Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, 43-49.
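To make the recurrence concrete, it can be sketched in a few lines of Python (a minimal sketch with scalar features and an absolute-difference cost; the function name is mine, not from the slides):

```python
def dtw(x, y):
    """Classic DTW with the local constraints described above.

    x, y: sequences of numbers (1-D features, for simplicity).
    Returns the accumulated cost D(M, N) of the optimal alignment.
    """
    M, N = len(x), len(y)
    INF = float("inf")
    # D[m][n] holds the accumulated cost of aligning x[:m] with y[:n]
    D = [[INF] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            d = abs(x[m - 1] - y[n - 1])  # local distance d(u_m, v_n)
            D[m][n] = d + min(D[m - 1][n],      # vertical predecessor
                              D[m][n - 1],      # horizontal predecessor
                              D[m - 1][n - 1])  # diagonal predecessor
    return D[M][N]
```

Identical series align along the diagonal at zero cost, so `dtw([1, 2, 3], [1, 2, 3])` returns `0.0`.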
  11. Warping constraints (II)
     – Boundary condition: i(1) = 1, j(1) = 1, i(K) = M, j(K) = N
       i.e. DTW needs prior knowledge of the start-end alignment points.
     – Global constraints
     Image from Keogh and Ratanamahatana
  12. DTW Dynamic Programming
  13. DTW Dynamic Programming
  14. DTW Dynamic Programming
  15. DTW Dynamic Programming
  16. DTW main problem
     • The boundary condition constrains the time series to be aligned from start to end
       – We need a modification to DTW to allow common pattern discovery in reference and query signals regardless of the sequences' other content
  17. Alternative proposals
     • Meinard Müller's path extraction for music
       – Needs to pre-compute the complete cost matrix.
     • Alex Park's Segmental DTW
       – Needs to pre-compute the complete cost matrix; very computationally expensive afterwards.
     • Armando Muscariello's word discovery algorithm
       – Searches for patterns locally, does not check all possible starting points.
     [1] M. Müller, "Information Retrieval for Music and Motion", Springer, New York, USA, 2007.
     [2] A. Park et al., "Towards unsupervised pattern discovery in speech," in Proc. ASRU'05, Puerto Rico, 2005.
     [3] A. Muscariello et al., "Audio keyword extraction by unsupervised word discovery," in Proc. INTERSPEECH'09, 2009.
  18. Unbounded-DTW algorithm
     • U-DTW is a modification to DTW that is fast and accurate in finding recurring patterns
     • We call it unbounded because:
       – The start-end positions of both segments are not constrained
       – Multiple matching segments can be found with a single pass of the algorithm
       – It minimizes the computational cost of comparing two multidimensional time series
  19. U-DTW cost function and matching length
     • Given two sequences to be matched, U = (u_1, u_2, ..., u_M) and V = (v_1, v_2, ..., v_N), we use the inner product similarity:
       s(m,n) = cos θ = ⟨u_m, v_n⟩ / (||u_m|| ||v_n||)
     Values range in [−1, 1]; the higher, the closer.
     • We look for matching sequences with a minimum length Lmin (set at 400 ms in our experiments)
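The inner product (cosine) similarity above is straightforward to sketch in pure Python (the function name is mine):

```python
import math

def cosine_similarity(u, v):
    """Inner-product similarity s(m, n) between two feature frames.

    Returns a value in [-1, 1]; higher means the frames are closer.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```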
  20. U-DTW global/local constraints
     • No global constraints are applied, in order to allow matching of any segment within both sequences
     • Local constraints are set to allow warping up to 2x, with predecessors (m−2, n−1), (m−1, n−1) and (m−1, n−2):
       D(m,n) = max{ D(m−2,n−1), D(m−1,n−1), D(m−1,n−2) } + s(u_m, v_n)
  21. U-DTW computational savings
     • Computational savings are achieved thanks to:
     1. We sample the distance/similarity matrix at certain possible matching start points (setting synchronization points)
     2. Dynamic programming is done forward, pruning out low-similarity paths
  22. Synchronization points
     • Only certain (m,n) positions are analyzed in the matrix for possible matching segments
       – Selected so as not to lose any matching segment
       – Optimize the computational cost
     • Two methods are followed: horizontal and diagonal bands
     [Diagram: horizontal bands of height τh spaced λ apart, and diagonal bands at π/4 spaced τd apart, over the U-V similarity matrix]
  23. U-DTW Dynamic Programming
  24. Forward dynamic programming
     • For each position (m,n), 3 possible forward paths are considered: (m+1, n+2), (m+1, n+1) and (m+2, n+1)
     • The forward path is extended IFF:
       – Its normalized global similarity is above a pruning threshold:
         S(m′,n′) = (D(m,n) + s(m′,n′)) / (M(m,n) + 1) ≥ Thr_prun
       – S(m′,n′) is greater than that of any previous path at that location
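The extension rule can be sketched as a small helper. This is a sketch only: the argument names and the bookkeeping of path lengths M(m,n) are my assumptions, not from the slides.

```python
def extend_path(D_mn, path_len, s_next, thr_prun, best_at_target):
    """Decide whether a forward U-DTW path may be extended to (m', n').

    D_mn:           accumulated similarity of the path up to (m, n)
    path_len:       number of steps in the path so far, M(m, n)
    s_next:         local similarity s(m', n') at the candidate position
    thr_prun:       pruning threshold on the normalized similarity
    best_at_target: best normalized similarity already recorded at
                    (m', n'), or None if that cell has not been reached

    Returns the new normalized similarity if the path survives, else None.
    """
    S = (D_mn + s_next) / (path_len + 1)  # length-normalized similarity
    if S < thr_prun:
        return None  # pruned: path is not similar enough
    if best_at_target is not None and S <= best_at_target:
        return None  # a better path already reached that cell
    return S
```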
  25. U-DTW Dynamic Programming
  26. U-DTW Dynamic Programming
  27. Backward path algorithm
     • When a possible matching segment is found in the forward path, the same is done backwards, starting from the originating SP position.
     Predecessors: (m−2, n−1), (m−1, n−1), (m−1, n−2)
     The same procedure is followed as in the forward path
  28. U-DTW Dynamic Programming
  29. U-DTW Dynamic Programming
  30. Computational savings example
     [Example: matching two utterances of the word "Barcelona"]
  31. Experimental setup
     • We asked 23 people to record 47 words from 6 categories (Monuments, Family, Events, Cities, People, Nature), 5 iterations each:
       X_{U,V}[n,i], i = 1...5, j = 1...47
     • Simple energy-based trimming eliminates non-speech regions
     • We simulate acoustic context by attaching different start-end audio sequences to X_{u,v}.
  32. Experimental setup (II)
     • Signals are parameterized with 10 MFCCs every 10 ms
     • Each word X_U is compared to all words X_V from the same speaker (234 comparisons) and the closest one is retrieved:
       argmin_{m,j} D(X_U[n,i], X_V[m,j]), with (n,i) ≠ (m,j)
     We get a hit if m = n, a miss otherwise
     • Tests were performed on an Ubuntu Linux PC @ 2.4 GHz.
  33. Comparing systems
     • Standard DTW
       – Compare the sequences without any added acoustic context (i.e. prior knowledge of start-end points)
     • Segmental DTW (Park and Glass, 2005)
       – Minimum segment length of 500 ms
       – Band size of 70 ms, 50% overlap
       – Used 2 distances: Euclidean and 1 − inner product
  34. Performance evaluation
     Used metrics:
     – Accuracy: percentage of words correctly matched (X_U and X_V are different iterations of the same word):
       Acc = (Σ correct matches / # all matches) · 100
     – Average processing time per sequence pair (X_U, X_V), excluding parameterization:
       Time = Σ time(D(X_U[n,i], X_V[m,j])) / # matches
     – Average ratio of frame-pair distances computed within each sequence-pair cost matrix:
       Ratio = Σ computed(d(X_U[n,i], X_V[m,j])) / (M·N)
  35. Results
     Algorithm                     | Accuracy | Avg. time | Ratio
     Segmental DTW w/ Eucl.        | 80.61%   | 82.7 ms   | 1
     Segmental DTW w/ inner prod.  | 74.62%   | 86.7 ms   | 1
     U-DTW horiz. bands            | 89.53%   | 10.6 ms   | 0.51
     U-DTW diag. bands             | 89.34%   | 9.0 ms    | 0.42
     Standard DTW                  | 95.42%   | 0.6 ms    | 1
  36. Effect of the Cutout Threshold
  37. Conclusions and future work
     • We propose a novel algorithm called U-DTW for unconstrained pattern discovery in speech
     • We show it is faster and more accurate than existing alternatives
     • We are starting to test the algorithm for unrestricted audio summarization
  38. MuViSync: AudioVisual Music Synchronization
     Xavier Anguera, Robert Macrae and Nuria Oliver
  39. People enjoy listening to their favorite music everywhere...
     ...at home, ...
     ...on the go, ...
     ...or at a party with friends
  40. Users increasingly have a personal mp3 music collection...
     ...but it usually contains 'only' music.
     What if you could watch the video clip of any of your songs while listening to it?
  41. You could go to sites like YouTube...
     ...but the audio quality is much worse than in your mp3...
     What if you could listen to your high quality mp3 music while watching the video clips?
  42. MuViSync: Music and Video Synchronization system
     MuViSync synchronizes audio and video from two different sources and plays them together in sync
     [Diagram: a streaming video clip and local personal music both feed into MuViSync]
  43. Application scenarios
     • Watch your favorite music on TV
       – Personal music synchronization with video clips, either local or streamed
     • Watch your music on your iPhone
       – Personal music synchronization by streaming the video to the iPhone
     • Identify and watch any music
       – Combined with songID technology, either at home or on the go.
  44. MuViSync application
     • We have developed a prototype application for Windows/Mac, and soon for iPhone.
  45. Alignment algorithm requirements
     • Perform an alignment between the mp3 music and the video's audio track
     • Initially, only partial knowledge is available from both sources (live recording or buffering)
     • The alignment has to be done online and in real time
     • Emphasis is needed on user satisfaction when playing the video.
  46. Application testbed
     • We use 320 music videos (YouTube) + their corresponding mp3 files
     • A supervised ground-truth alignment was performed using offline DTW and checking for consistency
     • Audio is processed every 100 ms (200 ms window) and chroma features are extracted
  47. MuViSync online alignment algorithm
     1. Initial path discovery
        – Both signals (audio and video) are buffered, features are extracted and an initial alignment is found
     2. Real-time online alignment
        – An incremental alignment is computed
     3. Alignment post-processing to ensure a smooth playback of the aligned video.
     [Diagram: feature extraction from both streams feeds 1) initial path discovery and then 2) real-time alignment]
  48. Initial path discovery (online mp3 playback + video buffering)
     [Diagram: sync request on the mp3 audio timeline; video buffering end on the timeline of the audio available from the video]
  49. Initial path discovery
     • A segment of the audio and the buffered video are checked for alignment using forward DTW
     • The global similarity D(m,n) at each location (m,n) is normalized by the length of the optimum path to that location
     • At each step, all paths with D′(m,n) < Dave(*,n) are pruned.
     • The initial alignment is selected when only one path survives or the sync time is reached.
  50. Initial path discovery
     [Diagram: audio being played from the mp3 vs. audio available from the video, with an alignment buffer of about 1 s]
  51. Initial path discovery
     [Diagram: audio being played from the mp3 vs. audio available from the video]
  52. Initial path discovery
     [Diagram: audio being played from the mp3 vs. audio available from the video]
  53. Initial path discovery
     [Diagram: audio being played from the mp3 vs. audio available from the video]
  54. Real-time online alignment
     • Starting from the initial alignment we iteratively compute:
     1. The locally optimum forward path for L steps, p1...pL, using local constraints a) (no dynamic programming)
     2. Backward (standard) DTW from pL to p1 using local constraints b)
     3. Add the initial L/2 steps to the final path, and restart 1) from pL/2 until the playback ends
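The iteration above can be sketched as follows. This is a simplified sketch: step 2 (the backward-DTW refinement) is omitted, and `sim` and all function names are my assumptions, not from the slides.

```python
def greedy_forward(sim, start, L):
    """Step 1: follow the locally best of the three forward moves for L steps.

    sim(m, n) returns the local similarity of mp3 frame m and video frame n.
    No dynamic programming: at each step, the single best move is taken.
    """
    path = [start]
    m, n = start
    for _ in range(L):
        # the three forward moves allowed by the local constraints
        moves = [(m + 1, n + 2), (m + 1, n + 1), (m + 2, n + 1)]
        m, n = max(moves, key=lambda p: sim(*p))
        path.append((m, n))
    return path

def online_alignment(sim, start, L, total_steps):
    """Steps 1 and 3 iterated: compute a greedy forward path, commit only
    its first L//2 points, then restart from the midpoint, until enough
    of the alignment path has been produced."""
    final_path = []
    anchor = start
    while len(final_path) < total_steps:
        segment = greedy_forward(sim, anchor, L)
        final_path.extend(segment[:L // 2])  # commit the first half only
        anchor = segment[L // 2]             # restart from the midpoint
    return final_path
```

With a similarity that favors the diagonal, e.g. `sim = lambda m, n: -abs(m - n)`, the committed path stays on m = n, as expected.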
  55. Real-time online alignment
     [Diagram: audio being played from the mp3 vs. audio available from the video]
  56. Real-time online alignment
     [Diagram: 1) forward locally best path from p1 to pL, with L = 8]
  57. Real-time online alignment
     [Diagram: 2) standard DTW backwards from pL to p1]
  58. Real-time online alignment
     [Diagram: 3) the new starting point p1 is moved forward]
  59. Alignment postprocessing
     • Alignment estimates every 100 ms are not enough to drive 25/30 fps video
     • Interpolation of the points + averaging over 5 seconds gives the projection estimate for the current playback
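A minimal sketch of the interpolation step: linear interpolation between sparse (audio time, video time) anchor pairs so that a 25/30 fps player can query the mapping at every frame. The 5-second averaging is omitted and all names are my assumptions.

```python
def video_time(audio_t, anchors):
    """Linearly interpolate sparse alignment estimates to any playback time.

    anchors: list of (audio_time, video_time) pairs, one every ~100 ms,
             sorted by audio_time.
    audio_t: current mp3 playback time, in seconds.
    Returns the video time to display at audio_t.
    """
    if audio_t <= anchors[0][0]:
        return anchors[0][1]
    for (a0, v0), (a1, v1) in zip(anchors, anchors[1:]):
        if a0 <= audio_t <= a1:
            frac = (audio_t - a0) / (a1 - a0)  # position within the interval
            return v0 + frac * (v1 - v0)
    return anchors[-1][1]  # past the last estimate: hold the final value
```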
  60. Experiments
     • We use 320 videos + mp3s, aligned using offline DTW and manually checked for consistency.
     • Accuracy is computed as the % of songs with an average error below a given number of ms.
     [Plot: average accuracy @100 ms for different video buffer lengths]
  61. Experiments
  62. Video Duplicate Detection
     Xavier Anguera and Pere Obrador
  63. Let's say you're looking for the Bush attack video...
  64. ...and you get 11,100 results.
  65. ...after 40 minutes watching many of the videos returned, you notice that many are similar, i.e. near duplicates:
     27% on average in YouTube [Wu et al., 2007]
     12% on average in YouTube [Anguera et al., 2009]
  66. Near duplicate (NDVC) definition
     • Identical or approximately identical videos that differ in some feature:
       – file formats, encoding parameters
       – photometric variations (color, lighting changes)
       – overlays (caption, logo, audio commentary)
       – editing operations (frames added/removed)
       – semantic similarity
     NDVCs are videos that are "essentially the same"
  67. Near duplicates (NDVC) vs. video copies
     • These two concepts are not clearly discriminated in the literature.
     • Video copy: an exact video segment, with some transformations applied to it
     • Near duplicate: similar videos on the same topic (different viewpoints, semantically similar videos, ...)
     In our research we approach video copy detection
  68. Examples of video copies
  69. Use scenarios: copyright law enforcement
     Detection of copyright-infringing videos on online video sharing sites
     In a recent study we found that on average 12% of search results in YouTube are copies of the same video
  70. Use scenarios: video forensics for illegal activities
     Discover illegal content hidden within other videos
     Currently, police forces usually have to manually scroll through ALL materials in pederasty cases searching for evidence.
  71. Use scenarios: database management
     Video excerpts used several times
     Database management/optimization and helping in searches over historic contents
  72. Use scenarios: advertisement detection and management
     Advertisement detection/identification
     Programming analysis
  73. Use scenarios: information overload reduction
     Improved (more diverse) video search results by clustering all video duplicates.
     [Example: "George Bush" search results, before and after clustering]
  74. Steps in video duplicate detection
     1. Indexing of the reference videos
        A. Obtain features representing the video
        B. Store these features in a scalable manner
     2. Search for queries within the reference set
     [Diagram: offline, reference videos go through feature extraction and indexing into a feature database; online, the query video goes through feature extraction and a search for duplicates]
  75. Ways to approach near-duplicate video detection
     • Local features
       – Extracted from selected frames in the videos
       – Focus on local characteristics within those frames
     • Global features
       – Extracted from selected frames or from the whole video
       – Focus on overall characteristics
  76. Local features
     • This approach comes from previous work on image copy / near-duplicate detection
     • Steps:
       – Keyframes are first extracted from the videos at regular intervals or by detecting shots
       – Local features are obtained for these keyframes: SIFT, SURF, HARRIS, ...
  77. Global features
     • Features are extracted either from the whole video or from keyframes, by looking at the overall image (not at particular points).
     In our work we extract them from the whole video
  78. Multimodal video copy detection
     • Most works use only video/image information
       – They prefer local features for their robustness
     • We introduce audio information by combining global features from both the audio and video tracks
     • We are also experimenting with fusing local features with global features (work in progress)
  79. Multimodal global features
     • We use features based on the changes in the data -> more robust to transformations
     • Video:
       – Hue + saturation interframe change
       – Lightest and darkest centroid interframe distance
     • Audio:
       – Bayesian information criterion (BIC) between adjacent segments
       – Cross-BIC between adjacent segments
       – Kullback-Leibler divergence (KL2) between adjacent segments
  80. Hue+saturation interframe change
     1. Transform the colorspace from RGB to HSV (Hue + Saturation + Value)
  81. Hue+saturation interframe change
     2. For each two consecutive frames, compute their HS histograms and compute their intersection
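The slides do not reproduce the exact intersection formula; a common choice (Swain and Ballard's histogram intersection, shown here as an assumption) can be sketched as:

```python
def histogram_intersection(h1, h2):
    """Intersection of two (flattened) hue-saturation histograms.

    h1, h2: histograms with the same number of bins.  With normalized
    histograms the result lies in [0, 1]: 1 for identical frames, with
    lower values indicating a larger interframe change.
    """
    return sum(min(a, b) for a, b in zip(h1, h2))
```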
  82. Lightest and darkest centroid interframe distance
     1. Find the lightest and darkest regions in each frame and obtain their centroids
  83. Lightest and darkest centroid interframe distance
     2. We compute the Euclidean distance between each two adjacent frames, obtaining two global feature streams
  84. Acoustic features
     • Compute an acoustic distance between adjacent acoustic segments
     [Diagram: segments A and B modeled with GMM A, GMM B and a joint GMM A+B]
  85. Acoustic features (II)
     • Likelihood-based metrics:
       – Bayesian Information Criterion (BIC)
       – Cross-BIC
     • Model distance metrics:
       – Kullback-Leibler divergence (KL2)
  86. Acoustic features (III)
     • For example, the Bayesian Information Criterion (BIC) output:
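As an illustration of a BIC-style change score between two adjacent segments, here is a sketch for 1-D features modeled with single Gaussians. The actual system models multidimensional features with GMMs; the function names and the λ penalty weight are my assumptions.

```python
import math

def delta_bic(seg_a, seg_b, lam=1.0):
    """ΔBIC between two adjacent 1-D feature segments (single Gaussians).

    Positive values suggest the two segments are better modeled
    separately, i.e. there is an acoustic change between them.
    """
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    na, nb = len(seg_a), len(seg_b)
    n = na + nb
    v_ab = var(seg_a + seg_b)        # one Gaussian for the joint segment A+B
    v_a, v_b = var(seg_a), var(seg_b)
    # complexity penalty: 2 extra parameters (mean, variance) for a 2nd Gaussian
    penalty = 0.5 * lam * 2.0 * math.log(n)
    return 0.5 * (n * math.log(v_ab)
                  - na * math.log(v_a)
                  - nb * math.log(v_b)) - penalty
```

Two segments drawn around very different levels give a large positive ΔBIC, while splitting a homogeneous segment gives a negative one (only the penalty remains).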
  87. Search for full copies
     • For each video-query pair we compute the correlation of each feature pair
     [Diagram: the reference and the possible copy go through an FFT, are multiplied, then an IFFT and peak finding follow]
     • We then find the positions with high similarity (peaks).
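The FFT / multiply / IFFT / peak-finding pipeline in the diagram can be sketched in pure Python. This is a sketch assuming both feature streams have been zero-padded to the same power-of-two length; the function names are mine.

```python
import cmath

def fft(x, inverse=False):
    """Radix-2 Cooley-Tukey FFT (len(x) must be a power of two).

    The inverse transform here omits the 1/n factor, which is applied
    once by the caller.
    """
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def correlation_peak(ref, query):
    """Delay at which `query` best matches `ref`, via circular
    cross-correlation computed as IFFT(FFT(ref) * conj(FFT(query)))."""
    n = len(ref)  # both streams zero-padded to the same power-of-two length
    fr = fft([complex(v) for v in ref])
    fq = fft([complex(v) for v in query])
    prod = [a * b.conjugate() for a, b in zip(fr, fq)]
    corr = [c / n for c in fft(prod, inverse=True)]  # apply the 1/n factor
    return max(range(n), key=lambda k: corr[k].real)
```

For example, if the query's feature stream appears 5 frames into the reference stream, the correlation peaks at delay 5.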
  88. Multimodal fusion
     • When multiple modalities are available, fusion is performed on the correlations
  89. Output score
     • The resulting score is computed as a weighted sum of the different modalities' normalized dot products at the found peak
     • Automatic weights are obtained via [equation not captured in the transcript]
  90. Finding subsegments of the query
     • The previously described algorithm assumes the whole query matches a portion of the reference videos
     • To avoid this restriction, a modification to the algorithm first splits the query into overlapping 20 s segments
     • By accumulating the resulting peaks for each segment we can obtain the main delay and its segment
  91. Algorithm performance evaluation
     • To test the algorithm we used the MUSCLE-VCD database:
       – Over 100 hours of reference videos from the SoundVision group (Netherlands)
       – 2 test sets:
         • ST1: 15 query videos where the whole query is considered
         • ST2: 3 videos with 21 segments appearing in the reference database
  92. MUSCLE-VCD transformation examples
  93. Evaluation metrics
     • We use the same metrics as in the MUSCLE-VCD benchmark tests
  94. Evaluation metrics (II)
     • We also use the more standard precision and recall metrics
  95. Evaluation results
  96. Evaluation results histogram for ST1
  97. YouTube reranking application
     • We downloaded all videos returned when searching for the top 20 most viewed and the 20 most visited videos
  98. YouTube reranking application
     • We applied multimodal copy detection and grouped all near duplicates
  99. YouTube reranking test
     • Results show how some videos have multiple clear copies that can boost their ranking once clustered
  100. Thanks for your attention
     xanguera@ti
     LinkedIn: http://
     Twitter: http://
     Website: http://