Multimedia Information Retrieval: What is it, and why isn't ...



  1. 1. Video Search: Opportunities and Challenges. Keynote Speech at ACM MIR Workshop, ACM Multimedia Conference 2005, Singapore. Dr. Ramesh R. Sarukkai, Yahoo! Search
  2. 2. About the Presenter <ul><li>Dr. Ramesh Rangarajan Sarukkai currently heads Video Search engineering at Yahoo, where he manages video search/core media search infrastructure. Over his 6+ year tenure at Yahoo, Dr. Sarukkai has grown and managed different teams, including the Small Business (e-commerce stores, hosting, domains) billing platform, and architected wireless applications & voice portals including 1-800-my-yahoo. Prior to Yahoo, he worked at Kurzweil A.I. and IBM Watson Lab, and collaborated with HP Labs. </li></ul><ul><li>Dr. Sarukkai is the author of the book “Foundations of Web Technology” (Kluwer/Springer 2002). In addition, he holds the first patent on large vocabulary voice web browsing, and has a number of patents awarded/pending in the areas of speech, natural language modeling, personalized search, information retrieval, optimal sponsored search, media and wireless technologies. Dr. Sarukkai also served on the World Wide Web Consortium (W3C) Voice Browser working group, recently chaired a panel at the World Wide Web 2005 conference (Japan) on networking effects of the Web, and has contributed as a reviewer/Program Committee member for a number of leading conferences/journals. His current interests include next-generation media search, social annotation, networking effects on the Web, emergent web technologies and new business models. </li></ul>
  3. 3. Outline <ul><li>Introduction </li></ul><ul><li>Market Trends </li></ul><ul><li>Video Search </li></ul><ul><li>Opportunities </li></ul><ul><li>Challenges </li></ul><ul><li>Conclusion </li></ul>
  4. 4. Introduction <ul><ul><li>Key questions discussed in this talk: </li></ul></ul><ul><ul><ul><li>Why is video search important? </li></ul></ul></ul><ul><ul><ul><li>What broadly constitutes video search? </li></ul></ul></ul><ul><ul><ul><li>How is Web Video Search different? </li></ul></ul></ul><ul><ul><ul><li>What are some opportunities and challenges? </li></ul></ul></ul>
  5. 5. Outline <ul><li>Introduction </li></ul><ul><li>Market Trends </li></ul><ul><li>Video Search </li></ul><ul><li>Opportunities </li></ul><ul><li>Challenges </li></ul><ul><li>Conclusion </li></ul>
  6. 6. Market Trends <ul><li>Whoever controls the media - the images - controls the culture. </li></ul><ul><li>- Allen Ginsberg </li></ul><ul><li>American Poet </li></ul>
  7. 7. Market Trends <ul><li>Broadband doubling over next 3-5 years </li></ul><ul><li>Video enabled devices are emerging rapidly </li></ul><ul><li>Emergence of mass internet audience </li></ul><ul><li>Mainstream media moving to the Web </li></ul><ul><li>International trends are similar </li></ul><ul><li>Money Follows… </li></ul>
  8. 8. Market Trends <ul><ul><li>How many of you are aware of video on the Web? </li></ul></ul><ul><ul><li>How many have viewed a video on the Web in </li></ul></ul><ul><ul><ul><li>The last 3 months? </li></ul></ul></ul><ul><ul><ul><li>The last 6 months? </li></ul></ul></ul><ul><ul><ul><li>Ever? </li></ul></ul></ul><ul><ul><li>Would you watch video on your devices (iPod/wireless)? </li></ul></ul><ul><ul><li>How many of you have produced video (personal or otherwise) recently? </li></ul></ul><ul><ul><li>How many of you have shared that with your friends/community? Would you have liked to? </li></ul></ul>
  9. 9. Market Trends <ul><ul><li>How many of you are aware of video on the Web? </li></ul></ul><ul><ul><ul><li>Large portion of online users </li></ul></ul></ul><ul><ul><li>How many have viewed a video on the Web in </li></ul></ul><ul><ul><ul><li>The last 3 months? </li></ul></ul></ul><ul><ul><ul><ul><li>50% </li></ul></ul></ul></ul><ul><ul><ul><li>The last 6 months? </li></ul></ul></ul><ul><ul><ul><li>Ever? </li></ul></ul></ul><ul><ul><li>Would you watch video on your devices (iPod/wireless)? </li></ul></ul><ul><ul><ul><li>1M downloads in 20 days (iPod) </li></ul></ul></ul><ul><ul><li>How many of you have produced video (personal or otherwise) recently? </li></ul></ul><ul><ul><ul><li>Continuing to skyrocket with digital camera phones/devices </li></ul></ul></ul><ul><ul><li>How many of you have shared that with your friends/community? Would you have liked to? </li></ul></ul><ul><ul><ul><li>Huge interest & adoption in viral communities </li></ul></ul></ul><ul><ul><li>* Source: Forrester Report </li></ul></ul>
  10. 10. Market Trends <ul><ul><li>Technology more media friendly </li></ul></ul><ul><ul><ul><li>Storage costs plummeting (GB → TB) </li></ul></ul></ul><ul><ul><ul><li>CPU speed continuing to double (Moore’s law) </li></ul></ul></ul><ul><ul><ul><li>Increased bandwidth </li></ul></ul></ul><ul><ul><ul><li>Device support for media </li></ul></ul></ul><ul><ul><ul><li>Adding media to sites drives traffic </li></ul></ul></ul><ul><ul><ul><li>Web continues to propel scalable infrastructure for media products/communities </li></ul></ul></ul>
  11. 11. Market Trends <ul><ul><li>Democratization of Mass Media in the 21st Century (aka the long tail) </li></ul></ul><ul><ul><ul><li>Of the people (media is ultimately defined by users) </li></ul></ul></ul><ul><ul><ul><li>By the people (production, tagging) </li></ul></ul></ul><ul><ul><ul><li>For the people (consumption, devices, personalization) </li></ul></ul></ul><ul><ul><ul><li>- Abraham Lincoln as a Media Mogul in the 21st Century! </li></ul></ul></ul><ul><ul><ul><li>“The Long Tail” was coined by Chris Anderson in an article in Wired magazine. </li></ul></ul></ul>
  12. 12. Market Trends <ul><li>Mainstream media migrating online & interactive </li></ul><ul><ul><ul><li>Media content online is growing </li></ul></ul></ul><ul><ul><ul><li>Video Streaming continuing to grow aggressively online. </li></ul></ul></ul><ul><ul><ul><li>More interactive in nature (user voting): </li></ul></ul></ul><ul><ul><ul><ul><li>MTV Interactive </li></ul></ul></ul></ul><ul><ul><ul><ul><li>American Idol </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Reality shows </li></ul></ul></ul></ul>
  13. 13. Market Trends <ul><li>Video search is a key part of the future of digital media </li></ul>
  14. 14. Outline <ul><li>Introduction </li></ul><ul><li>Market Trends </li></ul><ul><li>Video Search </li></ul><ul><li>Opportunities </li></ul><ul><li>Challenges </li></ul><ul><li>Conclusion </li></ul>
  15. 15. Video Search <ul><li>What is Video Search? </li></ul><ul><li>Multimedia? As far as I'm concerned, it's reading with the radio on! </li></ul><ul><li>- Rory Bremner </li></ul><ul><li>Actor/Writer </li></ul>
  16. 16. Video Search (diagram labels): Meta-Content; Video Production; Unstructured Data; Video Consumption/Community; Content Features + Meta-Content; Structured + Unstructured Data
  17. 17. Video Search <ul><ul><li>Media Information Retrieval </li></ul></ul><ul><ul><ul><li>Been around since the late 1970s </li></ul></ul></ul><ul><ul><ul><ul><li>Text Based/DB </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Issues: Manual Annotation, Subjectivity of Human Perception </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><li>Content Based </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Color, texture, shape, face detection/recognition, speech transcriptions, motion, segmentation boundaries/shots </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>High Dimensionality </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Limited success to date. </li></ul></ul></ul></ul></ul><ul><ul><ul><li>Citations: </li></ul></ul></ul><ul><ul><ul><li>“ Image Retrieval: Current Techniques, Promising Directions, and Open Issues” [Rui et al 99] </li></ul></ul></ul>
  18. 18. Video Search <ul><ul><li>Active Research Area </li></ul></ul><ul><ul><ul><li>Graph from “A new perspective on Visual Information Retrieval”, Horst Eidenberger, 2004 </li></ul></ul></ul><ul><ul><ul><li>Black: “Image Retrieval”; Grey:”Video Retrieval”; IEEE Digital Library </li></ul></ul></ul>
  19. 19. Video Search <ul><ul><ul><li>Traditional 3-Step Overview: </li></ul></ul></ul><ul><ul><ul><li>Meta-Data/Visual Feature Extraction </li></ul></ul></ul><ul><ul><ul><li>Multi-dimensional indexing </li></ul></ul></ul><ul><ul><ul><li>Retrieval System design </li></ul></ul></ul>
  20. 20. Video Search <ul><ul><li>Popular features/techniques: </li></ul></ul><ul><ul><ul><ul><li>Color, Texture, Shape descriptors </li></ul></ul></ul></ul><ul><ul><ul><ul><li>OCR, ASR </li></ul></ul></ul></ul><ul><ul><ul><ul><li>A number of prototype or research products with small data sets </li></ul></ul></ul></ul><ul><ul><ul><ul><li>More researched for visual queries </li></ul></ul></ul></ul>
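Color is the first feature named on the slide above. As an illustration only (this is not code from the talk), a minimal Python sketch of a quantized RGB color histogram compared by histogram intersection, in the style of Swain and Ballard, might look like this. The function names, the choice of 4 bins per channel, and the toy frames are all assumptions made for the example; it also assumes a non-empty pixel list.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Quantize 8-bit RGB pixels into a normalized joint color histogram."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {color_bin: n / total for color_bin, n in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(value, h2.get(color_bin, 0.0)) for color_bin, value in h1.items())


# Toy "frames" given as flat lists of (R, G, B) pixels (hypothetical data).
red_frame = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
reddish_frame = [(240, 20, 5)] * 80 + [(10, 10, 250)] * 20
green_frame = [(10, 250, 10)] * 100

h_red = color_histogram(red_frame)
h_reddish = color_histogram(reddish_frame)
h_green = color_histogram(green_frame)

print(round(histogram_intersection(h_red, h_reddish), 3))  # 0.9
print(round(histogram_intersection(h_red, h_green), 3))    # 0.0
```

Because the histogram ignores pixel positions, the score is unchanged by rotating or mirroring a frame, which is one reason color is described as independent of orientation.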
  21. 21. Video Search: Features (timeline figure, 1970 to 2010). Texture: Autocorrelation; Wavelet transforms; Gabor Filters. Shape: Edge Detectors; Moment invariants; Animate Vision Marr; Finite Element Methods; Shape from Motion. Color: Color Moments; Color Histograms; Color Autocorrelograms. Segmentation: Scene segmentation; Shot detection. OCR: Modeling; Successful OCR deployments. Face: Face Detection algorithms; Neural Networks; EigenFaces. ASR: Acoustic analysis; HMMs; N-grams; CSR; LVCSR; Domain Specific. Milestones: NIST Video TREC Starts; Media IR systems; Web Media Search
  22. 22. Video Search: Features <ul><ul><li>Color </li></ul></ul><ul><li>Robust to background </li></ul><ul><li>Independent of size, orientation </li></ul><ul><li>Color Histogram [Swain & Ballard] </li></ul><ul><li>“Sensitive to noise and sparse”: Cumulative Histograms [Stricker & Orengo] </li></ul><ul><li>Color Moments </li></ul><ul><li>Color Sets: Map RGB color space to Hue Saturation Value, & quantize [Smith, Chang] </li></ul><ul><li>Color layout: local color features by dividing image into regions </li></ul><ul><li>Color Autocorrelograms </li></ul><ul><ul><li>Texture </li></ul></ul><ul><li>One of the earliest image features [Haralick et al 70s] </li></ul><ul><li>Co-occurrence matrix </li></ul><ul><li>Orientation and distance on gray-scale pixels </li></ul><ul><li>Contrast, inverse difference moment, and entropy [Gotlieb & Kreyszig] </li></ul><ul><li>Human visual texture properties: coarseness, contrast, directionality, line-likeness, regularity and roughness [Tamura et al] </li></ul><ul><li>Wavelet Transforms [90s] </li></ul><ul><li>[Smith & Chang] extracted mean and variance from wavelet subbands </li></ul><ul><li>Gabor Filters </li></ul><ul><li>And so on </li></ul><ul><ul><li>Region Segmentation </li></ul></ul><ul><li>Partition image into regions </li></ul><ul><li>Strong segmentation: Object segmentation is difficult. </li></ul><ul><li>Weak segmentation: Region segmentation based on some homogeneity criteria </li></ul><ul><li>Scene Segmentation </li></ul><ul><li>Shot detection, scene detection </li></ul><ul><li>Look for changes in color, texture, brightness </li></ul><ul><li>Context based scene segmentation applied to certain categories such as broadcast news </li></ul>
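The texture bullets above mention a gray-level co-occurrence matrix and three statistics derived from it (contrast, inverse difference moment, entropy). The following is a rough Python sketch of that idea, not the formulation used by any specific cited paper: it builds a non-symmetric co-occurrence matrix for a single pixel offset and computes the three statistics. The function names and the tiny test images are assumptions for illustration, and the image is assumed to contain at least one valid pixel pair for the chosen offset.

```python
import math
from collections import Counter


def cooccurrence_matrix(image, dx=1, dy=0):
    """Normalized co-occurrence frequencies of gray levels for pixel pairs offset by (dx, dy)."""
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[(image[y][x], image[ny][nx])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_features(glcm):
    """Contrast, inverse difference moment, and entropy of a co-occurrence matrix."""
    contrast = sum(p * (i - j) ** 2 for (i, j), p in glcm.items())
    idm = sum(p / (1 + (i - j) ** 2) for (i, j), p in glcm.items())
    entropy = -sum(p * math.log2(p) for p in glcm.values())
    return {"contrast": contrast, "inverse_difference_moment": idm, "entropy": entropy}


# Hypothetical 4x4 gray-level images: a flat patch and vertical stripes.
flat = [[1, 1, 1, 1] for _ in range(4)]
stripes = [[0, 3, 0, 3] for _ in range(4)]

flat_features = texture_features(cooccurrence_matrix(flat))
stripe_features = texture_features(cooccurrence_matrix(stripes))

print(flat_features["contrast"])               # 0.0: no gray-level change between neighbors
print(round(stripe_features["contrast"], 3))   # 9.0: every horizontal neighbor differs by 3
print(round(stripe_features["entropy"], 3))    # 0.918
```

The offset `(dx, dy)` plays the role of the "orientation and distance" bullet: a horizontal offset of one pixel is used here, and other offsets would capture texture in other directions.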
  23. 23. Video Search: Features <ul><ul><li>Face </li></ul></ul><ul><li>Face detection is highly reliable </li></ul><ul><li>- Neural Networks [Rowley] </li></ul><ul><li>- Wavelet based histograms of facial features [Schneiderman] </li></ul><ul><li>Face recognition for video is still a challenging problem. </li></ul><ul><li>- EigenFaces: Extract eigenvectors and use as feature space </li></ul><ul><li>OCR </li></ul><ul><li>OCR is a fairly successful technology. </li></ul><ul><li>Accurate, especially with good matching vocabularies. </li></ul><ul><li>Script recognition still an open problem. </li></ul><ul><li>ASR </li></ul><ul><li>Automatic speech recognition is fairly accurate for medium to large vocabulary broadcast type data </li></ul><ul><li>Large number of available speech vendors. </li></ul><ul><li>Still open for free conversational speech in noisy conditions. </li></ul><ul><li>Shape </li></ul><ul><li>Outer boundary based vs. region based </li></ul><ul><li>Fourier descriptors </li></ul><ul><li>Moment invariants </li></ul><ul><li>Finite Element Method (Stiffness matrix: how each point is connected to others; eigenvectors of matrix) </li></ul><ul><li>Turning function based (similar to Fourier descriptor) convex/concave polygons [Arkin et al] </li></ul><ul><li>Wavelet transforms leverage multiresolution [Chuang & Kao] </li></ul><ul><li>Chamfer matching for comparing 2 shapes (linear dimension rather than area) </li></ul><ul><li>3-D object representations using similar invariant features </li></ul><ul><li>Well-known edge detection algorithms. </li></ul>
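The shape bullets above list moment invariants without saying which ones. As one concrete and commonly used instance, here is a small Python sketch of the first two Hu moment invariants for a binary shape. The choice of Hu's invariants, the function name, and the toy shapes are assumptions for illustration; the slide does not specify a particular formulation.

```python
def hu_invariants(points):
    """First two Hu moment invariants of a binary shape given as (x, y) pixel coordinates."""
    n = len(points)  # zeroth moment of a binary shape is its pixel count
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    def eta(p, q):
        # Central moment, normalized by area so the value is scale-normalized.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in points)
        return mu / n ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


# Hypothetical shapes: an L-shaped region, the same L moved and rotated, and a bar.
l_shape = [(0, 0), (0, 1), (0, 2), (1, 0), (2, 0)]
l_moved = [(x + 10, y - 5) for x, y in l_shape]
l_rotated = [(-y, x) for x, y in l_shape]  # 90 degree rotation about the origin
bar = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]

print(hu_invariants(l_shape))
print(hu_invariants(l_moved))    # matches the original up to rounding
print(hu_invariants(l_rotated))  # matches the original up to rounding
print(hu_invariants(bar))        # a different shape gives different values
```

Translation invariance comes from subtracting the centroid, and the two combinations `phi1` and `phi2` are unchanged by rotation, which is what makes them usable as shape descriptors for retrieval.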
  24. 24. Video Search: Video TREC <ul><ul><ul><li>Overview: </li></ul></ul></ul><ul><ul><ul><ul><li>Shot detection, story segmentation, semantic feature extraction, information retrieval </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Corpora of documentaries, advertising films </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Broadcast news added in 2003 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Interactive and non-interactive tests </li></ul></ul></ul></ul><ul><ul><ul><ul><li>CBIR features </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Speech transcribed (LIMSI) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>OCR </li></ul></ul></ul></ul>
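Shot detection is the first task in the overview above. A very simple baseline, sketched here under stated assumptions rather than taken from any TREC system, declares a cut whenever the brightness histograms of consecutive frames differ by more than a threshold. The function names, the 8-bin histogram, the threshold of 0.5, and the toy frames are all hypothetical; frames are assumed to be non-empty flat lists of 0-255 gray values.

```python
def gray_histogram(frame, bins=8):
    """Normalized brightness histogram of a frame given as a flat list of 0-255 values."""
    step = 256 // bins
    hist = [0.0] * bins
    for value in frame:
        hist[value // step] += 1.0 / len(frame)
    return hist


def detect_shot_boundaries(frames, threshold=0.5):
    """Return indices i such that frame i is judged to start a new shot.

    A cut is declared when the L1 distance between consecutive frame
    histograms exceeds `threshold` (an assumed, untuned value).
    """
    boundaries = []
    previous = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        current = gray_histogram(frames[i])
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            boundaries.append(i)
        previous = current
    return boundaries


# Toy sequence: three dark frames (one slightly noisy) followed by two bright frames.
dark = [20] * 64
dark_noisy = [20] * 60 + [40] * 4
bright = [220] * 64
frames = [dark, dark_noisy, dark, bright, bright]

print(detect_shot_boundaries(frames))  # [3]
```

A fixed global threshold like this only handles hard cuts; gradual transitions such as fades and dissolves would need a different treatment.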
  25. 25. Video Search: Video TREC <ul><ul><li>1st TREC (2001) </li></ul></ul><ul><ul><ul><li>Mean Average Precision @100 items (MAP) 0.033 (in general category) </li></ul></ul></ul><ul><ul><ul><li>Transcript only was better than transcript+image aspects+ASR </li></ul></ul></ul><ul><ul><ul><li>There were also different types of test: known items versus general. Known items had at least one result. </li></ul></ul></ul><ul><ul><li>2nd TREC (2002) </li></ul></ul><ul><ul><ul><li>Interactive runs to rephrase queries </li></ul></ul></ul><ul><ul><ul><li>Again, text retrieval based only on ASR was the best-performing system </li></ul></ul></ul><ul><ul><ul><li>Mean Average Precision of 0.137 </li></ul></ul></ul><ul><ul><ul><li>Leading system employed multiple systems: TF-IDF variants (Mean Average Precision MAP 0.093). </li></ul></ul></ul><ul><ul><ul><li>Other was boolean with query expansion (0.101 MAP) </li></ul></ul></ul><ul><ul><ul><li>OCR was not particularly applicable; Phonetic ASR did not help. </li></ul></ul></ul><ul><ul><li>3rd TREC (2003) </li></ul></ul><ul><ul><ul><li>Data changes radically (Broadcast news added: CNN, ABC, CSPAN) </li></ul></ul></ul><ul><ul><ul><li>Baseline ASR + CC → MAP 0.155 </li></ul></ul></ul><ul><ul><ul><li>ASR + CC + VOCR + Image Similarity + Person X retrieval → MAP 0.218 </li></ul></ul></ul><ul><ul><ul><li>* “Successful Approaches in the TREC Video Retrieval Evaluations”, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004. </li></ul></ul></ul>
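The results above are all reported as Mean Average Precision (MAP). For readers unfamiliar with the metric, here is a minimal Python sketch of average precision over a ranked list, averaged across queries. It assumes the common convention of dividing by the total number of relevant items; the exact normalization used in the evaluations cited above is not stated on the slide, and the query data below is made up.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=100):
    """Average of precision values at each rank where a relevant item appears.

    Normalizes by the total number of relevant items (an assumed convention).
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:cutoff], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, cutoff=100):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel, cutoff) for r, rel in runs) / len(runs)


# Two hypothetical queries with their ranked results and relevance judgments.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"z"}),            # AP = (1/3) / 1
]

print(round(average_precision(*runs[0]), 3))   # 0.833
print(round(mean_average_precision(runs), 3))  # 0.583
```

Seen this way, a MAP of 0.033 means relevant shots were rare and ranked low on average, which puts the later figures of 0.155 and 0.218 in context.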
  26. 26. Video Search: Browsing <ul><ul><ul><li>Search by text </li></ul></ul></ul><ul><ul><ul><li>Navigation with customized categories </li></ul></ul></ul><ul><ul><ul><li>Random Browsing </li></ul></ul></ul><ul><ul><ul><li>Search by example </li></ul></ul></ul><ul><ul><ul><li>Search by Sketch </li></ul></ul></ul>
  27. 27. Video Search: <ul><ul><li>Pre-2000 or Research </li></ul></ul><ul><ul><ul><li>QBIC (IBM Almaden) </li></ul></ul></ul><ul><ul><ul><li>Photobook (MIT Media Lab) </li></ul></ul></ul><ul><ul><ul><li>FourEyes (MIT Media Lab) </li></ul></ul></ul><ul><ul><ul><li>Netra (UCSB Digital Library) </li></ul></ul></ul><ul><ul><ul><li>MARS (UCI) </li></ul></ul></ul><ul><ul><ul><li>PicToSeek ( </li></ul></ul></ul><ul><ul><ul><li>VisualSEEK (Columbia) </li></ul></ul></ul><ul><ul><ul><li>PicHunter (NEC) </li></ul></ul></ul><ul><ul><ul><li>ImageRover (BU) </li></ul></ul></ul><ul><ul><ul><li>WebSEEK (Columbia) </li></ul></ul></ul><ul><ul><ul><li>Virage (now Autonomy) </li></ul></ul></ul><ul><ul><ul><li>Visual RetrievalWare (Convera) </li></ul></ul></ul><ul><ul><ul><li>AMORE (NEC) </li></ul></ul></ul><ul><ul><ul><li>BlobWorld (UC Berkeley) </li></ul></ul></ul>
  28. 28. Video Search <ul><ul><li>We have discussed the background of video search, techniques used, and applications. It should be clear that “video search” is a fairly broad term. </li></ul></ul>
  29. 29. Video Search <ul><li>Note the differences in these broader definitions of video search: </li></ul><ul><ul><li>No mention of content based versus meta-data based. </li></ul></ul><ul><ul><li>No mention of the actual browsing paradigm. </li></ul></ul><ul><ul><li>Production and consumption are a key aspect of video search, and should be taken into core consideration in the models. </li></ul></ul><ul><ul><li>Device, personalization and on-demand needs addressed. </li></ul></ul>
  30. 30. Video Search <ul><li>Video Search has evolved a lot since its inception, and the opportunities are much broader. </li></ul><ul><li>Multimedia technologists (We) are defining and influencing the emergent media culture! </li></ul>
  31. 31. Outline <ul><li>Introduction </li></ul><ul><li>Market Trends </li></ul><ul><li>Video Search </li></ul><ul><li>Opportunities </li></ul><ul><li>Challenges </li></ul><ul><li>Conclusion </li></ul>
  32. 33. Opportunities <ul><li>Engineering is the art of compromise, and there is always room for improvement in the real world; but engineering is also the art of the practical. Engineers realize that they must, at some point, curtail design, and begin to manufacture or build. </li></ul><ul><li> </li></ul><ul><li>- H. Petroski </li></ul><ul><li> Invention by Design (Harvard Univ. Press) </li></ul>
  33. 34. Opportunities <ul><li>The Internet Revolution </li></ul><ul><ul><ul><li>Exposed the internet backbone to the masses </li></ul></ul></ul><ul><ul><ul><li>Standardized publication of web media (HTML, CSS, XML) </li></ul></ul></ul><ul><ul><ul><li>Scaled to millions of users </li></ul></ul></ul><ul><ul><ul><li>Pre-dot-com bust: slow emergence of media search (notably AltaVista Audio-Visual search) </li></ul></ul></ul><ul><ul><ul><li>Post-dot-com bust </li></ul></ul></ul><ul><ul><ul><ul><li>Recent trends of community tagging, especially for images: Flickr (Y!), del.icio.us, and so on. </li></ul></ul></ul></ul>
  34. 35. Opportunities <ul><ul><li>Media Search Now </li></ul></ul><ul><ul><ul><li>Yahoo! Image, Video, Audio, Podcast searches </li></ul></ul></ul><ul><ul><ul><li>Flickr (Y!) </li></ul></ul></ul><ul><ul><ul><li>AOL/SingingFish </li></ul></ul></ul><ul><ul><ul><li>Blinkx TV </li></ul></ul></ul><ul><ul><ul><li>Google </li></ul></ul></ul><ul><ul><ul><li>Many smaller companies </li></ul></ul></ul>
  35. 35. Opportunities <ul><ul><li>Web Based Video Search </li></ul></ul><ul><ul><ul><li>Web Search meets Media Search </li></ul></ul></ul><ul><ul><ul><li>Leverage Meta-content from: </li></ul></ul></ul><ul><ul><ul><ul><li>Web (inferred meta-data) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Community (Tagging and user submission/mRSS) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Structured Meta-Data from different sources </li></ul></ul></ul></ul><ul><ul><ul><li>Augment with limited media analysis </li></ul></ul></ul><ul><ul><ul><ul><li>OCR, ASR, Classification </li></ul></ul></ul></ul>
  36. 36. Video Search (diagram labels): Meta-Content; Video Production; Unstructured Data; Video Consumption/Community; Content Features + Meta-Content; Structured + Unstructured Data; Web Video Search; Traditional Media Search
  37. 37. Opportunities <ul><ul><li>Yahoo! Media Search </li></ul></ul><ul><ul><ul><li>Billions of Images </li></ul></ul></ul><ul><ul><ul><li>Millions of Videos </li></ul></ul></ul><ul><ul><ul><li>Hundreds of Millions of users </li></ul></ul></ul><ul><ul><ul><li>Reasonable Precision rates </li></ul></ul></ul>
  38. 38. Opportunities <ul><ul><li>Example 1: </li></ul></ul><ul><ul><li>User query “zorro” </li></ul></ul><ul><ul><li>User 1 wants to see Zorro videos </li></ul></ul><ul><ul><li>User 2 wants to see Legend of Zorro movie clips </li></ul></ul><ul><ul><li>User 3 just wants to see home videos about Zorro </li></ul></ul><ul><ul><li>Can content based analysis help over structured meta-data query inference? </li></ul></ul>
  39. 40. Opportunities <ul><ul><li>Example 2: </li></ul></ul><ul><ul><li>For main-stream head content such as news videos. </li></ul></ul><ul><ul><li>Meta-data are fairly descriptive </li></ul></ul><ul><ul><li>Usually queried based on non-visual attributes. </li></ul></ul><ul><ul><li>Task: “Pull up recent Hurricane Katrina videos” </li></ul></ul>
  40. 42. Opportunities <ul><ul><li>Example 3: </li></ul></ul><ul><ul><li>Creative Home Video </li></ul></ul><ul><ul><li>Community video rendering! </li></ul></ul><ul><ul><li>The now “famous” Star Wars Kid </li></ul></ul><ul><ul><li>Example of “social buzz” combined with innovative tail content video production. </li></ul></ul>
  41. 43. Play Video 1 Play Video 2
  42. 44. Opportunities <ul><ul><li>A hard example: </li></ul></ul><ul><ul><li>Let's take an example: </li></ul></ul><ul><ul><li>“Suppose you want to find videos that depict a monkey/chimp doing karate”! </li></ul></ul><ul><ul><li>CBIR Approach: </li></ul></ul><ul><ul><li>Train models for Chimps/Monkeys </li></ul></ul><ul><ul><li>Motion Analysis for Karate movement models </li></ul></ul><ul><ul><li>Many open issues/problems! </li></ul></ul>
  43. 45. Play Video
  44. 46. Opportunities <ul><li>What are the key opportunities for Video Search? </li></ul><ul><li>Automatic Speech Recognition (ASR) as a comparable case study </li></ul>
  45. 47. Opportunities <ul><li>Speech Recognition: Case study </li></ul><ul><ul><ul><li>From small experimental models to usable large scale products in restricted domains. </li></ul></ul></ul><ul><ul><ul><li>60’s – Very small experimental tests; Digit recognition; </li></ul></ul></ul><ul><ul><ul><li>70’s – Core Acoustic modeling work; Vocal Tract Modeling; Formant analysis; Signal Processing (FFT, MFCC) </li></ul></ul></ul><ul><ul><ul><li>80’s – HMMs; Statistical Language Models; Large collections of acoustic data. Very Large collection of non-acoustic Text Data; Speaker ID; </li></ul></ul></ul><ul><ul><ul><li>90’s – Large vocabulary speech recognition deployments in many domains (IVR, LVCSR) </li></ul></ul></ul><ul><ul><ul><li>Natural Language Understanding, Core Background Noise/Multiple speakers, Spontaneous Dialog still open research problems </li></ul></ul></ul>
  46. 48. Opportunities (speech recognition timeline figure, 1970 to 2010): Acoustic Modeling; Vocal Tract Modeling; Formant Analysis; DSP: FFT, MFCC; Digit Recognizers; Viterbi search; HMMs; Sub-phonetic Models; FSMs, Grammars, Statistical N-grams; Digit Recognition; Small Vocabulary Discrete; Medium Vocabulary Continuous; Large Vocabulary Continuous; LVCSR Telephony/Broadcast
  47. 49. Opportunities (speech recognition timeline figure, 1970 to 2010, annotated): Acoustic Modeling; Vocal Tract Modeling; Formant Analysis; DSP: FFT, MFCC; Digit Recognizers; Viterbi search; HMMs; Sub-phonetic Models; FSMs, Grammars, Statistical N-grams; Digit Recognition; Small Vocabulary Discrete; Medium Vocabulary Continuous; Large Vocabulary Continuous; LVCSR Telephony/Broadcast. Annotations: Millions of non-audio Text Data; Hundreds of Hours of Audio Data; Modeling Framework; Milestones
  48. 50. Opportunities <ul><li>Speech Recognition Case Study Key Milestones: </li></ul><ul><ul><li>Foundations: HMMs, Viterbi search, Statistical Language Models </li></ul></ul><ul><ul><li>Creative use of Data: Millions of non-audio text data from Wall Street Journal Corpus applied for Language model training </li></ul></ul><ul><ul><li>Data Crunching at the lowest levels: Hundreds of hours of data applied to model thousands of sub-phonetic models </li></ul></ul>
  49. 51. Opportunities <ul><li>Speech Recognition Case Study: </li></ul><ul><li>Exploiting non-media sources of meta/text data is beneficial </li></ul><ul><li>Applying contextual constraints enables improved applicability </li></ul><ul><li>Large scale data crunching does indeed help solve hard problems </li></ul>
  50. 52. Opportunities <ul><ul><li>Web Video Search: </li></ul></ul><ul><ul><ul><li>Mass Adoption: Large Community of users/taggers/web developers </li></ul></ul></ul><ul><ul><ul><li>Different sources of Media and non-Media data </li></ul></ul></ul><ul><ul><ul><li>Highly Scalable systems </li></ul></ul></ul>
  51. 53. Opportunities <ul><li>The ASR takeaways can be applied to video search: </li></ul><ul><li>Exploiting non-media sources of meta/text data is beneficial (Web Meta-data, Community Tagging) </li></ul><ul><li>Applying contextual constraints enables improved applicability (User models, structured data, Tailored) </li></ul><ul><li>Large scale data crunching does indeed help solve hard problems (Throw Horsepower) </li></ul>
  52. 54. Opportunities <ul><ul><li>Web Browsing Paradigms </li></ul></ul><ul><ul><ul><li>Search by text (Very popular) </li></ul></ul></ul><ul><ul><ul><li>Navigation with customized categories (Very Popular) </li></ul></ul></ul><ul><ul><ul><li>Random Browsing (Popular for certain categories: Discovery) </li></ul></ul></ul><ul><ul><ul><li>Search by example (Not popular currently) </li></ul></ul></ul><ul><ul><ul><li>Search by Sketch (Not popular) </li></ul></ul></ul><ul><ul><ul><li>Combined usage models - users skipping to scan and then pick selective videos to play. </li></ul></ul></ul>
  53. 55. Opportunities <ul><ul><li>Leverage Community </li></ul></ul><ul><ul><li>Exploit Structured Meta-Data </li></ul></ul><ul><ul><li>Constrain with Context </li></ul></ul>
  54. 56. Opportunities <ul><ul><li>Tying into social network modeling for media search </li></ul></ul><ul><ul><ul><li>How do you define social networks? </li></ul></ul></ul><ul><ul><ul><li>How do you integrate into video search indexing/ranking? </li></ul></ul></ul><ul><ul><ul><li>How do you filter out the noise from the authoritative ones? </li></ul></ul></ul>
  55. 57. Opportunities <ul><ul><li>Exploit Structured Meta-Data </li></ul></ul><ul><ul><ul><li>Video has a number of different editorial sources; Exploit such structured data </li></ul></ul></ul><ul><ul><ul><li>Tailor results to users’ perceived needs </li></ul></ul></ul><ul><ul><ul><ul><li>Music Videos, Movies – aka traditional video markets </li></ul></ul></ul></ul>
  56. 58. Opportunities <ul><ul><li>Constrain with context </li></ul></ul><ul><ul><ul><li>Occam’s razor: the simpler the model and the fewer the parameters to train, the more general the model. </li></ul></ul></ul><ul><ul><ul><li>Rather than throw in all meta + media feature vectors, tailor the model. </li></ul></ul></ul><ul><ul><ul><li>A decision tree based approach to picking the right algorithms. </li></ul></ul></ul><ul><ul><ul><ul><li>E.g. a news broadcast will require certain types of content features and meta-data. </li></ul></ul></ul></ul><ul><ul><ul><li>Context can also be constrained by user (preferences), device parameters (media type, geo-location) </li></ul></ul></ul>
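The "decision tree based approach to picking the right algorithms" above can be pictured as a handful of context checks that decide which analyzers to run for a given video. The Python sketch below is purely illustrative: the category names, analyzer names, and rules are hypothetical placeholders, not a description of any deployed system.

```python
def select_analyzers(category, has_closed_captions=False, device="desktop"):
    """Hand-written decision rules choosing which (hypothetical) analyzers to run."""
    analyzers = ["web_metadata"]  # inferred web meta-data is assumed to always be available

    if category == "broadcast_news":
        # Prefer existing captions; fall back to speech recognition otherwise.
        analyzers.append("closed_captions" if has_closed_captions else "asr")
        analyzers += ["story_segmentation", "speaker_detection"]
    elif category == "music_video":
        analyzers += ["structured_catalog", "music_speech_detection"]
    else:
        # Tail content such as home video: lean on community tags.
        analyzers.append("community_tags")

    if device == "mobile":
        analyzers.append("short_clip_filter")  # hypothetical device-specific constraint
    return analyzers


print(select_analyzers("broadcast_news"))
print(select_analyzers("broadcast_news", has_closed_captions=True))
print(select_analyzers("home_video", device="mobile"))
```

The design point is that each branch runs only a few analyzers suited to its context, instead of feeding every meta-data and media feature into one large model.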
  57. 59. Outline <ul><li>Introduction </li></ul><ul><li>Market Trends </li></ul><ul><li>Video Search </li></ul><ul><li>Opportunities </li></ul><ul><li>Challenges </li></ul><ul><li>Conclusion </li></ul>
  58. 60. Challenges <ul><li>A man will occasionally stumble over truth, but usually manages to pick himself up, walk over or around it and carry on. </li></ul><ul><li>- Winston Churchill </li></ul>
  59. 61. Challenges <ul><ul><li>Theoretical Frameworks </li></ul></ul><ul><ul><li>Leverage CBIR solutions </li></ul></ul><ul><ul><li>Learn User Needs/Behavior </li></ul></ul><ul><ul><li>Inherit Web Search Problems </li></ul></ul>
  60. 62. Challenges <ul><ul><li>Theoretical Frameworks </li></ul></ul><ul><ul><ul><li>Establishing the right theoretical frameworks for video search is still a problem for a variety of reasons: </li></ul></ul></ul><ul><ul><ul><ul><li>Lack of proper classified data </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Large dimensionality; Different sources of meta-data. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Difficult to define the correct “perceived relevance” metric (“Semantic Gap*”) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Cost functions such as “average precision” are not differentiable & need to be approximated with heuristics </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Integrate with Application specific cost functions: E.g. search: Click through rates, clicks, video views, customer time spent </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Sometimes ranking needs to be based on user buzz, quality and other non-IR metrics </li></ul></ul></ul></ul><ul><ul><ul><li>* “Content-Based Image Retrieval at the End of the Early Years”, Arnold Smeulders et al, IEEE PAMI, Dec. 2000 </li></ul></ul></ul>
  61. 63. Challenges <ul><ul><li>Leverage CBIR </li></ul></ul><ul><ul><ul><li>Video TREC Summary: </li></ul></ul></ul><ul><ul><ul><ul><li>Text based retrieval is still much better than non-text based search. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Visual queries have poor MAP, but are amenable to interactive search. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Commercial detectors work okay with broadcast news, but not with documentaries that have no commercials. (also face detection) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Visual features are seldom useful, though not without exceptions: edge detectors usually helped only in aircraft and animal categories, when color features were not as discriminative. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>ASR was not as useful in tasks such as shot detection (as opposed to indexing the whole clip). OCR is closer to the actual occurrence of the word. </li></ul></ul></ul></ul><ul><ul><ul><li>“Successful Approaches in the TREC Video Retrieval Evaluations”, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004. </li></ul></ul></ul><ul><ul><ul><li>“Video Retrieval using Speech and Image Information”, Alex Hauptmann et al. </li></ul></ul></ul>
  62. 64. Challenges <ul><ul><li>Leverage CBIR </li></ul></ul><ul><ul><ul><li>Apply what works in a tailored manner: </li></ul></ul></ul><ul><ul><ul><ul><li>Color, Texture, OCR, Speech Recognition </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Classification (in restricted domains) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Copyright infringement prevention </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Story-boarding/Shots </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Music/Speech Detection </li></ul></ul></ul></ul><ul><ul><ul><li>Apply other techniques in selective domains </li></ul></ul></ul><ul><ul><ul><ul><li>Broadcast news: ASR, Segmentation, Speaker Detection </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Motion for event detection restricted to time regions isolated by ASR </li></ul></ul></ul></ul>
  63. 65. Challenges <ul><ul><ul><li>Overcome the temptation to “solve the computer vision” problem*. </li></ul></ul></ul><ul><ul><ul><ul><li>Yet leverage computer vision techniques </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Today, CBIR is mostly image based </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Leverage motion flow information more </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Apply techniques in compressed domain (or leverage compressed data- e.g. MPEG-4 already computes local/global motion) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Strong Image Segmentation is hard- Consistent Segmentation across image sequences in video can be exploited. </li></ul></ul></ul></ul><ul><ul><ul><li>* “Guest Introduction: The changing Shape of Computer Vision in the Twenty-First Century”, M. Shah, IJCV, 2002 </li></ul></ul></ul>
  64. 66. Challenges <ul><ul><li>User Needs/Behavior </li></ul></ul><ul><ul><ul><li>Understand user needs </li></ul></ul></ul><ul><ul><ul><ul><li>Better modeling of query and user intent for video search </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Not much work has been done on user intent analysis for video search </li></ul></ul></ul></ul><ul><ul><ul><li>Leverage User Behavior </li></ul></ul></ul><ul><ul><ul><ul><li>Implicit versus explicit relevance feedback </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Challenges in tracking user feedback </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Blind/Implicit feedback is useful for limited domains </li></ul></ul></ul></ul><ul><ul><ul><li>How do you assess user relevance? </li></ul></ul></ul><ul><ul><ul><ul><li>More challenging for video since there is a temporal duration associated with the actions. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Explicit ratings are an option, but not incentivizable </li></ul></ul></ul></ul>
  65. 67. Challenges <ul><ul><li>Inherit Web Search Problems </li></ul></ul><ul><ul><ul><li>Meta-data Spam </li></ul></ul></ul><ul><ul><ul><ul><li>Users spam through Web or tagging </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Leverage Web Search solutions of spam detection based on text, linkage analysis, or user disrepute. </li></ul></ul></ul></ul><ul><ul><ul><li>Potential dilution due to “long tail” </li></ul></ul></ul><ul><ul><ul><li>How do you balance classic web search ranking with media based ranking? </li></ul></ul></ul>
  66. 68. Outline <ul><li>Introduction </li></ul><ul><li>Market Trends </li></ul><ul><li>Video Search </li></ul><ul><li>Opportunities </li></ul><ul><li>Challenges </li></ul><ul><li>Conclusion </li></ul>
  67. 69. Conclusion <ul><li>If you look back too much, you will soon be headed that way. </li></ul><ul><ul><li>- Unknown </li></ul></ul>
  68. 70. Conclusion <ul><ul><li>In this presentation, we covered: </li></ul></ul><ul><ul><li>Market trends that are fuelling large scale demand globally for video search. </li></ul></ul><ul><ul><li>The changing nature of video search, largely driven by the Web. </li></ul></ul><ul><ul><li>Different systems and the techniques that have been applied to media search. </li></ul></ul><ul><ul><li>Key areas of opportunity and challenge. </li></ul></ul>
  69. 71. Conclusion <ul><ul><li>Video Search is here to stay. </li></ul></ul><ul><ul><li>We (the attendees) have an opportunity to define and shape next generation media production, consumption, usage and ultimately the culture. </li></ul></ul>
  70. 72. Conclusion <ul><ul><li>Many thanks to: </li></ul></ul><ul><ul><ul><li>MIR Workshop organizers </li></ul></ul></ul><ul><ul><ul><ul><ul><li>Qi Tian </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Alex Jaimes </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>ACM Multimedia 2005 Conference Organizers </li></ul></ul></ul></ul></ul><ul><ul><ul><li>Y! Search </li></ul></ul></ul><ul><ul><ul><ul><ul><li>John Thrall </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Qi Lu </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Bradley Horowitz </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Video Search team, esp. Ruofei (Bruce) Zhang for help with references </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Y! Blore R&D: Sesha Shah, Srinivasan, YBL (Marc Davis et al) & YRL (Malcolm Slaney) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>- Prof. S.K. Rangarajan for help with quotes </li></ul></ul></ul></ul></ul>
  71. 73. Conclusion <ul><ul><li>Thanks to the audience </li></ul></ul><ul><ul><li>Questions? </li></ul></ul>
  72. 74. Appendix: Leveraging Research & Industrial Communities <ul><ul><li>Scalable systems deployed and processing millions of media files </li></ul></ul><ul><ul><li>End-user testing with huge volume of traffic </li></ul></ul><ul><ul><li>General problem areas covered in R&D (e.g. VideoTREC) are very relevant </li></ul></ul><ul><ul><li>Selective application and proper modelling of information fusion is key. </li></ul></ul><ul><ul><li>How do we share or leverage that data to foster research? </li></ul></ul><ul><ul><li>How can industries help? </li></ul></ul><ul><ul><li>How can universities help? </li></ul></ul>