Transcript

  • 1. Multimedia Segmentation and Summarization. Dr. Jia-Ching Wang, Honorary Fellow, ECE Department, UW-Madison
  • 2. Outline
    • Introduction
    • Speaker Segmentation
    • Video Summarization
    • Conclusion
  • 3. What is Multimedia?
    • Image
    • Video
    • Speech
    • Audio
    • Text
  • 4. Multimedia Everywhere
    • Fax machines: transmission of binary images
    • Digital cameras: still images
    • iPod / iPhone & MP3
    • Digital camcorders: video sequences with audio
    • Digital television broadcasting
    • Compact disk (CD), Digital video disk (DVD)
    • Personal video recorder (PVR, TiVo)
    • Images on the World Wide Web
    • Video streaming, video conferencing
    • Video on cell phones, PDAs
    • High-definition televisions (HDTV)
    • Medical imaging: X-ray, MRI, ultrasound
    • Military imaging: multi-spectral, satellite, microwave
  • 5. What is Multimedia Content?
    • Multimedia content: the syntactic and semantic information inherent in digital material.
    • Example: text document
      • Syntactic content: chapter, paragraph
      • Semantic content: key words, subject, types of text document, etc.
    • Example: video document
      • Syntactic content: scene cuts, shots
      • Semantic content: motion, summary, index, caption, etc.
  • 6. Why Do We Need to Know Multimedia Content?
    • Why do we need to know multimedia content?
      • Information processing, in terms of archiving, indexing, delivering, accessing, and other processing, requires in-depth knowledge of the content to optimize performance.
  • 7. How to Know Multimedia Content?
    • How to Know Multimedia Content?
      • Multimedia content analysis
        • The computerized understanding of the semantic/syntactic content of a multimedia document
    • Multimedia content analysis usually involves
      • Segmentation
        • Segmenting the multimedia document into units
      • Classification
        • Classifying each unit into a predefined type
      • Annotation
        • Annotating the multimedia document
      • Summarization
        • Summarizing the multimedia document
  • 8. Multimedia Segmentation and Summarization
    • Multimedia segmentation
      • Syntactic content
    • Multimedia summarization
      • Semantic/syntactic content
    • The result of temporal segmentation can benefit video summarization
  • 9. Multimedia Segmentation
    • Image segmentation
    • Video segmentation
      • Scene change, shot change
    • Audio segmentation
      • Audio class change
    • Speech segmentation
      • Speaker change detection
    • Text Segmentation
      • word segmentation, sentence segmentation, topic change detection
  • 10. Multimedia Summarization
    • Image summarization
      • Region of interest
    • Video summarization
      • Storyboard, highlight
    • Audio summarization
      • Main theme in music, chorus in a song, event sounds in an environmental sound stream
    • Speech summarization
      • Speech abstract
    • Text summarization
      • Abstract
  • 11. What is Speaker Segmentation?
    • It can also be called speaker change detection (SCD)
    • Assumption: there is no overlap between any two speakers' streams
    (Diagram: an audio stream partitioned into consecutive segments from speaker1, speaker2, and speaker3)
  • 12. Supervised vs. Unsupervised SCD
    • Supervised manner: the acoustic data come from distinct speakers who are known a priori
      • Recognition-based solution
    • Unsupervised manner: no prior knowledge about the number or identities of the speakers
      • Metric-based criterion
      • Model selection-based criterion
  • 13. Supervised Speaker Segmentation -- Gaussian Mixture Model
    • Gaussian mixture modeling (GMM)
    • Incoming audio stream is classified into one of D classes in a maximum likelihood manner at time t
    The GMM likelihood of a d-dimensional random vector $x$ is $p(x \mid \lambda) = \sum_{i=1}^{M} w_i\, g(x; \mu_i, \Sigma_i)$, where $w_i$, $i = 1, \ldots, M$, are the mixture weights, $\mu_i$ the mean vectors, and $\Sigma_i$ the covariance matrices. (A classification sketch follows below.)
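A minimal sketch of this maximum-likelihood classification, assuming one GMM is trained per known speaker; the 16-component diagonal-covariance setting and all function names are illustrative, not from the slides:

```python
# Sketch: classify each frame of a feature stream (e.g. MFCCs) to the speaker
# whose GMM gives the highest log-likelihood (maximum-likelihood decision).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(train_data, n_components=16):
    """train_data: dict mapping speaker name -> (num_frames, d) feature array."""
    gmms = {}
    for speaker, feats in train_data.items():
        gmms[speaker] = GaussianMixture(n_components=n_components,
                                        covariance_type='diag').fit(feats)
    return gmms

def classify_stream(gmms, stream_feats):
    """stream_feats: (num_frames, d) array; returns one speaker label per frame."""
    speakers = list(gmms.keys())
    # per-frame log-likelihood under every speaker model, shape (num_frames, D)
    loglik = np.stack([gmms[s].score_samples(stream_feats) for s in speakers], axis=1)
    return [speakers[i] for i in np.argmax(loglik, axis=1)]
```

In the supervised setting the D classes are the known speakers, so these per-frame decisions directly yield a segmentation of the stream.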
  • 14. Supervised Speaker Segmentation -- Hidden Markov Model
  • 15. Unsupervised Speaker Segmentation -- Sliding Window Strategy & Detection Criterion
    • Metric-based criterion (the dissimilarity between the acoustic feature vectors of adjacent windows is measured; the Bhattacharyya case is sketched after this slide)
      • Kullback-Leibler distance
      • Mahalanobis distance
      • Bhattacharyya distance
    • Model selection-based criterion
      • Bayesian information criterion (BIC)
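As a concrete instance of a metric-based criterion, here is a sketch of the Bhattacharyya distance between the two windows of a sliding-window pair, under my added assumption that each window is modeled by a single full-covariance Gaussian:

```python
# Sketch: Bhattacharyya distance between two analysis windows, each modeled
# as a multivariate Gaussian estimated from its feature vectors.
import numpy as np

def bhattacharyya_distance(x1, x2):
    """x1, x2: (num_frames, d) feature matrices of the two sliding windows."""
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    s1, s2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
    s = 0.5 * (s1 + s2)
    diff = mu2 - mu1
    term1 = 0.125 * diff @ np.linalg.solve(s, diff)
    # log-determinants via slogdet for numerical stability
    term2 = 0.5 * (np.linalg.slogdet(s)[1]
                   - 0.5 * (np.linalg.slogdet(s1)[1] + np.linalg.slogdet(s2)[1]))
    return term1 + term2
```

A large distance between adjacent windows suggests a speaker change, but the decision still needs a threshold, which is the weakness of metric-based criteria noted on slide 18.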
  • 16. Bayesian Information Criterion
    • Model selection
      • Choose one among a set of candidate models $M_i$, $i = 1, 2, \ldots, m$, and the corresponding model parameters to represent a given data set $D = (D_1, D_2, \ldots, D_N)$.
    • Model Posterior Probability
    • Bayesian information criterion
      • Maximized log data likelihood for the given model, with a model-complexity penalty
      • Bayesian information criterion of model $M_i$: $\mathrm{BIC}(M_i) = \log p(D \mid \hat{\theta}_i, M_i) - \tfrac{1}{2} d_i \log N$, where $d_i$ is the number of independent parameters in the model parameter set of $M_i$
  • 17. Unsupervised Segmentation Using Bayesian Information Criterion
    • First model: all the data in the analysis window are generated by a single Gaussian
    • Second model: the data before and after the hypothesized change point are generated by two different Gaussians
    • Bayesian information criterion: a speaker change is accepted when the BIC difference between the two models ($\Delta\mathrm{BIC}$) is positive (a sketch of this test follows below)
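A minimal sketch of the ΔBIC test from slides 16-17, assuming full-covariance Gaussians for the whole window and for its two halves; the penalty weight lam is the usual tunable parameter and is not fixed by the slides:

```python
# Sketch: Delta-BIC test for a hypothesized change point inside a window.
# Model 1: all frames are drawn from one Gaussian.
# Model 2: frames before/after the change point are drawn from two Gaussians.
import numpy as np

def delta_bic(window, change_idx, lam=1.0):
    """window: (N, d) feature matrix; change_idx: hypothesized change frame."""
    x1, x2 = window[:change_idx], window[change_idx:]
    n, d = window.shape
    n1, n2 = len(x1), len(x2)

    def logdet_cov(data):
        return np.linalg.slogdet(np.cov(data, rowvar=False))[1]

    # likelihood-ratio term (constants that cancel are dropped)
    ratio = 0.5 * (n * logdet_cov(window)
                   - n1 * logdet_cov(x1) - n2 * logdet_cov(x2))
    # penalty: extra free parameters of the two-Gaussian model
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return ratio - penalty  # > 0 suggests a speaker change at change_idx
```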
  • 18. Disadvantages of Conventional Unsupervised Speaker Change Detection
    • Disadvantages:
    • For metric-based methods, it is not easy to decide on a suitable threshold
    • For BIC, it is not easy to detect speaker segments shorter than 2 seconds
  • 19. Proposed Method -- Misclassification Error Rate
    • Sliding window pairs
    • Feature vector distribution
    (Diagrams: feature-vector distributions of a sliding-window pair; they overlap for the same speaker and separate for different speakers)
  • 20. Mathematical Analysis
  • 21. Mathematical Analysis
  • 22. Discussion
    • Generative and discriminant classifiers are both applicable
    • Key point: discriminant classifiers have the benefit of requiring less data
      • We can use a smaller scanning window
      • The ability to detect short speaker-change segments increases
  • 23. Speaker Segmentation Using Misclassification Error Rate
    • Steps
      • Preprocessing
        • Framing, Feature extraction
      • Hypothesized speaker change point selection
      • Forcing 2-class labels
      • Training a discriminant hyperplane (sketched below)
      • Inside-data recognition and misclassification error rate calculation
      • Accept/reject the hypothesized speaker change point
    • Significance
      • The unsupervised speaker segmentation problem is solved by supervised classification
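The slides do not name a particular discriminant classifier, so the following is only an illustrative sketch of steps 3-6, using a linear SVM as the discriminant hyperplane; the acceptance threshold is my own placeholder:

```python
# Sketch: misclassification-error-rate test for one hypothesized change point.
# Force the left/right windows into two classes, train a discriminant
# hyperplane, and measure how separable the two windows really are.
import numpy as np
from sklearn.svm import LinearSVC

def inside_error_rate(left_win, right_win):
    """left_win, right_win: (num_frames, d) features of the sliding-window pair."""
    feats = np.vstack([left_win, right_win])
    labels = np.concatenate([np.zeros(len(left_win), dtype=int),
                             np.ones(len(right_win), dtype=int)])
    clf = LinearSVC().fit(feats, labels)           # forced 2-class training
    return np.mean(clf.predict(feats) != labels)   # inside-data error rate

def accept_change_point(left_win, right_win, threshold=0.2):
    # Same speaker: the two windows' distributions overlap, the forced labels
    # cannot be separated, and the error rate stays high.
    # Different speakers: the hyperplane separates them and the error is low.
    return inside_error_rate(left_win, right_win) < threshold
```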
  • 24. Experimental Results
    Method     Recall   Precision   F-score
    BIC        75.7     54.4        63.3
    Proposed   81.3     70.2        71.8
  • 25. Video Summarization
    • Dynamic vs. Static Video Summarization
      • Dynamic video summarization
        • Sport highlight, movie trailer
      • Static video summarization
        • Storyboard
          • Visual-based approach
          • Incorporation of semantic information
  • 26. Static Video Summarization -- Visual Based Approach
    • Example
    • Problem
      • Is the summarization ratio adjustable?
      • How to generate effective storyboard under a given summarization ratio?
  • 27. How to Generate Effective Storyboard
    • Question: assume there are n frames and the summarization ratio is r/n. How do we select the best r frames?
    • Complexity:
      • There are C( n , r ) different choices
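For scale, a quick worked example (the 100-frame figure below is mine, purely illustrative):

```latex
\binom{n}{r} = \frac{n!}{r!\,(n-r)!}, \qquad
\binom{100}{10} \approx 1.7 \times 10^{13}
\ \text{candidate storyboards for a 100-frame clip at ratio } 10/100.
```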
  • 28. How to Generate Effective Storyboard
    • In visual viewpoint
      • Most visually distinct frames should be extracted
      • Dissimilarity between two frames is measured by low-level visual features
    • How to select the best r frames from n frames
      • Solution: maximize the overall pairwise dissimilarities
      • Complexity: C(n, r) × C(r, 2)
      • Infeasible: C(n, r) is usually huge
    • Fact
      • Human beings usually browse a storyboard in a sequential way
    • Optimal solution in a sequential sense
      • Maximize the sum of dissimilarities between sequentially adjacent images in the storyboard
  • 29. How to Maximize the Dissimilarity Sum of the Extracted Images
    • Lattice-based representative frame extraction approach
      • Extract key components from a temporal sequence
      • Dynamic programming can be applied
    • Example: how to select the best 4 images from an 8-image sequence
  • 30. How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images
    • Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8)
    • Extracted images: E(1), E(2), E(3), E(4)
    • E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l), where i < j < k < l
    • Each legal left-to-right path represents a way to extract images
    • Each transition results in an adjacent dissimilarity
    • In this example, the adjacent dissimilarity sum of the extracted images is D[O(1),O(3)] + D[O(3),O(4)] + D[O(4),O(7)] (a dynamic-programming sketch of this selection follows slide 31)
  • 31. How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images
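A minimal sketch of the lattice/dynamic-programming selection described on slides 29-31; the frame-dissimilarity function dist is assumed to be supplied (e.g., a low-level color-histogram distance), and all names are illustrative:

```python
# Sketch: pick r of n frames maximizing the sum of dissimilarities between
# sequentially adjacent selected frames, via dynamic programming on a lattice.
def select_frames(frames, r, dist):
    """frames: list of n frames; dist(a, b): dissimilarity; returns r indices."""
    n, NEG = len(frames), float('-inf')
    # best[k][j]: best score with E(k+1) placed at O(j+1); back[k][j]: predecessor
    best = [[NEG] * n for _ in range(r)]
    back = [[-1] * n for _ in range(r)]
    for j in range(n - r + 1):                   # legal start positions for E(1)
        best[0][j] = 0.0
    for k in range(1, r):                        # place E(k+1)
        for j in range(k, n - (r - 1 - k)):      # legal positions for E(k+1)
            for i in range(k - 1, j):            # candidate positions for E(k)
                if best[k - 1][i] == NEG:
                    continue
                score = best[k - 1][i] + dist(frames[i], frames[j])
                if score > best[k][j]:
                    best[k][j], back[k][j] = score, i
    # trace back from the best final position
    j = max(range(r - 1, n), key=lambda idx: best[r - 1][idx])
    path = [j]
    for k in range(r - 1, 0, -1):
        j = back[k][j]
        path.append(j)
    return path[::-1]
```

For the 8-frame, 4-image example above, this evaluates exactly the 45 trellis edges counted on the next slide.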
  • 32. Complexity Comparison
    • Select 4 images from an 8-image sequence
      • Lattice-based approach
        • 45 dissimilarity comparisons
      • Optimal approach
        • 420 dissimilarity comparisons (the counting is worked out below)
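One way to arrive at the two counts above (the 15-edges-per-transition bookkeeping is my own):

```latex
\text{Exhaustive: } \binom{8}{4}\binom{4}{2} = 70 \times 6 = 420
\text{ dissimilarity evaluations.}
\qquad
\text{Lattice: } 3 \text{ transitions} \times 15 \text{ legal edges each} = 45
\text{ dissimilarity evaluations.}
```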
  • 33. Segment-Based Solution
  • 34. Experimental Results
  • 35. Incorporation of the Semantic Information
    • Conventional
      • The static summarized images are extracted in accordance with low level visual features
    • Disadvantage
      • It is difficult to grasp the main story without the support of semantically significant information
    • We present a semantics-based static video summarization
      • Each extracted image has an annotation
      • Related images are connected by edges
      • 'Who', 'what', 'where', and 'when' are used to organize all extracted images
  • 36. The Proposed Architecture
    • Shot annotation: mapping visual content to text
    • Concept expansion: it provides an alternative view and dependency information when measuring the relation between two annotations
    • Relational graph construction
  • 37. Concept Tree Construction
    • The concept tree denotes the dependent structure of the expanded words
    • Meronym
      • 'Wheel' is a meronym of 'automobile'
    • Holonym
      • 'Tree' is a holonym of 'bark', 'trunk', and 'limb'
    • 'Pencil' used for 'Draw'
    • 'Salesperson' location of 'Store'
    • 'Motorist' capable of 'Drive'
    • 'Eat breakfast' effect of 'Full stomach'
  • 38. Concept Tree Reorganization
    • Who: names of people, a subset of "person" in WordNet
    • Where: "social group," "building," and "location" in WordNet
    • What: all other words that do not belong to "who" or "where"
    • When: identified by searching for time-period phrases
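A minimal sketch of this who/where/what bucketing using NLTK's WordNet interface; the specific category synsets (person.n.01, social_group.n.01, building.n.01, location.n.01) are my reading of the slide, and the 'when' case (time-period phrases) is not handled here:

```python
# Sketch: bucket an annotation word into who / where / what by checking whether
# any noun sense of the word has a category synset among its hypernyms.
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

WHO_ROOTS = {wn.synset('person.n.01')}
WHERE_ROOTS = {wn.synset('social_group.n.01'), wn.synset('building.n.01'),
               wn.synset('location.n.01')}

def categorize(word):
    reachable = set()
    for sense in wn.synsets(word, pos=wn.NOUN):
        reachable.add(sense)
        reachable.update(sense.closure(lambda s: s.hypernyms()))
    if reachable & WHO_ROOTS:
        return 'who'
    if reachable & WHERE_ROOTS:
        return 'where'
    return 'what'   # everything that is neither "who" nor "where"

# e.g. categorize('teacher') -> 'who', categorize('church') -> 'where'
```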
  • 39. Relational Graph Construction -- Relation of Two Concept Trees
    • The relation of the two concept trees
    • The relation of the two roots
    • The relation of the two children
  • 40. Relational Graph Construction -- Remove Unimportant Vertices and Edges
    • Remove edges with smaller weights, i.e., weaker relations
    • Remove vertices with smaller term frequency–inverse document frequency (TF-IDF) scores
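A minimal sketch of this pruning step; the graph representation, the TF-IDF formulation, and both thresholds are illustrative assumptions, not taken from the slides:

```python
# Sketch: prune the relational graph by dropping low-weight edges and vertices
# whose annotation terms have low TF-IDF over the set of shot annotations.
import math

def tfidf(term, doc, docs):
    """doc, docs[i]: token lists, one per annotated shot."""
    tf = doc.count(term) / max(len(doc), 1)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / (1 + df))

def prune_graph(vertices, edges, annotations, edge_thr=0.3, vertex_thr=0.05):
    """vertices: annotation terms; edges: {(u, v): relation weight}."""
    score = {v: max(tfidf(v, ann, annotations) for ann in annotations)
             for v in vertices}
    kept_v = {v for v in vertices if score[v] >= vertex_thr}
    kept_e = {e: w for e, w in edges.items()
              if w >= edge_thr and e[0] in kept_v and e[1] in kept_v}
    return kept_v, kept_e
```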
  • 41. The Final Relational Graph
    • Comparison with conventional storyboard
  • 42. Conclusion
    • A novel speaker segmentation criterion is proposed
      • Misclassification error rate
    • The unsupervised speaker segmentation problem is solved by supervised classification with label-forcing
    • The discriminant classifier allows the proposed approach to use a smaller scanning window
      • The ability to detect short speaker-change segments increases
    • Two new static video summarization approaches are proposed
    • Lattice-based representative frame extraction
      • Uses only low-level visual features
      • The summarization ratio is adjustable
      • Under a given summarization ratio, the dissimilarity sum between sequentially adjacent images is maximized
    • Concept-organized representative frame extraction
      • Incorporating semantic information
      • Mining the four kinds of concept entities: who, what, where, and when
      • Viewers can efficiently grasp the overall structure of the story and understand the main points of the content
  • 43. Future Work
    • Multimedia segmentation
      • Speech segmentation
      • Audio segmentation
      • Video segmentation
    • Multimedia summarization
      • Video summarization
        • Static, dynamic
      • Speech summarization
      • Audio summarization
  • 44. Thank you all for your attendance!