  • 1. Multimedia Segmentation and Summarization Dr. Jia-Ching Wang Honorary Fellow, ECE Department, UW-Madison
  • 2. Outline
    • Introduction
    • Speaker Segmentation
    • Video Summarization
    • Conclusion
  • 3. What is Multimedia?
    • Image
    • Video
    • Speech
    • Audio
    • Text
  • 4. Multimedia Everywhere
    • Fax machines: transmission of binary images
    • Digital cameras: still images
    • iPod / iPhone & MP3
    • Digital camcorders: video sequences with audio
    • Digital television broadcasting
    • Compact disk (CD), Digital video disk (DVD)
    • Personal video recorder (PVR, TiVo)
    • Images on the World Wide Web
    • Video streaming, video conferencing
    • Video on cell phones, PDAs
    • High-definition televisions (HDTV)
    • Medical imaging: X-ray, MRI, ultrasound
    • Military imaging: multi-spectral, satellite, microwave
  • 5. What is Multimedia Content?
    • Multimedia content: the syntactic and semantic information inherent in digital material.
    • Example: text document
      • Syntactic content: chapter, paragraph
      • Semantic content: key words, subject, types of text document, etc.
    • Example: video document
      • Syntactic content: scene cuts, shots
      • Semantic content: motion, summary, index, caption, etc.
  • 6. Why Do We Need to Know Multimedia Content?
    • Information processing (archiving, indexing, delivering, accessing, and other operations) requires in-depth knowledge of the content to optimize performance
  • 7. How to Know Multimedia Content?
    • Multimedia content analysis
      • The computerized understanding of the semantic/syntactic content of a multimedia document
    • Multimedia content analysis usually involves
      • Segmentation
        • Segmenting the multimedia document into units
      • Classification
        • Classifying each unit into a predefined type
      • Annotation
        • Annotating the multimedia document
      • Summarization
        • Summarizing the multimedia document
  • 8. Multimedia Segmentation and Summarization
    • Multimedia segmentation
      • Syntactic content
    • Multimedia summarization
      • Semantic/syntactic content
    • The result of temporal segmentation can benefit video summarization
  • 9. Multimedia Segmentation
    • Image segmentation
    • Video segmentation
      • Scene change, shot change
    • Audio segmentation
      • Audio class change
    • Speech segmentation
      • Speaker change detection
    • Text Segmentation
      • word segmentation, sentence segmentation, topic change detection
  • 10. Multimedia Summarization
    • Image summarization
      • Region of interest
    • Video summarization
      • Storyboard, highlight
    • Audio summarization
      • Main theme in music, chorus in a song, event sounds in an environmental sound stream
    • Speech summarization
      • Speech abstract
    • Text summarization
      • Abstract
  • 11. What is Speaker Segmentation?
    • Speaker segmentation is also called speaker change detection (SCD)
    • Assumption: there is no overlap between any two speakers' streams
    (Illustration: an audio stream partitioned into consecutive speaker1 | speaker2 | speaker3 segments)
  • 12. Supervised vs. Unsupervised SCD
    • Supervised manner: acoustic data are made up of distinct speakers who are known a priori
      • Recognition based solution
    • Unsupervised manner: no prior knowledge about the number and identities of speakers
      • Metric-based criterion
      • Model selection-based criterion
  • 13. Supervised Speaker Segmentation -- Gaussian Mixture Model
    • Gaussian mixture modeling (GMM)
    • Incoming audio stream is classified into one of D classes in a maximum likelihood manner at time t
    x is a d-dimensional random vector; w_i, i = 1, ..., M, are the mixture weights; μ_i the mean vectors; and Σ_i the covariance matrices. The class-conditional density is p(x) = Σ_{i=1}^{M} w_i N(x; μ_i, Σ_i).
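The maximum-likelihood classification described above can be sketched as follows. This is a minimal illustration with invented toy parameters (diagonal covariances, two hypothetical speaker models); in practice each class GMM would be trained from that speaker's audio data.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance GMM.
    weights[i] is w_i, means[i] is mu_i, variances[i] is the diagonal
    of Sigma_i."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = -0.5 * sum(
            math.log(2 * math.pi * v) + (xd - m) ** 2 / v
            for xd, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

def classify(x, class_models):
    """Maximum-likelihood classification of x among the D class GMMs."""
    return max(class_models, key=lambda name: gmm_log_likelihood(x, *class_models[name]))

# Toy 2-D models with made-up parameters (weights, means, variances)
models = {
    "speaker1": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "speaker2": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
print(classify([4.8, 5.1], models))  # speaker2
```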
  • 14. Supervised Speaker Segmentation -- Hidden Markov Model
  • 15. Unsupervised Speaker Segmentation -- Sliding Window Strategy & Detection Criterion
    • Metric-based criterion (measures the dissimilarity between acoustic feature vectors)
      • Kullback-Leibler distance
      • Mahalanobis distance
      • Bhattacharyya distance
    • Model selection-based criterion
      • Bayesian information criterion (BIC)
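A minimal sketch of the metric-based criterion: fit a single Gaussian to each of two adjacent sliding windows and compute their symmetric Kullback-Leibler distance. One-dimensional features are used here for brevity; a real system would use multivariate cepstral features and a tuned decision threshold.

```python
import math

def gauss_fit(xs):
    """Single-Gaussian fit (mean, variance) of a 1-D feature window."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, max(var, 1e-9)  # floor avoids log/divide-by-zero

def sym_kl_distance(left, right):
    """Symmetric Kullback-Leibler distance between Gaussian fits of two
    adjacent analysis windows; large values suggest a change point."""
    mu0, v0 = gauss_fit(left)
    mu1, v1 = gauss_fit(right)
    kl01 = 0.5 * (math.log(v1 / v0) + (v0 + (mu0 - mu1) ** 2) / v1 - 1.0)
    kl10 = 0.5 * (math.log(v0 / v1) + (v1 + (mu1 - mu0) ** 2) / v0 - 1.0)
    return kl01 + kl10

same = sym_kl_distance([0.1, -0.2, 0.0, 0.15], [0.05, -0.1, 0.2, -0.05])
diff = sym_kl_distance([0.1, -0.2, 0.0, 0.15], [3.0, 3.2, 2.9, 3.1])
print(same < diff)  # True: distinct windows score a much larger distance
```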
  • 16. Bayesian Information Criterion
    • Model selection
      • Choose one among a set of candidate models M_i, i = 1, 2, ..., m, and corresponding model parameters to represent a given data set D = (D_1, D_2, ..., D_N).
    • Model Posterior Probability
    • Bayesian information criterion
      • Maximized log data likelihood for the given model, with a model-complexity penalty
      • Bayesian information criterion of model M_i: BIC(M_i) = log L(D | M_i) - (λ/2) d_i log N, where d_i is the number of independent parameters in the model parameter set
  • 17. Unsupervised Segmentation Using Bayesian Information Criterion
    • First model M_1 (no change): x_1, ..., x_N ~ N(μ, Σ)
    • Second model M_2 (change at frame i): x_1, ..., x_i ~ N(μ_1, Σ_1); x_{i+1}, ..., x_N ~ N(μ_2, Σ_2)
    • Bayesian information criterion: ΔBIC(i) = (N/2) log|Σ| - (i/2) log|Σ_1| - ((N-i)/2) log|Σ_2| - (λ/2)(d + d(d+1)/2) log N; a change point at i is accepted when ΔBIC(i) > 0
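The ΔBIC test above can be sketched for 1-D features and λ = 1; with d = 1 the two-segment model adds two free parameters (a mean and a variance), so the penalty is λ log N. The feature values below are invented for illustration.

```python
import math

def delta_bic(seq, i, lam=1.0):
    """Delta-BIC for a hypothesized change point at index i of a 1-D
    feature sequence. Positive values favour the two-segment model,
    i.e. a speaker change at i."""
    def log_var(xs):
        mu = sum(xs) / len(xs)
        return math.log(max(sum((x - mu) ** 2 for x in xs) / len(xs), 1e-12))
    n, n1, n2 = len(seq), i, len(seq) - i
    gain = 0.5 * (n * log_var(seq) - n1 * log_var(seq[:i]) - n2 * log_var(seq[i:]))
    penalty = lam * math.log(n)  # (lam/2) * 2 extra parameters * log N
    return gain - penalty

# A stream whose statistics jump at frame 20
change = [0.0, 0.2, -0.2, 0.1, -0.1] * 4 + [5.0, 5.2, 4.8, 5.1, 4.9] * 4
print(delta_bic(change, 20) > 0)  # True: the change point is accepted
```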
  • 18. Disadvantages of Conventional Unsupervised Speaker Change Detection
    • For metric-based methods, it is not easy to choose a suitable threshold
    • For BIC, it is hard to detect speaker segments shorter than 2 seconds
  • 19. Proposed Method -- Misclassification Error Rate
    • Sliding window pairs
    • Feature vector distribution
    (Figure: feature-vector distributions of the window pair; the distributions overlap for the same speaker and separate for different speakers)
  • 20. Mathematical Analysis
  • 21. Mathematical Analysis
  • 22. Discussion
    • Generative and discriminant classifiers are both applicable
    • Key point: discriminant classifiers require less training data
      • We can use a smaller scanning window size
      • The ability to detect short speaker-change segments increases
  • 23. Speaker Segmentation Using Misclassification Error Rate
    • Steps
      • Preprocessing
        • Framing, Feature extraction
      • Hypothesized speaker change point selection
      • Forcing 2-class labels
        • Training a discriminant hyperplane
      • Inside data recognition & calculating misclassification error rate
      • Accept/reject the hypothesized speaker change point
    • Significance
      • The unsupervised speaker segmentation problem is solved by supervised classification
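The core of the method can be sketched as follows, with a nearest-mean rule standing in for the trained discriminant hyperplane and 1-D features for brevity (both are simplifications of the steps listed above): force labels on the two windows, classify the same inside data, and use the resulting error rate as the change criterion.

```python
def misclassification_rate(left, right):
    """Force label 0 on the left window and label 1 on the right window,
    fit a simple nearest-mean discriminant, then measure the error rate
    on the same (inside) data. A low rate means the forced labels are
    consistent with the data, supporting the hypothesized change point."""
    mu_l = sum(left) / len(left)
    mu_r = sum(right) / len(right)
    errors = sum(1 for x in left if abs(x - mu_l) > abs(x - mu_r))
    errors += sum(1 for x in right if abs(x - mu_r) > abs(x - mu_l))
    return errors / (len(left) + len(right))

# Different speakers: the windows separate cleanly, so the rate is low
print(misclassification_rate([0.0, 0.1, -0.1, 0.2], [3.0, 3.1, 2.9, 3.2]))  # 0.0
# Same speaker: the forced labels contradict the data, so the rate is high
print(misclassification_rate([0.0, 0.1, -0.1, 0.2], [0.05, -0.05, 0.15, 0.1]))
```

The hypothesized change point is accepted when the rate falls below a threshold, rejected otherwise.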
  • 24. Experimental Results
    Method     F-score   Precision   Recall
    BIC        63.3      54.4        75.7
    Proposed   71.8      70.2        81.3
  • 25. Video Summarization
    • Dynamic vs. Static Video Summarization
      • Dynamic video summarization
        • Sport highlight, movie trailer
      • Static video summarization
        • Storyboard
          • Visual-based approach
          • Incorporation of semantic information
  • 26. Static Video Summarization -- Visual Based Approach
    • Example
    • Problem
      • Is the summarization ratio adjustable?
      • How to generate effective storyboard under a given summarization ratio?
  • 27. How to Generate Effective Storyboard
    • Question: Assume there are n frames and the summarization ratio is r/n. How do we select the best r frames?
    • Complexity:
      • There are C(n, r) different choices
  • 28. How to Generate Effective Storyboard
    • From a visual viewpoint
      • The most visually distinct frames should be extracted
      • Dissimilarity between two frames is measured by low-level visual features
    • How to select the best r frames from n frames
      • Solution: maximize the overall pairwise dissimilarities
      • Complexity: C(n, r) x C(r, 2)
      • Infeasible: C(n, r) is usually huge
    • Fact
      • Human beings usually browse a storyboard sequentially
    • Optimal solution in a sequential sense
      • Maximize the sum of dissimilarities between sequentially adjacent images in the storyboard
  • 29. How to Maximize the Dissimilarity Sum of the Extracted Images
    • Lattice-based representative frame extraction approach
      • Extract key component from temporal sequence
      • Dynamic programming can be applied
    • Example: how to select the best 4 images from an 8-image sequence
  • 30. How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images
    • Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8)
    • Extracted images: E(1), E(2), E(3), E(4)
    • E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l), where i < j < k < l
    • Each legal left-to-right path represents a way to extract images
    • Each transition contributes an adjacent dissimilarity
    • In this example, the adjacent dissimilarity sum of the extracted images is D[O(1),O(3)] + D[O(3),O(4)] + D[O(4),O(7)]
  • 31. How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images
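The lattice search can be sketched with dynamic programming. This is an illustrative implementation under simplified assumptions: frames are 1-D feature values and the dissimilarity is their absolute difference, where a real system would compare low-level visual features such as color histograms.

```python
def select_frames(dissim, n, r):
    """Select r of n frames maximizing the sum of dissimilarities between
    sequentially adjacent selected frames. dissim(i, j) scores original
    frames i < j. Returns (best_sum, selected_indices)."""
    NEG = float("-inf")
    # best[k][j]: best score with k+1 frames selected and frame j last
    best = [[NEG] * n for _ in range(r)]
    back = [[-1] * n for _ in range(r)]
    for j in range(n - r + 1):  # the first frame must leave room for r-1 more
        best[0][j] = 0.0
    for k in range(1, r):
        for j in range(k, n - (r - 1 - k)):
            for i in range(k - 1, j):
                if best[k - 1][i] == NEG:
                    continue
                s = best[k - 1][i] + dissim(i, j)
                if s > best[k][j]:
                    best[k][j], back[k][j] = s, i
    last = max(range(r - 1, n), key=lambda j: best[r - 1][j])
    score, path, j = best[r - 1][last], [last], last
    for k in range(r - 1, 0, -1):  # follow back-pointers to recover the path
        j = back[k][j]
        path.append(j)
    return score, path[::-1]

# Toy example: 1-D frame features, absolute difference as dissimilarity
frames = [0.0, 1.0, 1.0, 5.0, 5.0, 9.0]
score, picked = select_frames(lambda i, j: abs(frames[i] - frames[j]), 6, 3)
print(score)  # 9.0
```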
  • 32. Complexity Comparison
    • Select 4 images from an 8-image sequence
      • Lattice-based approach
        • 45 dissimilarity comparisons
      • Optimal approach
        • 420 dissimilarity comparisons
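Both counts on this slide can be reproduced from the combinatorics, assuming the lattice counts one comparison per edge between consecutive stages:

```python
from math import comb

n, r = 8, 4
# Optimal approach: every C(n, r) choice scored over C(r, 2) frame pairs
optimal = comb(n, r) * comb(r, 2)
# Lattice: r - 1 stage transitions; stage k holds frames k .. k + (n - r),
# so each transition contributes 1 + 2 + ... + (n - r + 1) edges
lattice = (r - 1) * sum(range(1, n - r + 2))
print(lattice, optimal)  # 45 420
```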
  • 33. Segment-Based Solution
  • 34. Experimental Results
  • 35. Incorporation of the Semantic Information
    • Conventional
      • The static summarized images are extracted in accordance with low level visual features
    • Disadvantage
      • It is difficult to catch the main story without the support of semantically significant information
    • We present a semantic based static video summarization
      • Each extracted image has an annotation
      • Related images are connected by edges
      • Using 'who', 'what', 'where', and 'when' to list all extracted images
  • 36. The Proposed Architecture
    • Shot annotation: mapping visual content to text
    • Concept expansion: provides an alternative view and dependency information when measuring the relation between two annotations.
    • Relational graph construction
  • 37. Concept Tree Construction
    • The concept tree denotes the dependency structure of the expanded words
    • Meronym
      • 'Wheel' is a meronym of 'automobile'
    • Holonym
      • 'Tree' is a holonym of 'bark', of 'trunk', and of 'limb'
    • Other relations
      • 'Pencil' used for 'Draw'
      • 'Salesperson' location of 'Store'
      • 'Motorist' capable of 'Drive'
      • 'Eat breakfast' effect of 'Full stomach'
  • 38. Concept Tree Reorganization
    • Who: names of people, a subset of "person" in WordNet
    • Where: "social group," "building," and "location" in WordNet
    • What: all other words that do not belong to "who" or "where"
    • When: searching for time-period phrase
  • 39. Relational Graph Construction -- Relation of Two Concept Trees
    • The relation of the two concept trees
    • The relation of the two roots
    • The relation of the two children
  • 40. Relational Graph Construction -- Remove Unimportant Vertices and Edges
    • Remove edges with smaller weights, i.e., weaker relations
    • Remove vertices with smaller term frequency – inverse document frequency (TF-IDF)
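The vertex-pruning criterion can be sketched with a standard TF-IDF computation, treating each shot annotation as a document (the token lists below are invented for illustration):

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in one shot annotation, computed over the
    collection of all annotations (each a list of tokens)."""
    tf = doc.count(term) / len(doc)            # term frequency in this annotation
    df = sum(1 for d in docs if term in d)     # number of annotations containing it
    idf = math.log(len(docs) / df)             # inverse document frequency
    return tf * idf

annotations = [["car", "wheel", "road"], ["car", "driver"], ["tree", "bark"]]
# "wheel" appears in one annotation and "car" in two, so "wheel" scores
# higher: common terms are pruned first
print(tf_idf("wheel", annotations[0], annotations) >
      tf_idf("car", annotations[0], annotations))  # True
```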
  • 41. The Final Relational Graph
    • Comparison with conventional storyboard
  • 42. Conclusion
    • A novel speaker segmentation criterion is proposed
      • Misclassification error rate
    • The unsupervised speaker segmentation problem is solved by supervised classification with label-forcing
    • The discriminant classifier allows the proposed approach to use a smaller scanning window size
      • The ability to detect short speaker change segment increases
    • Two new static video summarization approaches are proposed
    • Lattice-based representative frame extraction
      • Merely using low level visual features
      • The summarization ratio is adjustable
      • Under a given summarization ratio, the dissimilarity sum between sequentially adjacent images is maximized
    • Concept-organized representative frame extraction
      • Incorporating semantic information
      • Mining the four kinds of concept entities: who, what, where, and when
      • People can efficiently grasp the comprehensive structure of the story and understand the main points of the contents
  • 43. Future Work
    • Multimedia segmentation
      • Speech segmentation
      • Audio segmentation
      • Video segmentation
    • Multimedia summarization
      • Video summarization
        • Static, dynamic
      • Speech summarization
      • Audio summarization
  • 44. Thank you all for your attendance!