

  1. 1. Introduction to MPEG-7. Guest lecture for ECE417. TSH, Charlie Dagli. April 7, 2009
  2. 2. Contents. This lecture: a general idea of MPEG-7. –Background –Introduction –Components of MPEG-7: Description Definition Language (DDL), Multimedia Description Scheme (MDS), Video Descriptors, Audio Descriptors –References
  3. 3. Background. Search and retrieval of multimedia data: – In recent years the amount of available audiovisual (AV) data has grown enormously. – Applications: large-scale multimedia search engines on the Web; media asset management systems in corporations; AV broadcast servers; personal media servers. – Need: retrieval, search, and storage of AV data using higher-level concepts. – A solution: efficient processing tools that create descriptions of AV material or support the identification and retrieval of AV documents. – Beyond the research activity on processing tools, the need for interoperability between devices has been recognized, and standardization activities have been launched. MPEG-7, the “Multimedia Content Description Interface”, standardizes the description of multimedia content to support a wide range of applications. MPEG stands for Moving Picture Experts Group (established 1988).
  4. 4. Introduction: What is MPEG-7? “Multimedia Content Description Interface”. –Intuition: do NOT focus so much on processing tools; concentrate on selecting the features that have to be described; find a way to structure and instantiate the selected features with a common language. –Efficient representation of audiovisual (AV) metadata. –Goal: allow interoperable searching, indexing, filtering and access of multimedia content by enabling interoperability among devices that deal with multimedia content description.
  5. 5. MPEG-7 Main Elements. Descriptor (D): standardized “audio only” and “visual only” descriptors. Example: a time code for duration, color histograms for color. Multimedia Description Scheme (MDS): standardized description schemes for audio and visual descriptors. Example: a video, temporally structured as scenes and shots, with textual descriptors at the scene level and color, motion, and audio amplitude descriptors at the shot level. Description Definition Language (DDL): provides a standardized language to express description schemes; based on XML (eXtensible Markup Language); allows the creation of new description schemes and, possibly, descriptors, as well as the extension and modification of existing description schemes.
  6. 6. What can MPEG-7 do? The increasing availability of potentially interesting audiovisual material makes search more difficult. The goal is a search system in which any type of AV material can be retrieved by means of any type of query material, such as video, music, or speech. – Some query examples: Music: play a few notes on a keyboard and get in return a list of musical pieces containing the required tune, or images somehow matching the notes. Image: define objects, including color patches or textures, and get in return examples from which you select the interesting objects to compose your image. Voice: using an excerpt of Pavarotti’s voice, get a list of Pavarotti’s records, video clips where Pavarotti is singing, or video clips where Pavarotti is present. Sports video analysis: can be solved more easily and with better results.
  7. 7. Application Areas  Application domains listed in the MPEG-7 Applications document: – Education – Journalism (e.g. searching speeches of person using his name, his voice or his face) – Tourist information – Cultural services (museum, art gallery, digital library) – Entertainment (searching a game, karaoke) – Investigation services (human characteristics recognition) – Geographical information systems – Remote sensing – Surveillance (traffic control, surface transportation) – Shopping – Architecture, real estate, and interior design – Social (Dating Service) – Film, Video and Radio archives. …….. – Audiovisual content production 7
  8. 8. MPEG-7 vs. previous MPEG activities. MPEG-1, 2, and 4 are designed to represent the information itself, while MPEG-7 is meant to represent information about the information. MPEG-1, 2, and 4 made content available; MPEG-7 allows you to find the content you need. MPEG-7 can also be used independently of the other MPEG standards: the description might even be attached to an analog movie.
  9. 9. MPEG-7 Parts. ISO/IEC 15938-1 (Systems): the binary format for encoding MPEG-7 descriptions and the terminal architecture. ISO/IEC 15938-2 (Description Definition Language): the language for defining the syntax of the MPEG-7 Description Tools and for defining new Description Schemes. ISO/IEC 15938-3 (Visual): the Description Tools dealing with visual descriptions. ISO/IEC 15938-4 (Audio): the Description Tools dealing with audio descriptions. ISO/IEC 15938-5 (Multimedia Description Schemes): the Description Tools dealing with generic features and multimedia descriptions. ISO/IEC 15938-6 (Reference Software): a software implementation of relevant parts of the MPEG-7 Standard with normative status. ISO/IEC 15938-7 (Conformance Testing): guidelines and procedures for testing conformance of MPEG-7 implementations. ISO/IEC TR 15938-8 (Extraction and use of descriptions): informative material (in the form of a Technical Report) about the extraction and use of some of the Description Tools.
  10. 10. Next… Description Definition Language (DDL) 10
  11. 11. Description Definition Language (DDL). The foundation of the MPEG-7 standard: it provides the language for defining the structure and content of multimedia information. A schema language to represent the results of modeling audiovisual data (i.e. descriptors and description schemes) as a set of syntactic, structural and value constraints to which valid MPEG-7 descriptors, description schemes, and descriptions must conform. It also provides the rules by which users can combine, extend, and refine existing description schemes and descriptors. Based on XML. Example: <PersonName> <Title>Prof.</Title> <Firstname>Thomas</Firstname> <Lastname>Huang</Lastname> <Nickname>Tom</Nickname> </PersonName>
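Because DDL descriptions are plain XML, the PersonName example on the DDL slide can be read with any XML parser. The following is a minimal sketch using Python's standard library; the helper name `read_person_name` is hypothetical, and the element names are taken from the slide rather than checked against the MPEG-7 schema.

```python
import xml.etree.ElementTree as ET

# The instance document from the slide, with whitespace normalised.
PERSON_XML = """
<PersonName>
  <Title>Prof.</Title>
  <Firstname>Thomas</Firstname>
  <Lastname>Huang</Lastname>
  <Nickname>Tom</Nickname>
</PersonName>
"""


def read_person_name(xml_text):
    """Return the child elements of a PersonName description as a dict.

    Hypothetical helper for illustration; no schema validation is done.
    """
    root = ET.fromstring(xml_text.strip())
    if root.tag != "PersonName":
        raise ValueError("expected a PersonName element")
    return {child.tag: (child.text or "").strip() for child in root}


person = read_person_name(PERSON_XML)
print(person["Title"], person["Firstname"], person["Lastname"])
```

A real application would validate the instance against the DDL schema first; this sketch only shows that the description is ordinary, machine-readable XML.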
  12. 12. Next…Multimedia Description Schemes (MDS) 12
  13. 13. Multimedia Description Schemes (MDS)  An overview of the organization of MPEG-7 MDS : Organized in 6 Areas, Basic Elements, Content Descriptions, Content Organization, Content management, Navigation and Access, and User Interaction 13
  14. 14. MDS: Basic Elements Basic Elements – fundamental constructs of the definition of MPEG-7 description schemes –Schema Tools :  facilitate the creation of valid MPEG-7 descriptions and packing.. –Basic Data types :  Integer & Real – represent constrained integer and real value  Vectors & Matrix – represent arbitrary sized vectors and matrices of integer or real values  Probability Vectors & Matrices – represent probability distribution described using vectors/matrices  String – represents codes identifying content type, countries, regions, currencies, and character sets –Linking, Identification and Localization Tools :  tools for referencing MPEG-7 descriptions, for linking descriptions to multimedia content and for describing time in multimedia content 14
  15. 15. MDS: Basic Elements –Example: Three kinds of media time representation: t1 t2 Duration TimeBase RelTimePoint  A) Simple time: Specify a time point and a duration  B) Relative time: Specify a media time point relative to a time base, and a duration  C) Incremental time: Specification of time using a predefined interval called Time Unit and counting the number of intervals (efficient for periodic signals) 15
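The three media time representations differ only in how the start point and duration are expressed. The sketch below models them with hypothetical class names (`SimpleTime`, `RelativeTime`, `IncrementalTime`); these are not MPEG-7 type names. It converts the relative and incremental forms to a simple time point and duration in seconds.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class SimpleTime:
    """A) A time point and a duration, both in seconds."""
    start: Fraction
    duration: Fraction


@dataclass
class RelativeTime:
    """B) A time point given relative to a time base, plus a duration."""
    time_base: Fraction   # absolute position of the time base, in seconds
    offset: Fraction      # time point relative to the time base
    duration: Fraction

    def to_simple(self):
        return SimpleTime(self.time_base + self.offset, self.duration)


@dataclass
class IncrementalTime:
    """C) Time counted in multiples of a predefined time unit."""
    time_unit: Fraction   # length of one interval, in seconds
    start_count: int      # number of intervals before the start point
    duration_count: int   # number of intervals covered

    def to_simple(self):
        return SimpleTime(self.start_count * self.time_unit,
                          self.duration_count * self.time_unit)


# Assumed example: 25 frames per second, starting at frame 250 for 50 frames.
clip = IncrementalTime(Fraction(1, 25), 250, 50).to_simple()
print(clip.start, clip.duration)
```

Using `Fraction` keeps frame-based times exact, which is the practical appeal of incremental time for periodic signals such as video frames.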
  16. 16. MDS: Basic Elements. Basic Description Tools: a library of description schemes and data types, used as primitive components for building the more complex and functionality-specific description tools found in the rest of MPEG-7.
      – Graph and relation tools: weave together complex multimedia description structures. Example: <Graph> <Node id="A"/> <Node id="B"/> <Node id="C"/> <Node id="D"/> <Node id="E"/> <Relation type="#r1" source="#A" target="#B"/> <Relation type="#r2" source="#A" target="#C"/> ………….. </Graph> (Figure: graph with nodes A to E connected by relations r1 to r4.)
      – Textual annotations: represent textual descriptions. Free text annotation: Spain scores a goal against Sweden. Keyword annotation: score, Sweden, Spain.
      – Classification schemes and terms: define and reference vocabularies for multimedia content descriptors. Example: part of a ClassificationScheme for sports: sports, with the terms soccer, basketball, baseball, tennis.
  17. 17. MDS: Basic Elements. People and locations: represent people and places related to multimedia content. – Agent: persons, organizations, groups of persons,… Example: <PersonGroup> <Name>Spanish National Soccer Team</Name> <Kind><Name>Soccer Team</Name></Kind> <Member> <Name>Fernando</Name> </Member> <Member> …. </PersonGroup> – Places: existing, historical, and fictional places. Affective description: describes the emotional response to multimedia content. – Example: recording an audience’s excitement while watching an action movie. Ordering tools: – Provide a hint for ordering descriptions for presentation, based on information contained in those descriptions. – Example: order a set of video segments in a soccer game by the amount of camera zoom within each segment.
  18. 18. Content management and content description 18
  19. 19. MDS: Content Management. Content management: the description of the life cycle of the content, from creation to consumption. – Creation and Production Description: includes title, textual annotation, creators, creation locations, dates, how the data is classified, review and guidance information, and related multimedia material. – Usage Description: describes information related to usage rights, usage record, and financial information. Rights information is not explicitly included in the description, but links are provided to the rights holders or rights management. The usage record description provides information related to the use of the content, such as broadcasting or on-demand delivery. Financial information covers the cost of production and the income resulting from content use. The usage description is dynamic and subject to change during the lifetime of the multimedia content. – Media Description: describes the storage media, in particular the compression, coding, and storage format of the multimedia content. It describes the master media, the original source from which different instances of the multimedia content are produced.
  20. 20. Content management and content description 20
  21. 21. MDS: Structural Content Description. Content Description: structural and conceptual aspects. – Structure Description: describes the structure of multimedia content, built around the notion of the Segment Description Scheme, which represents a spatial, temporal, or spatiotemporal portion of the multimedia content. Segment DSs are the core element. – Example: Mosaic DS, a panoramic view of a video segment constructed by aligning and warping the frames of a Video Segment onto each other.
  22. 22. MDS: Structural Content Description  Specific features for structural data description Feature Video Still Moving Audio Segment Region Region Segment Time X X X Shape X X Color X X X Texture X Motion X X Camera X motion Audio X X X features 22
  23. 23. MDS: Structural Content Description  Examples of Image description with Still Regions 23
  24. 24. MDS: Conceptual Content Description. Conceptual aspects: describe the multimedia content from the viewpoint of real-world semantics and conceptual notions. – Involve entities such as objects, events, abstract concepts and relationships. – Segment description schemes and semantic description schemes are related by a set of links that allows the multimedia content to be described on the basis of both content structure and semantics together.
  25. 25. MDS: Conceptual Content Description  Example of video segments and Regions Corresponding SegmentRelationship Graph 25
  26. 26. Navigation and access 26
  27. 27. MDS: Navigation and Access  Facilitating browsing and retrieval by defining summaries, views, and variations of the multimedia content.  Summaries: provide compact highlights of the multimedia content to enable discovering, browsing, navigation, and visualization of multimedia content. – Hierarchical navigation mode – Sequential navigation mode 27
  28. 28. MDS: Navigation and Access. Views: based on partitions and decompositions, which describe different decompositions of the multimedia signals in space, time, and frequency. The partitions and decompositions can be used as different views of the multimedia content; this is important for multi-resolution access and progressive retrieval. Variations: describe different variations of multimedia programs, such as summaries and abstracts; scaled, compressed, and low-resolution versions; and versions in different languages and modalities (audio, video, image, text, and so forth). They allow the selection of the most suitable variation of a multimedia program.
  29. 29. Content organization 29
  30. 30. MDS: Content Organization  Content Organization – tools describe collections and models – Collection: unordered sets of multimedia content, segments, descriptor instances, concepts or mixed sets of the above (Example of collections of AV content including the relationships (i.e. RAB,RBC,RAC) within and across Collection Clusters) Collection structure Content collection Segment collection Descriptor collection Collection (abstract) Concept collection Mixed collection 30
  31. 31. MDS: Content Organization. Model tools: parameterized representations of an instance or class of multimedia content, descriptors or collections, as follows:
      – Probability model: associates statistics or probabilities with the attributes of multimedia content, descriptors or collections
      – Analytic model: associates labels or semantics with multimedia content or collections
      – Cluster model: associates labels or semantics and statistics or probabilities with multimedia content collections
      – Classification model: describes information about known collections of multimedia content in terms of labels, semantics, and models that can be used to classify unknown multimedia content
      Model hierarchy: Model (abstract) has the subtypes Probability Model (Discrete distribution, Continuous distribution, Finite State Model), Analytic Model (Collection Model, Probability Model class), Cluster Model, and Classification Model (ClusterClassification Model, ProbabilityClassification Model).
  32. 32. MDS: Content Organization – Clusters of positive and negative examples of images are described using Cluster Model tool. – Soccer video sequence modeled using State Transition Model tool. 32
  33. 33. User Interaction 33
  34. 34. MDS: User Interaction  User interaction describes user preferences and usage history  Allow matching between user preferences and MPEG-7 content description  facilitate personalization of multimedia content access, presentation, and consumption. 34
  36. 36. Introduction: What is MPEG-7? “Multimedia Content Description Interface”. –Intuition: do NOT focus so much on processing tools; concentrate on selecting the features that have to be described; find a way to structure and instantiate the selected features with a common language. –Provide a way to get information about the audiovisual (AV) data without needing to decode the data itself. –Goal: allow interoperable searching, indexing, filtering and access of multimedia content by enabling interoperability among devices that deal with multimedia content description.
  37. 37. MPEG-7 Main Elements. Descriptor (D): provides standardized “audio only” and “visual only” descriptors. Example: a time code for duration, color histograms for color. Multimedia Description Scheme (MDS): provides standardized description schemes involving both audio and visual descriptors. Example: a movie, temporally structured as scenes and shots, with textual descriptors at the scene level and color, motion, and audio amplitude descriptors at the shot level. Description Definition Language (DDL): provides a standardized language to express description schemes; based on XML (eXtensible Markup Language); allows the creation of new description schemes and, possibly, descriptors, as well as the extension and modification of existing description schemes. Coding Schemes: compress MPEG-7 textual XML descriptions into a binary format (BiM) to satisfy application requirements for compression efficiency, error resilience, ...
  38. 38. Visual Descriptors  Cover 6 basic visual features as –Color –Texture –Shape –Motion –Localization –Face Recognition 38
  39. 39. Color descriptors. Color Space: defines the color components as continuous-value entities.
      – R, G, B
      – Y, Cb, Cr: Y = 0.299R + 0.587G + 0.114B; Cb = -0.169R - 0.331G + 0.500B; Cr = 0.500R - 0.419G - 0.081B
      – H, S, V (Hue, Saturation, Value): a nonlinear transform of RGB; quantized into 16, 32, 64, 128, or 256 bins for the scalable color descriptor and the frames histogram descriptor
      – HMMD (Hue, Max, Min, Diff, Sum): Max = max(R, G, B), indicating blackness; Min = min(R, G, B), indicating whiteness; Diff = Max - Min; Sum = (Max + Min) / 2
      – Linear transformation matrix with reference to R, G, B: any 3 x 3 color transform matrix that specifies the linear transformation between RGB and the respective color space
      – Monochrome: only the Y component of YCbCr is used
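The color space formulas on this slide translate directly into code. The sketch below assumes R, G, B values normalised to [0, 1]; the function names are hypothetical, and the HMMD hue is taken from Python's `colorsys` HSV conversion on the assumption that HMMD uses the same hue as HSV.

```python
import colorsys


def rgb_to_ycbcr(r, g, b):
    """Apply the Y, Cb, Cr equations given on the slide."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b
    cr = 0.500 * r - 0.419 * g - 0.081 * b
    return y, cb, cr


def rgb_to_hmmd(r, g, b):
    """Return (Hue, Max, Min, Diff, Sum) as defined on the slide.

    Hue is in degrees; it is assumed here to match the HSV hue.
    """
    mx = max(r, g, b)
    mn = min(r, g, b)
    hue = colorsys.rgb_to_hsv(r, g, b)[0] * 360.0
    return hue, mx, mn, mx - mn, (mx + mn) / 2


print(rgb_to_ycbcr(1.0, 1.0, 1.0))   # white: Y near 1, Cb and Cr near 0
print(rgb_to_hmmd(1.0, 0.0, 0.0))    # pure red
```

Only three of the HMMD components beyond Hue are independent (Diff and Sum follow from Max and Min), which is why the space is usually pictured as a double cone.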
  40. 40. Color Descriptors. –Color Quantization Descriptor: specifies the partitioning of the given color space into discrete bins. –Dominant Color Descriptor (DCD): allows specification of a small number of dominant color values together with their statistical properties, such as distribution and variance. It provides an effective and compact representation of the colors present in a region or an image. The DCD is defined as F = {(c_i, p_i, v_i), s}, i = 1, 2, ..., N, where N is the number of dominant colors; c_i is the i-th dominant color value, a vector of the corresponding color space component values; p_i is the fraction of pixels in the image corresponding to c_i; v_i is the variation of the color values of the pixels in a cluster around the corresponding representative color; and s is the spatial coherency, which represents the overall spatial homogeneity. (Figure: examples of low and high spatial coherency of color.)
  41. 41. Color Descriptors. –Scalable Color Descriptor: a Haar transform-based encoding scheme applied across the values of a color histogram in the HSV color space. Useful for image-to-image matching and retrieval based on color features. Its binary representation is scalable in terms of the number of bins and the bit representation accuracy over a broad range of data rates. –Group-of-Frames or Group-of-Pictures (GoF/GoP) Descriptor: a joint representation of color-based features for multiple images or multiple frames in a video segment. Traditionally, a key frame or image is selected from the group and the color-related features of the entire collection are represented by that sample, which is unreliable. The GoF/GoP descriptor instead uses histogram-based descriptors that reliably capture the color content of multiple images or video frames.
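The scalability of the Scalable Color Descriptor comes from the Haar transform: coarse coefficients describe a low-resolution histogram, and finer coefficients refine it. The following is a minimal sketch of an unnormalised 1-D Haar decomposition (pairwise sums and differences) over a histogram. It is an illustration only: the normative descriptor also fixes the HSV bin layout and applies nonlinear quantization, neither of which is reproduced here, and `haar_1d` is a hypothetical name.

```python
def haar_1d(values):
    """Unnormalised Haar decomposition of a power-of-two-length list.

    Returns [total sum, coarsest difference, ..., finest differences].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(values)
    details = []
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        sums = [approx[i] + approx[i + 1] for i in pairs]
        diffs = [approx[i] - approx[i + 1] for i in pairs]
        details = diffs + details   # coarser levels end up in front
        approx = sums
    return approx + details


histogram = [4, 2, 5, 5]            # toy 4-bin histogram
coeffs = haar_1d(histogram)
print(coeffs)                        # [16, -4, 2, 0]
```

Keeping only the first two coefficients (16 and -4) is enough to recover the 2-bin histogram [6, 10], which is the sense in which truncating the coefficient list yields a coarser but still usable descriptor.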
  42. 42. Color Descriptors. – Color Layout Descriptor (CLD): represents the spatial distribution of representative colors on a grid superimposed on a region or image. The representation is based on coefficients of the Discrete Cosine Transform. It is a very compact descriptor and highly efficient in fast browsing and search applications. – Color Structure Descriptor (CSD): based on a color histogram, but aims at identifying localized color distributions using a small structuring window. To guarantee interoperability, the CSD is bound to the HMMD color space. The CSD captures the degree to which the pixels of a color are clumped together relative to the scale of an associated structuring element. (Figure: examples of structured and unstructured color.)
  43. 43. Texture Descriptors. Homogeneous Texture Descriptor (HTD): – Provides a quantitative representation using 62 numbers, consisting of the mean energy and energy deviation from a set of frequency channels. – Useful for similarity retrieval. – Effective in characterizing homogeneous texture regions. Texture Browsing Descriptor (TBD): – Defined for coarse-level texture browsing. – Provides a perceptual characterization of texture, similar to human characterization, in terms of regularity, coarseness and directionality of the texture pattern. Edge Histogram Descriptor (EHD): – Captures the spatial distribution of edges in an image. – Useful in matching regions with partially varying, non-uniform texture.
  44. 44. Homogeneous Texture Descriptor. Homogeneous Texture Descriptor (HTD): characterizes the region texture using the mean energy and the energy deviation from a set of frequency channels. The 2D frequency plane is partitioned into 30 channels. (Figure: frequency layout for feature extraction.) The syntax of the HTD is: HTD = [fDC, fSD, e1, e2, ..., e30, d1, d2, ..., d30], where fDC and fSD are the mean and standard deviation of the input image, and ei and di are the nonlinearly scaled and quantized mean energy and energy deviation of the i-th channel.
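The HTD syntax is a flat vector of 2 + 30 + 30 = 62 values. A small sketch of packing and unpacking that layout follows; the function names are hypothetical, and the values are treated as plain numbers with no scaling or quantization applied.

```python
NUM_CHANNELS = 30  # frequency channels, as stated on the slide


def pack_htd(f_dc, f_sd, energies, deviations):
    """Build [fDC, fSD, e1..e30, d1..d30] from its components."""
    if len(energies) != NUM_CHANNELS or len(deviations) != NUM_CHANNELS:
        raise ValueError("expected 30 energies and 30 deviations")
    return [f_dc, f_sd, *energies, *deviations]


def unpack_htd(htd):
    """Split a 62-element HTD vector back into its components."""
    if len(htd) != 2 + 2 * NUM_CHANNELS:
        raise ValueError("an HTD vector has 62 elements")
    energies = htd[2:2 + NUM_CHANNELS]
    deviations = htd[2 + NUM_CHANNELS:]
    return htd[0], htd[1], energies, deviations


# Toy values only; real entries come from the channel filter outputs.
htd = pack_htd(128, 40, list(range(30)), list(range(100, 130)))
print(len(htd))
```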
  45. 45. Texture Browsing Descriptor. Texture browsing: perceptual characterization of a texture, similar to a human characterization, in terms of regularity, coarseness and directionality.
      TBD = [v1, v2, v3, v4, v5]
      – v1 ∈ {1, 2, 3, 4}, coded as {00, 01, 10, 11}: represents the regularity
      – v2, v3 ∈ {1, 2, 3, 4, 5, 6}: capture the directionality of the texture
      – v4, v5 ∈ {1, 2, 3, 4}: capture the coarseness of the texture
      Semantics of regularity:
      Regularity | Semantics
      00 | irregular
      01 | slightly regular
      10 | regular
      11 | highly regular
      (Figure: examples of regularity for codes 11, 01, 00 and 10.)
  46. 46. Edge Histogram Descriptor. Edge histogram: represents the local edge distribution in the image. Five types of edges give 5 histogram bins per sub-image; with sub-images indexed (0,0) to (3,3), this yields 80 bins.
      BinCounts[k] | Semantics
      BinCounts[0] | Vertical edges in sub-image (0,0)
      BinCounts[1] | Horizontal edges in sub-image (0,0)
      BinCounts[2] | 45 degree edges in sub-image (0,0)
      BinCounts[3] | 135 degree edges in sub-image (0,0)
      BinCounts[4] | Non-directional edges in sub-image (0,0)
      BinCounts[5] | Vertical edges in sub-image (0,1)
      … | …
      BinCounts[74] | Non-directional edges in sub-image (3,2)
      BinCounts[75] | Vertical edges in sub-image (3,3)
      BinCounts[76] | Horizontal edges in sub-image (3,3)
      BinCounts[77] | 45 degree edges in sub-image (3,3)
      BinCounts[78] | 135 degree edges in sub-image (3,3)
      BinCounts[79] | Non-directional edges in sub-image (3,3)
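The BinCounts table follows a regular pattern: sub-images are visited row by row, and each contributes five consecutive bins in the order vertical, horizontal, 45 degree, 135 degree, non-directional. A short sketch of the index arithmetic, with hypothetical function names, is:

```python
EDGE_TYPES = ("vertical", "horizontal", "45 degree", "135 degree",
              "non-directional")
GRID = 4  # sub-images per row and per column, indexed 0..3


def bin_index(row, col, edge_type):
    """Map a sub-image position and edge type to its BinCounts index."""
    if not (0 <= row < GRID and 0 <= col < GRID):
        raise ValueError("sub-image index out of range")
    return (row * GRID + col) * len(EDGE_TYPES) + EDGE_TYPES.index(edge_type)


def bin_semantics(k):
    """Inverse mapping: BinCounts index to (row, col, edge type)."""
    if not 0 <= k < GRID * GRID * len(EDGE_TYPES):
        raise ValueError("bin index out of range")
    sub_image, edge = divmod(k, len(EDGE_TYPES))
    row, col = divmod(sub_image, GRID)
    return row, col, EDGE_TYPES[edge]


print(bin_index(3, 2, "non-directional"))   # 74, matching the table
print(bin_semantics(5))                      # (0, 1, 'vertical')
```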
  47. 47. Shape Descriptors. – Region-based Shape Descriptor: expresses the pixel distribution within a 2-D object or region. It is based on both boundary and internal pixels, and can describe complex objects consisting of multiple disconnected regions as well as simple objects with or without holes. – Contour-based Shape Descriptor: based on the Curvature Scale Space (CSS) representation of the contour. – 3-D Shape Spectrum Descriptor: expresses characteristic features of objects represented as discrete polygonal 3-D meshes. It is based on the histogram of local geometrical properties of the 3-D surfaces of the object.
  48. 48. Shape Descriptors. –The region-based shape descriptor uses a set of ART (Angular Radial Transform) coefficients. Twelve angular and three radial functions are used (n < 3, m < 12). F_nm is the ART coefficient of order n and m, V_nm is the ART basis function, and f is the image function. The basis function is separable along the angular and radial directions. (Figure: real part of the ART basis functions.) ART coefficients are divided by the magnitude of the coefficient of order n = 0, m = 0, which is itself not used as a descriptor element. Each coefficient is quantized to 4 bits to minimize the size of the descriptor.
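The equations for this slide did not survive in the transcript. As a reminder of the usual ART definition in polar coordinates (stated here from the general literature on the transform, not recovered from the slide), the coefficient and the separable basis function are commonly written as:

```latex
F_{nm} = \langle V_{nm}(\rho,\theta),\, f(\rho,\theta) \rangle
       = \int_{0}^{2\pi}\!\int_{0}^{1} V_{nm}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho \, d\rho \, d\theta ,
\qquad
V_{nm}(\rho,\theta) = A_m(\theta)\, R_n(\rho),

A_m(\theta) = \frac{1}{2\pi}\, e^{jm\theta},
\qquad
R_n(\rho) =
\begin{cases}
1, & n = 0 \\
2\cos(\pi n \rho), & n \neq 0
\end{cases}
```

With n < 3 and m < 12 there are 36 coefficients; dropping the n = 0, m = 0 term used for normalisation leaves 35 descriptor elements, or 140 bits at 4 bits per coefficient.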
  49. 49. Shape Descriptors. – Contour-based Shape Descriptor: describes a closed contour of a 2D object or region in an image or video sequence. It is based on the Curvature Scale Space (CSS) representation of the contour. (Figures: a 2D visual object (region) and its corresponding shape; CSS image formation by smoothing and the evolution of zero-crossings.)
      Field | No. of bits | Meaning
      NumberOfPeaks | 6 | No. of peaks in CSS image
      GlobalCurvature | 2×6 | Circularity and eccentricity of the contour
      PrototypeCurvature | 2×6 | Circularity and eccentricity of the smoothed contour
      HighestPeakY | 7 | Absolute height of the highest peak (quantized)
      PeakX[] | 6 | X-position on the contour of a peak (quantized)
      PeakY[] | 3 | Height of the peak (quantized)
  50. 50. Shape Descriptors. The Contour-based Shape Descriptor has the following properties: – It can distinguish between shapes that have similar region-shape properties but different contour-shape properties. – It supports search for shapes that are semantically similar for humans. – It is robust to significant non-rigid deformations. – It is robust to distortions in the contour due to perspective transformations, which are common in images and video. – It is robust to noise present on the contour. – It is very compact (14 bytes per contour on average). – It is easy to implement and offers fast extraction and matching.
  51. 51. Shape Descriptors (3-Dimensional Class). – 3-D shape spectrum descriptor: specifies an intrinsic shape description for 3D mesh models, exploiting local attributes of the 3D surface. The shape index, introduced by Koenderink, is defined as a function of the two principal curvatures at a point p on the 3D surface S. By definition, the shape index value lies in the interval [0, 1]. The shape spectrum of the 3D mesh (3D-SSD) is the histogram of the shape indices I_p calculated over the entire mesh.
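The shape index formula is missing from the transcript. A commonly quoted form is I_p = 1/2 - (1/π) arctan((k1 + k2) / (k1 - k2)) with principal curvatures k1 ≥ k2, which maps into [0, 1] and is undefined for planar points. The sketch below uses that form as an assumption and builds an unweighted histogram; the normative descriptor weights contributions by surface area and treats planar and singular regions separately, which is not reproduced here.

```python
import math


def shape_index(k1, k2):
    """Shape index from two principal curvatures (assumed formula).

    Returns None for planar points (k1 == k2 == 0), where it is undefined.
    """
    if k1 < k2:
        k1, k2 = k2, k1
    if k1 == 0 and k2 == 0:
        return None
    # atan2 handles the umbilic case k1 == k2 without dividing by zero.
    return 0.5 - math.atan2(k1 + k2, k1 - k2) / math.pi


def shape_spectrum(indices, bins=100):
    """Normalised histogram of shape indices over a mesh (unweighted)."""
    valid = [i for i in indices if i is not None]
    hist = [0] * bins
    for i in valid:
        hist[min(int(i * bins), bins - 1)] += 1
    return [h / len(valid) for h in hist] if valid else hist


curvatures = [(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0), (0.0, 0.0)]
indices = [shape_index(k1, k2) for k1, k2 in curvatures]
print(indices)
```

The symmetric saddle (k1 = -k2) lands at 0.5, and the two umbilic cases land at the ends of the interval, which is the property that makes the index a useful local shape label.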
  52. 52. Motion Descriptors  Camera Motion Descriptor  Motion Trajectory Descriptor  Parametric Motion Descriptor  Motion Activity Descriptor Moving region Video segment Camera motion Mosaic Motion trajectory Motion activity Warping Parametric motion parameters 52
  53. 53. Motion Descriptors  Motion Descriptors – Camera Motions: pan, track, tilt, boom, zoom, dolly, roll, absence perspective projection and camera motion parameters 53
  54. 54. Motion Descriptors. – Motion Trajectory: describes the displacement of objects over time. It is a high-level feature associated with a moving region, defined as the spatiotemporal localization of one of its representative points (such as its center) as a list of key points (x, y, z, t). – Parametric Motion: describes the motion of objects in video sequences as a 2D parametric model. Affine models (6 parameters): translations, rotations, scaling, and combinations of these. Planar perspective models (8 parameters): global deformations with perspective projections. Quadratic models (12 parameters): describe more complex movements. – Motion Activity: the intuitive notion of ‘intensity of action’ or ‘pace of action’ in a video segment. Example of high “activity”: goal scoring in a soccer match. It can be used in diverse applications such as content repurposing, video summarization, surveillance, and content-based querying. Four attributes: intensity of activity, indicating high or low activity by an integer in the range 1 to 5; direction of activity, expressing the dominant direction of the activity, if any; spatial distribution of activity, the number and size of active regions in a frame; temporal distribution of activity, expressing the variation of activity over the duration.
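To make the parameter counts concrete, a six-parameter affine model gives the displacement of every pixel as a linear function of its coordinates. The sketch below uses one plausible parameter ordering (a1..a3 for the horizontal component, a4..a6 for the vertical); that ordering and the function name are assumptions for illustration and may differ from the ordering used in the standard.

```python
def affine_displacement(params, x, y):
    """Displacement (dx, dy) at pixel (x, y) under a 6-parameter affine model.

    Assumed ordering: dx = a1 + a2*x + a3*y, dy = a4 + a5*x + a6*y.
    """
    a1, a2, a3, a4, a5, a6 = params
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)


# Pure translation: every pixel moves by the same amount.
translation = (2, 0, 0, -1, 0, 0)
print(affine_displacement(translation, 10, 20))   # (2, -1)

# Uniform 10% zoom about the origin: displacement grows with distance.
zoom = (0.0, 0.1, 0.0, 0.0, 0.0, 0.1)
print(affine_displacement(zoom, 10, 20))
```

The perspective and quadratic models extend the same idea with extra terms, which is where their 8 and 12 parameters come from.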
  55. 55. Localization Descriptors  Localization Descriptors – Region Locator : Localization of regions within images or frames by specifying them with a brief and scalable representation of a Box or a Polygon. Procedure consists of the following 2 steps  Extraction of vertices of the region to be localized  Localization of the region within the image or frame (localization using a polygonal and Box element of the RegionLocator) – Spatio Temporal Locator: describes spatial-temporal regions in a video sequence, such as moving object regions, and provides localization functionality. 55
  56. 56. Face Recognition Descriptor. FaceRecognition Descriptor: used to retrieve face images that match a query face image. –Face recognition: the descriptor is the projection of a face vector onto a set of 48 basis eigenvectors U (‘eigenfaces’) that span the space of possible face vectors. –Feature extraction: the FaceRecognition feature set is extracted from a normalized face image of 56 lines with 46 intensity values per line. The centers of the two eyes are located on the 24th row, at the 16th and 31st columns for the right and left eye respectively. The features are given by the vector W, obtained by projecting the difference between the face vector and the mean face vector onto U. The features are then normalized and clipped using Z = 2048.
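The projection step described on this slide is an ordinary eigenface projection: subtract the mean face, then take the inner product with each basis vector. The sketch below shows only that step on toy data; the function name is hypothetical, and the normalisation and clipping with Z = 2048 are left out because their exact form is not recoverable from the transcript.

```python
def project_face(face, mean_face, basis):
    """Project a raster-scanned face vector onto a list of basis vectors.

    For the descriptor, face and mean_face would have 56 * 46 = 2576
    entries and basis would hold 48 eigenvectors of the same length.
    """
    if len(face) != len(mean_face):
        raise ValueError("face and mean face must have the same length")
    centred = [f - m for f, m in zip(face, mean_face)]
    return [sum(u_i * c for u_i, c in zip(u, centred)) for u in basis]


# Toy example with 4-pixel "faces" and 2 basis vectors (illustrative only).
face = [10, 20, 30, 40]
mean_face = [10, 10, 10, 10]
basis = [[1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]]
features = project_face(face, mean_face, basis)
print(features)   # [0, 30.0]
```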
  57. 57. Face descriptor – Automatic Face Image Localization (Block Diagram of the Automatic face Image Localization algorithm)  Color Segmentation (A color segmentation example: a) the skin color region in the Cb-Cr plane b) original image c) results of the color segmentation algorithm) 57
  58. 58. Audio descriptors  Overview of Audio Framework including Descriptors 58
  59. 59. Audio Descriptors  Basic Descriptors: temporally sampled scalar values for general use, applicable to all kinds of signals – AudioWaveform Descriptor : Audio waveform envelope (minimum and maximum), typically for display purposes – AudioPower Descriptor : the temporally smoothed instantaneous power, which is useful as a quick summary of a signal, and in conjunction with the power spectrum.  Basic Spectral Descriptors: all deriving from a single time-frequency analysis of an audio signal – AudioSpectrumEnvelope Descriptor : a logarithmic-frequency spectrum, spaced by a power-of-two divider (multiple of an octave) – AudioSpectrumCentroid Descriptor : the center of gravity of the log- frequency power spectrum, which describes the shape of the power spectrum 59
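The two basic descriptors are simple per-frame statistics of the waveform. The sketch below computes a min/max envelope and a mean-square power per non-overlapping frame; the frame length `hop` and the function names are assumptions, and the temporal smoothing mentioned for AudioPower is reduced here to plain frame averaging.

```python
def split_frames(samples, hop):
    """Cut the signal into consecutive non-overlapping frames."""
    if hop <= 0:
        raise ValueError("hop must be positive")
    return [samples[i:i + hop] for i in range(0, len(samples), hop)]


def audio_waveform(samples, hop):
    """(minimum, maximum) of each frame: the envelope used for display."""
    return [(min(f), max(f)) for f in split_frames(samples, hop)]


def audio_power(samples, hop):
    """Mean of the squared samples in each frame."""
    return [sum(x * x for x in f) / len(f) for f in split_frames(samples, hop)]


signal = [0, 1, -1, 0, 2, 2, -2, -2]
print(audio_waveform(signal, hop=4))   # [(-1, 1), (-2, 2)]
print(audio_power(signal, hop=4))      # [0.5, 4.0]
```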
  60. 60. Audio Descriptors. – AudioSpectrumSpread Descriptor: complements the previous descriptor by describing the second moment of the log-frequency power spectrum. This may help distinguish between pure-tone and noise-like sounds. – AudioSpectrumFlatness Descriptor: the flatness properties of the spectrum of an audio signal for each of a number of frequency bands. When this indicates a high deviation from a flat spectral shape for a given band, it may signal the presence of tonal components. (Example: AudioSpectrumEnvelope description of a pop song, visualized using a spectrogram.) The required data storage is NM values, where N is the number of spectrum bins and M is the number of time points.
  61. 61. Audio Descriptors  Spectral Basis Descriptor: low-dimensional projections of a high- dimensional spectral space to aid compactness and recognition, which are used primarily with the Sound Classification and Indexing Description Tools – AudioSpectrumBasis : a series of basis functions that are derived from the singular value decomposition of a normalized power spectrum – AudioSpectrumProjection : Used with above descriptor, and represents low- dimensional features of a spectrum after projection upon a reduced rank basis. (Example: A 10-basis component reconstruction showing most of the detail of the original spectrogram including guitar, bass guitar, etc.) The left vectors are an AudioSpectrumBasis Descriptor and the top vectors are the corresponding AudioSpectrumProjection Descriptor. The required data storage is 10(M+N) values 61
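The storage figures quoted for the spectral descriptors are easy to compare numerically: a full spectrogram needs N × M values, while a rank-k basis plus its projection needs k(M + N). The sketch below works this out for assumed sizes (N = 32 bins, M = 1000 time points); the function names and the sizes are illustrative only.

```python
def spectrogram_storage(num_bins, num_frames):
    """Values needed for a full N x M spectrogram."""
    return num_bins * num_frames


def basis_projection_storage(num_bins, num_frames, rank=10):
    """Values needed for `rank` basis vectors plus their projections."""
    return rank * (num_frames + num_bins)


full = spectrogram_storage(32, 1000)
reduced = basis_projection_storage(32, 1000, rank=10)
print(full, reduced, round(full / reduced, 2))   # 32000 10320 3.1
```

The saving grows with the number of spectrum bins, since the reduced form scales with M + N rather than with their product.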
  62. 62. Audio Descriptors  Signal Parameters : apply chiefly to periodic or quasi-periodic signals – AudioFundamentalFrequency Descriptor : the fundamental frequency of an audio signal. The representation includes a confidence measure, in recognition of the fact that the various extraction methods, commonly called “pitch-tracking”, are not perfectly accurate. – AudioHarmonicity Descriptor : the harmonicity of a signal, allowing distinction between sounds with a harmonic spectrum (e.g., musical tones or voiced speech [vowels like ‘a’]), sounds with an inharmonic spectrum (e.g., metallic or bell-like sounds) and sounds with a non-harmonic spectrum (e.g., noise, unvoiced speech [fricatives like ‘f’], or dense mixtures of instruments). 62
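One common way to obtain both a pitch estimate and a confidence value is normalized autocorrelation, sketched below. MPEG-7 does not mandate a particular pitch tracker, so this is only one possible extraction method; the function name, the search range (`f_min`, `f_max`), and the use of the peak correlation as the confidence value are assumptions.

```python
import math


def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Return (f0_hz, confidence) from the best normalized autocorrelation lag.

    The confidence is the peak normalized correlation: close to 1 for a
    strongly periodic frame, close to 0 for noise. Returns (0.0, 0.0) when
    no lag correlates positively (for example, on silence).
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        head = samples[:len(samples) - lag]
        tail = samples[lag:]
        energy = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
        if energy == 0.0:
            continue
        score = sum(x * y for x, y in zip(head, tail)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag == 0:
        return 0.0, 0.0
    return sample_rate / best_lag, best_score
```

The search range matters: if it spans more than one octave below the true pitch, a lag of two periods correlates just as well as one period, which is a typical source of the pitch-tracking errors that motivate the confidence measure.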
  63. 63. Audio Descriptors  Timbral Temporal Descriptors : temporal characteristics of segments of sounds, useful for the description of musical timbre (the characteristic tone quality independent of pitch and loudness). – LogAttackTime Descriptor : the ‘attack’ of a sound, i.e. the (logarithm of the) time it takes for the signal to rise from silence to the maximum amplitude. It distinguishes a sudden sound from a smooth one – TemporalCentroid Descriptor : also characterizes the signal envelope, representing where in time the energy of a signal is focused. It can distinguish, for example, a decaying piano note from a sustained organ note when the lengths and the attacks of the two notes are identical.  Timbral Spectral Descriptors : spectral features in a linear-frequency space, especially applicable to the perception of musical timbre. – SpectralCentroid Descriptor : the power-weighted average of the frequency of the bins in the linear power spectrum. Very similar to the AudioSpectrumCentroid, but specialized for use in distinguishing musical instrument timbres. It indicates the “sharpness” of a sound. 63
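The two timbral temporal descriptors can be sketched from a sampled amplitude envelope. The sketch assumes a non-empty, non-negative envelope, takes the attack to start where the envelope first reaches a small fraction of its peak (the `start_fraction` value is an arbitrary choice for illustration), and uses a base-10 logarithm of the attack duration in seconds.

```python
import math


def log_attack_time(envelope, sample_rate, start_fraction=0.02):
    """log10 of the time (seconds) from attack start to the envelope peak."""
    peak = max(envelope)
    t_peak = envelope.index(peak)
    threshold = start_fraction * peak
    # First sample at or above the threshold marks the (assumed) attack start.
    t_start = next(i for i, v in enumerate(envelope) if v >= threshold)
    duration = (t_peak - t_start) / sample_rate
    return math.log10(duration) if duration > 0 else float("-inf")


def temporal_centroid(envelope, sample_rate):
    """Envelope-weighted mean time in seconds."""
    total = sum(envelope)
    if total == 0:
        return 0.0
    return sum(i * v for i, v in enumerate(envelope)) / (total * sample_rate)
```

A decaying envelope concentrates its weight near the start and therefore has an earlier temporal centroid than a sustained envelope of the same length, which is the piano versus organ distinction mentioned above.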
  64. 64. Audio Descriptors – HarmonicSpectralCentroid Descriptor : the amplitude-weighted mean of the harmonic peaks of the spectrum. It has similar semantics to the other centroid descriptors, but applies only to the harmonic parts of the musical tone. – HarmonicSpectralDeviation Descriptor : the spectral deviation of log-amplitude components from a global spectral envelope. – HarmonicSpectralSpread Descriptor : the amplitude-weighted standard deviation of the harmonic peaks of the spectrum, normalized by the instantaneous HarmonicSpectralCentroid. – HarmonicSpectralVariation Descriptor : the normalized correlation between the amplitudes of the harmonic peaks in two subsequent time-slices of the signal.  Silence Segment : attaches the simple semantic of “silence” (i.e. no significant signal) to an Audio Segment. It may be used to aid further segmentation of the audio stream, or as a hint not to process a segment. 64
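The harmonic descriptors above operate on a list of detected harmonic peaks (frequency and amplitude) rather than on the whole spectrum. The sketch below follows the wording of the slide literally and is not taken from the standard's text: the function names are hypothetical, peak detection is assumed to have been done already, and the exact weighting used by the standard may differ from the plain amplitude weighting shown here.

```python
import math


def harmonic_spectral_centroid(freqs, amps):
    """Amplitude-weighted mean frequency of the harmonic peaks."""
    return sum(f * a for f, a in zip(freqs, amps)) / sum(amps)


def harmonic_spectral_spread(freqs, amps):
    """Amplitude-weighted standard deviation of the peaks, divided by the centroid."""
    centroid = harmonic_spectral_centroid(freqs, amps)
    variance = sum(a * (f - centroid) ** 2 for f, a in zip(freqs, amps)) / sum(amps)
    return math.sqrt(variance) / centroid


def peak_correlation(amps_prev, amps_cur):
    """Normalized correlation of harmonic amplitudes in two successive time-slices."""
    norm = math.sqrt(sum(a * a for a in amps_prev) * sum(b * b for b in amps_cur))
    if norm == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(amps_prev, amps_cur)) / norm
```

A correlation near 1 means the relative strengths of the harmonics barely changed between the two slices; a low value indicates that the harmonic balance shifted.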
  65. 65. Audio Descriptors  High-level Audio Description Tools (Ds and DSs) – Audio Signature DS : a condensed representation of an audio signal designed to provide a unique content identifier for the purpose of robust automatic identification of audio signals. Applications include audio fingerprinting and the identification of audio based on a database of known works – Musical Instrument Timbre Description Tools  HarmonicInstrumentTimbre Descriptor : combines the four harmonic timbral spectral Descriptors with the LogAttackTime Descriptor  PercussiveInstrumentTimbre Descriptor : combines the timbral temporal Descriptors with a SpectralCentroid Descriptor – Melody Description Tools  Include a rich representation for monophonic melodic information to facilitate efficient, robust, and expressive melodic similarity matching.  MelodyContour DS: a terse, efficient melody contour representation  MelodySequence DS: a more verbose, complete, expressive melody representation 65
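To give a feel for how terse a contour representation is, the sketch below reduces a monophonic melody to a sequence of five-level steps. The input format (MIDI note numbers) and the thresholds (no change, a step of one or two semitones, a leap of three or more) are assumptions made for this illustration and are not quoted from the standard.

```python
def melody_contour(midi_notes):
    """Map each interval between successive notes to a value in {-2, -1, 0, 1, 2}.

    0 means a repeated pitch, +/-1 a step of one or two semitones, and
    +/-2 a leap of three or more semitones (assumed thresholds).
    """
    def step(delta):
        if delta == 0:
            return 0
        magnitude = 1 if abs(delta) <= 2 else 2
        return magnitude if delta > 0 else -magnitude

    return [step(b - a) for a, b in zip(midi_notes, midi_notes[1:])]
```

Because only the direction and rough size of each interval are kept, the same tune sung in a different key produces the same contour, which is what makes the representation robust for similarity matching such as query by humming.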
  66. 66. Audio Descriptors – General Sound Recognition and Indexing Description Tools  A collection of tools for indexing and categorization of sounds (e.g., sound effects) in general  SoundModelStatePath Descriptor: the sequence of states generated by a sound model  SoundModelStateHistogram Descriptor: the normalized histogram of the state sequence generated by a sound model – Spoken Content Description Tools  Consist of combined word and phone lattices for each speaker in an audio stream. Phone lattices are used to alleviate the out-of-vocabulary (OOV) problem  SpokenContentLattice Description Scheme : the actual decoding produced by an ASR (Automatic Speech Recognition) engine.  SpokenContentHeader : information about the speakers being recognized and the recognizer itself. 66
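The relationship between a state path and its histogram is simple enough to show directly. In this sketch the state path is assumed to be a list of integer state indices (for example, the output of decoding with a hidden Markov model), and the function name and the `n_states` parameter are hypothetical.

```python
from collections import Counter


def state_histogram(state_path, n_states):
    """Fraction of time spent in each state, indexed 0 .. n_states - 1."""
    if not state_path:
        return [0.0] * n_states
    counts = Counter(state_path)
    total = len(state_path)
    return [counts.get(state, 0) / total for state in range(n_states)]
```

The histogram discards the order of the states, so two sounds of different lengths can be compared with a fixed-size vector.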
  67. 67. References  Book – Introduction to MPEG-7: Multimedia Content Description Interface B. S. Manjunath (Editor), Philippe Salembier (Editor), Thomas Sikora (Editor) ISBN: 0-471-48678-7 productCd-0471486787.html  MPEG-7  MPEG-7 DDL Homepage 67