Mpeg7

Introduction to MPEG-7

Guest lecture for ECE417 TSH

Charlie Dagli
[dagli@illinois.edu]

April 7, 2009

Contents
This lecture : A general idea of MPEG – 7
MPEG-7
–Background
–Introduction
–Components of MPEG-7
 Description Definition Language (DDL)
 Multimedia Description Scheme (MDS)
 Video Descriptors
 Audio Descriptors
–References

2

Background
 Search and Retrieval of Multimedia data
– In recent years, there has been a huge increasing amount of audiovisual data
that is becoming available
– Applications
 Large-scale multimedia search engines on the Web
 Media asset management systems in corporations
 AV broadcast servers
 Personal media servers…
– Need: Retrieval, search, storage of the AV-data with higher level concept
– A solver:
 Efficient processing tools to create description of AV material or to support the
identification or retrieval of AV documents.
– The research activity on processing tools, the need for interoperability
between devices has been recognized and standardization activities have
been launched.
 MPEG-7, “MULTIMEDIA CONTENT DESCRIPTION INTERFACE”,
standardizes the description of multimedia content supporting wide range of
applications.
 MPEG stands for Moving Picture Experts Group (1988)
3

Introduction : What is MPEG-7?
“Multimedia Content Description Interface”
–Intuition:
 NOT focus so much on processing tools
 Concentrate more on the selection of features that have to be described
 Find a way to structure and instantiate the selected features with a
common language
–Efficient representation of audio-visual (AV) meta-data
–Goal: allow interoperable searching, indexing, filtering and
access of multimedia content by enabling interoperability
among devices that deal with multimedia content description.

4

MPEG-7 Main Elements
 Descriptor (D) – standardized “audio
only” and “visual only” descriptors. <ex>
a time code for duration, color histograms
for color.
 Multimedia Description Scheme (MDS)
– standardized description schemes for
audio and visual descriptors. <ex> video:
temporally structured scenes and shots,
including textual descriptors at the scene
level and color, motion, audio amplitude
descriptors at the shot level.
 Description Definition Language
(DDL) – provides a standardized
language to express description schemes,
– based on XML (eXtensible Markup
Language) – a language that allows the
creation of new description schemes, and
possibly, descriptors. Also allows the
extension and modification of existing
description schemes.
5

What can MPEG-7 do?
 Increasing availability of potentially interesting audiovisual
materials makes search more difficult.

 The searching system that any type of AV material may be
retrieved by means of any type of query materials, such as video,
music, speech, etc.
– Some query examples
 Music : Play a few notes on a keyboard and get in return a list of musical pieces
containing the required tune or images somehow matching the notes.
 Image : Define objects, including color patches or textures and get in return
examples among which you select the interesting objects to compose your image
 Voice : Using an excerpt of Pavarotti’s voice, and getting a list of Pavarotti’s
records, video clips where Pavarotti is singing or video clips where Pavarotti is
present.
 Sports video analysis: can be solved by a much easier way with better results

6

Application Areas
 Application domains listed in the MPEG-7 Applications document:
– Education
– Journalism (e.g. searching speeches of person using his name, his voice or
his face)
– Tourist information
– Cultural services (museum, art gallery, digital library)
– Entertainment (searching a game, karaoke)
– Investigation services (human characteristics recognition)
– Geographical information systems
– Remote sensing
– Surveillance (traffic control, surface transportation)
– Shopping
– Architecture, real estate, and interior design
– Social (Dating Service)
– Film, Video and Radio archives. ……..
– Audiovisual content production

7

MPEG-7 v.s. previous MPEG activities
 MPEG-1,2, & 4 are designed to represent the information itself,
while MPEG-7 is meant to represent information about the
information.

 MPEG-1,2, & 4 made content available, MPEG-7 allows you to
find the content you need.

 Also, MPEG-7 can be used independently of the other MPEG
standards – the description might even be attached to an analog
movie.

8

MPEG-7 Parts
 ISO/IEC TR 15938-1 (Systems)
– The binary format for encoding MPEG-7 descriptions and the terminal architecture.
 ISO/IEC TR 15938-2 (Description Definition Language)
– The language for defining the syntax of the MPEG-7 Description Tools and for
defining new Description Schemes.
 ISO/IEC TR 15938-3 (Visual)
– The Description Tools dealing with Visual descriptions.
 ISO/IEC TR 15938-4 (Audio)
– The Description Tools dealing with Audio descriptions.
 ISO/IEC TR 15938-5 (Multimedia Description Schemes)
– The Description Tools dealing with generic features and multimedia descriptions.
 ISO/IEC TR 15938-6 (Reference Software)
– A Software implementation of relevant parts of the MPEG-7 Standard with
normative status.
 ISO/IEC TR 15938-7 (Conformance Testing)
– Guidelines and procedures for testing conformance of MPEG-7 implementations
 ISO/IEC TR 15938-8 (Extraction and use of descriptions)
– Informative material (in the form of a Technical Report) about the extraction and use
of some of the Description Tools.

9

Next… Description Definition Language (DDL)

10

Description Definition Language (DDL)
 Foundations of MPEG-7 standard, provides the language for
defining the structure and content of multimedia information
 A schema language to represent the results of modeling
audiovisual data, (i.e. descriptors, and description schemes) as
a set of syntactic, structural and value constraints to which
valid MPEG-7 descriptors, description schemes, and
descriptions must confirm.
 Also provide the rules by which user can combine, extend, and
refine existing description schemes and descriptors.
 XML. Example
<PersonName>
<Title> Prof. </Title>
<Firstname>Thomas </Firstname>
<Lastname>Huang</Lastname>
<Nickname>Tom</Nickname>
</PersonName>

11

Next…Multimedia Description Schemes (MDS)

12

Multimedia Description Schemes (MDS)
 An overview of the organization of MPEG-7 MDS : Organized
in 6 Areas, Basic Elements, Content Descriptions, Content
Organization, Content management, Navigation and Access, and
User Interaction

13

MDS: Basic Elements
Basic Elements – fundamental constructs of the
definition of MPEG-7 description schemes
–Schema Tools :
 facilitate the creation of valid MPEG-7 descriptions and packing..
–Basic Data types :
 Integer & Real – represent constrained integer and real value
 Vectors & Matrix – represent arbitrary sized vectors and matrices of
integer or real values
 Probability Vectors & Matrices – represent probability distribution
described using vectors/matrices
 String – represents codes identifying content type, countries, regions,
currencies, and character sets
–Linking, Identification and Localization Tools :
 tools for referencing MPEG-7 descriptions, for linking descriptions to
multimedia content and for describing time in multimedia content

14

MDS: Basic Elements
–Example: Three kinds of media time representation:
t1 t2
Duration
TimeBase

RelTimePoint

 A) Simple time: Specify a time point and a duration
 B) Relative time: Specify a media time point relative to a time base, and a
duration
 C) Incremental time: Specification of time using a predefined interval
called Time Unit and counting the number of intervals (efficient for
periodic signals)

15

MDS: Basic Elements
– Basic Description Tools : A library of description schemes and data types, which
are used as primitive components for building more complex and functionality-
specific description tools found in the rest of MPEG-7.
 Graph and relation tools: weave together complex multimedia description
structures <Graph>
<Node id = “A”/> <Node id = “A”/> <Node id = “A”/> <Node id
– Ex. = “A”/> <Node id = “A”/>
<Relation type = “#r1” source “#A” target = “#B”/>
r3 r3
D <Relation type = “#r2” source “#A” target = “#C”/>
C
B r1 …………..

A
E
r4 r1 r2 </Graph>

 Textual annotations: represent textual descriptions
– Free text annotation : Spain scores a goal against Sweden.
– Keyword annotation : score, Sweden, Spain
 Classification schemes and terms: define and reference vocabularies for
multimedia content descriptors.
– Ex. Part of a ClassificationScheme for sports:
sports
soccer basketball baseball tennis
16

MDS: Basic Elements
 People and locations: represent people and places related to
multimedia content
– Agent: persons, organizations, groups of persons,…
 Ex. <PersonGroup>
<Name>Spanish National Soccer Team </Name>
<Kind><Name>Soccer Team </Name></Kind>
<Member>
<Name> Fernando </Name>
</Member>
<Member>
….
</PersonGroup>

– Places: existing, historical, and fictional places.
 Affective description: describe emotional response to
multimedia content
– Ex. Recording an audience’s excitement while watching an action movie
 Ordering tools:
– Provides a hint for ordering descriptions for presentation based on
information contained in those descriptions
– Ex. Order a set of video segments in a soccer game by the amount of
17
camera zoom within each segment.

Content management and content description

18

MDS: Content Management
 Content management : the description of the life cycle of the
content, from content to consumption
– Creation and Production Description,
 Including title, textual annotation, creators, creation locations, dates, how the data
is classified, review and guidance information, and related multimedia material.
– Usage Description
 Describes information related to the usage rights, usage record, and financial
information.
 Rights information is not explicitly included in the description but links are
provided to the rights holders or right management.
 Usage record description provides information related to the use of the content,
such as broadcasting, or demand delivery.
 Financial information provides information related to the cost of production and
the income resulting from content use.
 Usage description is dynamic and subject to change during the lifetime of the
multimedia content.
– Media Description
 Describes the storage media in particular the compression, coding, and storage
format of multimedia content. It describes the master media that is the original
source from which different instances of the multimedia content are produced.
19

Content management and content description

20

MDS: Structural Content Description
 Content Description: structural and conceptual aspects
– Structure Description: describes the structure of multimedia built around the
notation of Segment Description Scheme that represents the spatial, temporal, or
spatiotemporal portion of the multimedia content
 Segment DSs (the core element)
– Example: Mosaic DS – panoramic view of video segment constructed by
aligning together and warping the frames of a Video Segment upon each other

21

 Specific features for structural data description

Feature Video Still Moving Audio
Segment Region Region Segment
Time X X X

Shape X X

Color X X X

Texture X

Motion X X

Camera X
motion

Audio X X X
features
22

 Examples of Image description with Still Regions

23

MDS: Conceptual Content Description
 Conceptual aspects: describes the multimedia content from
the viewpoint of real-world semantics and conceptual
notations.
– Involve entities such as objects, events, abstract concepts and relationships.
– Segment description schemes and semantic description schemes are related
by a set of links that allows the multimedia content to be described on the
basis of both content structure and semantics together.

24

MDS: Conceptual Content Description
 Example of video segments and Regions Corresponding SegmentRelationship Graph

25

Navigation and access

26

MDS: Navigation and Access
 Facilitating browsing and retrieval by defining summaries,
views, and variations of the multimedia content.
 Summaries: provide compact highlights of the multimedia
content to enable discovering, browsing, navigation, and
visualization of multimedia content.
– Hierarchical navigation mode
– Sequential navigation mode

27

MDS: Navigation and Access
 View: based on partitions and decompositions, which
describes different decompositions of the multimedia signals
in space, time, and frequency. The partitions and
decompositions can be used as different views of the
multimedia content  important for multi-resolution access
and progressive retrieval.

 Variations: provides different variations of multimedia
programs, such as summaries and abstract, scaled,
compressed and low-resolution versions and versions with
different languages and modalities – audio, video, image, text,
and so forth  allow the selection of the most suitable
variation of a multimedia program

28

Content organization

29

MDS: Content Organization
 Content Organization – tools describe collections and models
– Collection: unordered sets of multimedia content, segments, descriptor
instances, concepts or mixed sets of the above
(Example of collections of AV content including the relationships (i.e.
RAB,RBC,RAC) within and across Collection Clusters)

Collection structure

Content collection

Segment collection

Descriptor collection Collection (abstract)

Concept collection

Mixed collection
30

– Model tools: Parameterized representation of an instance or class
multimedia content, descriptors or collections, as follows:
 Probability model : Associates statistics or probabilities with the attributes of
multimedia content, descriptors or collections
 Analytic model: Associates labels or semantics with multimedia content or
collections
 Cluster model: Associates labels or semantics and statistics or probabilities with
multimedia content collections
 Classification model: Describes information about known collections of
multimedia content in terms of labels, semantics, and models that can be used to
classify unknown multimedia content

Model (abstract)

Classification Model
Probability Model Analytic Model Cluster Model

 Cluster Model
 Probability Model  Collection Model  ClusterClassification
Model
 Discrete distribution  Probability Model class
 ProbabilityClassification
 Continuous
Model
distribution
 Finite State Model 31

– Clusters of positive
and negative
examples of images
are described using
Cluster Model tool.

– Soccer video sequence
modeled using State
Transition Model tool.

32

User Interaction

33

MDS: User Interaction
 User interaction describes user preferences and usage history
 Allow matching between user preferences and MPEG-7
content description  facilitate personalization of multimedia
content access, presentation, and consumption.

34

Introduction : What is MPEG-7?
“Multimedia Content Description Interface”
–Intuition:
 NOT focus so much on processing tools
 Concentrate more on the selection of features that have to be described
 Find a way to structure and instantiate the selected features with a
common language
–Provide a way to get information about the audiovisual (AV)
data without the need of performing the actual decoding of these
data.
–Goal: allow interoperable searching, indexing, filtering and
access of multimedia content by enabling interoperability
among devices that deal with multimedia content description.

36

MPEG-7 Main Elements
 Descriptor (D) – provides standardized “audio only” and “visual only”
descriptors. <ex> a time code for duration, color histograms for color.
 Multimedia Description Scheme (MDS) – provides standardized description
schemes involving both audio and visual descriptors. <ex> a movie,
temporally structured as scenes and shots, including textual descriptors at the
scene level and color, motion, audio amplitude descriptors at the shot level.
 Description Definition Language (DDL) – provides a standardized language
to express description schemes,
– based on XML (eXtensible Markup Language) – a language that allows the creation
of new description schemes, and possibly, descriptors. Also allows the extension and
modification of existing description schemes.
 Coding Schemes – compressing MPEG-7 textual XML descriptions into
Binary format (BiM) to satisfy application requirements for compression
efficiency, error resilience, ...

 SYSTEM:

37

Visual Descriptors
 Cover 6 basic visual features as
–Color
–Texture
–Shape
–Motion
–Localization
–Face Recognition

38

Color descriptors
 Color Descriptors
– Color Space : defines the color components as continuous-value entities
 R, G, B
 Y, Cr, Cb
– Y = 0.299R + 0.587G + 0.114B
– Cb = – 0.169R – 0.331G + 0.500B Min (whiteness)
– Cr = 0.500R – 0.419G – 0.081B
 H, S, V (Hue, Saturation, Value)
– A nonlinear transform of the RGB
– Quantized into 16,32,64,128,256 bins for
scalable color descriptor and frames
histogram descriptor
 HMMD (Hue, Max, Min, Diff, Sum)
– Max = max (R, G, B)
– Min = min (R, G, B)
– Diff = Max – Min Max (blackness)
– Sum = (Max + Min ) / 2

 Linear transformation matrix with reference to R, G, B
– Any 3 x 3 color transform matrix that specifies the linear
transformation between RGB and the respective color space.
 Monochrome: Y component alone in YCrCb is used
39

Color Descriptors
–Color Quantization Descriptor : specifies the partitioning of the
given color space into discrete bins.
–Dominant Color Descriptor (DCD): allows specification of a small
number of dominant color values as well as their statistical properties, such as
distribution and variance  provides an effective an compact representation
of colors present in a region or an image.
 DCD is defined to be
F = {(ci, pi, vi), s}, (i = 1, 2, .. N), N is the number of dominant colors
ci  dominant color value, a vector of corresponding color space component
values
pi  the fraction of pixels in the image corresponding to ci
vi  the variation of the color values of the pixels in a cluster around the
corresponding representative color
s  the spatial coherency, represents the overall spatial homogeneity
(Examples of low and high spatial coherency of color)

40

Color Descriptors
–Scalable Color Descriptor : a Haar transform-based encoding
scheme applied across values of a color histogram in the HSV
color space
– Useful for image-to-image matching and retrieval based on color feature. Its
binary representation is scalable in terms of bin numbers and bit
representation accuracy over a broad range of data rate.
–Group-of-Frame or Group-of-Picture Descriptor :
 For joint representation of color-based features for multiple images or multiple
frames in a video segment
 Traditionally for a group of frames or pictures  a key frame or image is
selected and the color-related features of the entire collection are represented by
the chosen sample  unreliable
 By GoF and GoP  histogram based descriptors that reliably capture the color
content of multiple images or video frames.

41

Color Descriptors
– Color Layout Descriptor (CLD) : represents the spatial distribution of
representative colors on a grid superimposed on a region or image. Representation is
based on coefficients of Discrete Cosine Transform. This is a very compact
descriptor being highly efficient in fast browsing and search applications.
– Color Structure Descriptor (CSD): based on color histogram, but aims at
identifying localized color distributions using a small structuring window. To
guarantee, interoperability, the CSD is bound to the HMMD color space.
– CSD: the degree to which its pixels are clumped together relative to the scale of an
associated structuring element.

Examples of structured and unstructured color.

42

Texture Descriptors
 Homogeneous Texture Descriptor (HTD):
– provides a quantitative representation using 62 numbers, consisting of the
mean energy and energy deviation from a set of frequency channel
– Useful for similarity retrieval
– Effective in characterizing homogeneous texture regions
 Texture Browsing Descriptor (TBD):
– Defined for coarse level texture browsing
– Provides a perceptual characterization of texture, similar to human
characterization, in terms of regularity, coarseness and directionality of the
texture pattern.
 Edge Histogram Descriptor (EHD):
– Capture spatial distribution of edges in an image
– Useful in matching regions with partially varying, non-uniform texture.

43

Homogeneous Texture Descriptor
• Texture Descriptor
– Homogeneous Texture Descriptor (HTD): characterize the region
texture using the mean energy and the energy deviation from a set of
frequency channel. The 2D frequency plane is partitioned into 30
channels as the following:
(Frequency layout for
feature extraction)

ω

The Syntax of the HTD is as follows:
HTD = [fDC, fSD, e1, e2, ..,e30, d1, d2, .. ,d30]
Where fDC and fSD are the mean and standard deviation of input images, and ei
and di are the nonlinearly scaled and quantized mean energy and energy
44
deviation of the i-th channel.

Texture Browsing Descriptor
– Texture Browsing : Perceptual characterization of a texture, similar to a human
characterization, in terms of regularity, coarseness and directionality

– TBD = [v1,v2,v3,v4,v5]
 v1 ∈ {1, 2, 3, 4} or {00,01,10,11}: represents the regularity
 v2,v3 ∈ {1, 2, 3, 4, 5, 6} : capture the directionality of the texture
 v4, v5 ∈ {1, 2, 3, 4}: capture the coarseness of the texture

Regularity Semantics
00 irregular
01 slightly regular
10 regular
11 highly regular
Semantics of Regularity.

11 01 00
10
Regularity

Examples of Regularity
45

Edge Histogram Descriptor
– Edge Histogram: represents local edge distribution in the image
 Five types of edges: 5 histogram bins per each sub-image

BinCounts[k] Semantics
BinCounts[0] Vertical edges in sub-image (0,0)
BinCounts[1] Horizontal edges in sub-image (0,0)
BinCounts[2] 45 degree edges in sub-image (0,0)
BinCounts[4] Non-directional edges in sub-image (0,0)
 
BinCounts[76] Horizontal edges in sub-image (3,3)

46

Shape Descriptors
 Shape Descriptors
– Region-based Shape Descriptor
 Expresses pixel distribution within a 2-D object or region.
 Based on both boundary and internal pixels and can describe complex objects
consisting of multiple disconnected regions as well as simple objects with or
without holes.
– Contour-based Shape Descriptor
 Based on CSS representation of the contour
– 3-D Spectrum Descriptor
 Expresses characteristic features of objects represented as discrete polygonal 3-D
meshes.
 Based on the histogram of local geometrical properties of the 3-Dsurfaces of the
object.

47

Shape Descriptors
– Region-based shape descriptor utilizes a set of ART(Angular Radial
Transform) coefficients. Twelve angular and three radial functions are used
(n < 3, m < 12).

Fnm is an ART coefficient of order n and m. V is ART basis function and f is an image function

V (ART basis function) is separable along the angular and radial directions

(Real part of the ART basis functions)
 ART coefficients are divided by the magnitude of ART coefficient of order n= 0, m = 0, which is not used
as a descriptor element.
 Quantization is applied to each coefficient using 4 bit per coefficient to minimize the size of the descriptor
48

Shape Descriptors
– Contour-based Shape Descriptor : describes a closed contour of a 2D object or
region in image or video sequence. Based on the Curvature Scale Space (CSS)
representation of the contour

(A 2D visual object (region) and its corresponding shape)

Field No. of bits Meaning

No. of peaks 6 No. of peaks in CSS image
Circularity and eccentricity
2×6
GlobalCurvature
of the contour
Circularity and eccentricity
2×6
PrototypeCurvature
of the smoothed contour
Absolute height of the highest
HighestPeakY 7
peak (quantized)
X-position on the contour of a
PeakX[] 6
peak (quantized)
Height of the peak
PeakY[] 3
(quantized)

(CSS Image Formation)
49
Smoothing evolution of zero-crossing

Shape Descriptors
 Contour-based Shape Descriptor has the following properties
• It can distinguish between shapes that have similar region-shape properties but
different contour-shape properties.

– · It supports search for shapes that are semantically similar for humans

– · It is robust to significant non-rigid deformations

– · It is robust to distortions in the contour due to perspective transformations, which are
common in the images and video

– · It is robust to noise present on the contour.
– · It is very compact (14 Bytes per contour on average).
– · The descriptor is easy to implement and offers fast extraction and matching.

50

Shape Descriptors
(3-Dimensional Class)
– 3-D Shape spectrum descriptor : This descriptor specifies an intrinsic shape
description for 3D mesh models. It exploits some local attributes of the 3D surface.

 The shape index, introduced by Koenderink, is defined as a function of the two principal
curvatures, and associated with point p on the 3D surface S.

with

 By definition, the shape index value is in the interval [0,1]
 The shape spectrum of the 3D mesh (3D-SSD) is the histogram of the shape indices (Ip‘s)
calculated over the entire mesh.

51

Motion Descriptors
 Camera Motion Descriptor
 Motion Trajectory Descriptor
 Parametric Motion Descriptor
 Motion Activity Descriptor

Moving region
Video segment

Camera motion Mosaic
Motion trajectory
Motion activity
Warping
Parametric motion
parameters
52

Motion Descriptors
 Motion Descriptors
– Camera Motions: pan, track, tilt, boom, zoom, dolly, roll, absence

perspective projection and camera
motion parameters

53

Motion Descriptors
– Motion Trajectory : describes the displacements of objects in time. A high
level feature associated to a moving region, defined as the spatiotemporal
localization of one of its representative points (such as its center) as a list of key
points (x, y, z, t)
– Parametric Motion : describing the motion of objects in video sequences as a 2D
parametric model.
 Affine Models (6): translations, rotations, scaling and combination of these.
 Planar Perspective Models (8) : Global deformations with perspective projections
 Quadratic Models (12) : describes more complex movements
– Motion Activity : Intuitive notion of ‘intensity of action’ or ‘pace of action’ in a
video segment.
 Example of high “activity”: Goal scoring in a soccer match
 Can be used in diverse applications such as content repurposing, video summarization,
surveillance, content-based querying, etc.
 Four attributes:
– Intensity of activity: indicate high or low activity by a integer lying in [1—5]
– Direction of activity: expresses the dominant direction of the activity if any
– Spatial distribution of activity: the number and size of active regions in a frame
– Temporal distribution of activity: expresses the variation of activity over the duration
54

Localization Descriptors
 Localization Descriptors
– Region Locator : Localization of regions within images or frames by specifying
them with a brief and scalable representation of a Box or a Polygon. Procedure
consists of the following 2 steps
 Extraction of vertices of the region to be localized
 Localization of the region within the image or frame

(localization using a polygonal and Box element of the RegionLocator)
– Spatio Temporal Locator: describes spatial-temporal regions in a video
sequence, such as moving object regions, and provides localization
functionality.

55

Face Recognition Descriptor

FaceRecognition Descriptor : Used to retrieve face images which match a query
face image.
–Face Recognition : The projection of a face vector onto a set of 48 basis
eigenvectors U (‘eigenfaces’) which span the space of possible face vectors.
–Feature Extraction : The FaceRecognition feature set is extracted from a
normalized face image. This normalized face image contains 56 lines with 46
intensity values in each line. The centre of the two eyes in each face image are
located on the 24th row and the 16th and 31st column for the right and left eye
respectively.
Features are given by the vector W
and is the mean face vector.
The features are normalized and clipped using Z=2048 as follows.

56

Face descriptor
– Automatic Face Image Localization

(Block Diagram of the Automatic face Image Localization algorithm)
 Color Segmentation

(A color segmentation example: a) the skin color region in the Cb-Cr plane
b) original image c) results of the color segmentation algorithm)

57

Audio descriptors
 Overview of Audio Framework including Descriptors

58

Audio Descriptors
 Basic Descriptors: temporally sampled scalar values for general use,
applicable to all kinds of signals
– AudioWaveform Descriptor : Audio waveform envelope (minimum and
maximum), typically for display purposes
– AudioPower Descriptor : the temporally smoothed instantaneous power,
which is useful as a quick summary of a signal, and in conjunction with the
power spectrum.

 Basic Spectral Descriptors: all deriving from a single time-frequency
analysis of an audio signal
– AudioSpectrumEnvelope Descriptor : a logarithmic-frequency spectrum,
spaced by a power-of-two divider (multiple of an octave)
– AudioSpectrumCentroid Descriptor : the center of gravity of the log-
frequency power spectrum, which describes the shape of the power
spectrum

59

Audio Descriptors
– AudioSpectrumSpread Descriptor : complementary of the previous descriptor
by describing the second moment of log-frequency power spectrum. This may
help distinguish between pure-tone and noise-like sounds
– AudioSpectrumFlatness Descriptor : the flatness properties of the spectrum of
an audio signal for each of a number of frequency bands. When this indicates a high
deviation from a flat spectral shape for a given band, it may signal the presence of
tonal components
(Example of AudioSpectrumEnvelope description of a pop song)
Visualized using a spectrogram.
Required data storage is NM values
where N is the no. of spectrum bins
and M is the no. of time points

60

Audio Descriptors
 Spectral Basis Descriptor: low-dimensional projections of a high-
dimensional spectral space to aid compactness and recognition, which are
used primarily with the Sound Classification and Indexing Description Tools
– AudioSpectrumBasis : a series of basis functions that are derived from the
singular value decomposition of a normalized power spectrum
– AudioSpectrumProjection : Used with above descriptor, and represents low-
dimensional features of a spectrum after projection upon a reduced rank basis.
(Example: A 10-basis component reconstruction showing most of the detail of the
original spectrogram including guitar, bass guitar, etc.)
The left vectors are an AudioSpectrumBasis
Descriptor and the top vectors are the
corresponding AudioSpectrumProjection
Descriptor. The required data storage is
10(M+N) values

61

Audio Descriptors
 Signal Parameters : apply chiefly to periodic or quasi-periodic
signals

– AudioFundamentalFrequency Descriptor : fundamental frequency of an
audio signal, which represents for a confidence measure in recognition of
the fact that the various extraction methods, commonly called “pitch-
tracking”, are not perfectly accurate.

– AudioHarmonicity Descriptor : the harmonicity of a signal, allowing
distinction between sounds with a harmonic spectrum (e.g., musical tones
or voiced speech [vowels like ‘a’]), sounds with an inharmonic spectrum
(e.g., metallic or bell-like sounds) and sounds with a non-harmonic
spectrum (e.g., noise, unvoiced speech [fricatives like ‘f’], or dense
mixtures of instruments).

62

Audio Descriptors
 Timbral Temporal Descriptor : temporal characteristics of segments
of sounds, useful for the description of musical timbre( characteristic tone
quality independent of pitch and loudness).
– LogAttackTime Descriptor : the ‘attack’ of a sound, the time it takes for the signal
to rise from silence to the maximum amplitude. It tells the difference between a
sudden and a smooth sound
– TemporalCentroid Descriptor : the signal envelope, representing where in time the
energy of a signal is focused. It is used for the distinction between a decaying piano
note and a sustained organ note, when the lengths and the attacks of the two notes
are identical.
 Timbral Spectral Descriptor : spectral features in a linear-frequency
space especially applicable to the perception of musical timbre.
– SpectralCentroid Descriptor : the power-weighted average of the frequency of the
bins in the linear power spectrum. Very similar to the AudioSpectrumCentroid, but
specialized for use in distinguishing musical instrument timbres. It tells the
“sharpness” of a sound.

63

Audio Descriptors
– HarmonicSpectralCentroid Descriptor : the amplitude-weighted mean of the
harmonic peaks of the spectrum. It has a similar semantic to the other centroid
descriptors, but applies only to the harmonic parts of the musical tone.
– HarmonicSpectralDeviation Descriptor : the spectral deviation of log-amplitude
components from a global spectral envelope.
– HarmonicSpectralSpread Descriptor : the amplitude-weighted standard deviation
of the harmonic peaks of the spectrum, normalized by the instantaneous
HarmonicSpectralCentroid.
– HarmonicSpectralVariation Descriptor : the normalized correlation between the
amplitude of the harmonic peaks between two subsequent time-slices of the signal.
 Silence Segment : attaches the simple semantic of “silence” (i.e. no
significant signal) to an Audio Segment. It may be used to aid further
segmentation of the audio stream, or as a hint not to process a segment.

64

Audio Descriptors
 High-level Audio Description Tools (Ds and DSs)
– Audio Signature DS : A condensed representation of an audio signal designed to
provide a unique content for the purpose of robust automatic identification of audio
signals. Applications include audio fingerprinting, identification of audio based on a
database of known works
– Musical Instrument Timbre Description Tools
 HarmonicInstrumentTimbre Descriptor : Four harmonic timbral spectral
Descriptors with the LogAttackTime Descriptor
 PercussiveInstrumentTimbre Descriptor : The timbral temporal Descriptors
with a SpectralCentroid Descriptor
– Melody Description Tools
 Include a rich representation for monophonic melodic information to
facilitate efficient, robust, and expressive melodic similarity matching.
 MelodyContour DS: terse, efficient melody contour representation
 MelodySequence DS: a more verbose, complete, expressive melody
representation

65

Audio Descriptors
– General Sound Recognition and Indexing Description Tools
 A collection of tools for indexing and categorization of sound (effects) in
general
 SoundModelStatePath Descriptor: states generated by a sound model
 SoundModelStateHistogram Descriptor: normalized histogram of the state
sequence generated by a sound model

– Spoken Content Description Tools
 Consists of combined word and phone lattices for each speaker in an audio
stream. Use phone lattices to alleviate out-of-vocabulary problem (OOV)
 SpokenContentLattice Description Scheme : the actual decoding produced by
an ASR(Automatic Speech Recognition) engine.
 SpokenContentHeader : information about the speakers being recognized and
the recognizer itself.

66

References
 Book – Introduction to MPEG-7: Multimedia Content
Description Interface
B. S. Manjunath (Editor), Philippe Salembier (Editor), Thomas
Sikora (Editor)
ISBN: 0-471-48678-7
http://www.wiley.com/WileyCDA/WileyTitle/
productCd-0471486787.html

 MPEG-7
http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

 MPEG-7 DDL Homepage
http://archive.dstc.edu.au/mpeg7-ddl/

67

Mpeg7

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Mpeg7

Similar to Mpeg7 (20)

Recently uploaded

Recently uploaded (20)

Mpeg7