Semantic annotation and retrieval
of documentary media objects
Dimitris Kanellopoulos
Educational Software Development Laboratory, Department of Mathematics,
University of Patras, Rio Patras, Greece
Abstract
Purpose – This paper aims to propose a system for the semantic annotation of audio-visual media
objects, which are provided in the documentary domain. It presents the system’s architecture, a
manual annotation tool, an authoring tool and a search engine for the documentary experts. The paper
discusses the merits of a proposed approach of evolving semantic network as the basis for the
audio-visual content description.
Design/methodology/approach – The author demonstrates how documentary media can be
semantically annotated, and how this information can be used for the retrieval of the documentary
media objects. Furthermore, the paper outlines the underlying XML schema-based content description
structures of the proposed system.
Findings – Currently, a flexible organization of documentary media content description and the
related media data is required. Such an organization requires the adaptable construction in the form of
a semantic network. The proposed approach provides semantic structures with the capability to
change and grow, allowing an ongoing task-specific process of inspection and interpretation of source
material. The approach also provides technical memory structures (i.e. information nodes), which
represent the size, duration, and technical format of the physical audio-visual material of any media
type, such as audio, video and 3D animation.
Originality/value – The proposed approach (architecture) is generic and facilitates the dynamic use
of audio-visual material using links, enabling the connection from multi-layered information nodes to
data on a temporal, spatial and spatial-temporal level. It enables the semantic connection between
information nodes using typed relations, thus structuring the information space on a semantic as well
as syntactic level. Since the description of media content holds constant for the associated time
interval, the proposed system can handle multiple content descriptions for the same media unit and
also handle gaps. The results of this research will be valuable not only for documentary experts but for
anyone who needs to manage audiovisual content dynamically and intelligently.
Keywords Documentary, Semantic annotation, Video, Temporal and spatial levels of audiovisual data,
Content management, Audiovisual media, Multimedia
Paper type Research paper
1. Introduction
In the last few years, the general public’s interest in documentaries has grown
enormously. A documentary is the presentation of factual events, often consisting of
footage recorded at the time and place of their occurrence and generally accompanied by
a narrator (Rosenthal and Corner, 2005). Documentary is a media work category, applied
to photography, film and television. It has been developed internationally across a wide
range of formats, including the use of dramatization, observational sequences and
various combinations of interview material with images that portray the real with
different degrees of referentiality and aesthetic crafting. Documentaries often depict
various important topics (e.g. animal life, historical events, tourist attractions etc.) by
mixing photos and videos with commentaries and opinions from experts. All these
elements are organized in narrative form. The definition of documentary often
follows a discursive path. Two factors figure consistently in various definitions:
(1) reality is captured in some forms of documents; and
(2) the documents are subjected to assemblage to serve a larger context.
For the definition of documentary, we adopt the simplest task definition, that of Vertov:
“to capture fragments of reality and combine them meaningfully” (Barnouw, 1993, p. 55).
It can be said that making documentaries is not a science. Documentaries can
relate data from science, but they are not scientific reports. They mix science, narrative
and images, while the filmmakers’ point of view affects the way these are mixed. For
example, a travel documentary is a documentary film (or television program) that
describes travel or tourist attractions in a non-commercial way. It is not a scientific report
but it is based on knowledge about tourist attractions. A representative travel
documentary is Word Travels (IMDb, n.d.) that follows the lives of two young
professional travel writers (Robin Esrock and Julia Dimon), as they journey around the
world in search of stories to experience, write about, and file for their editors.
According to Nichols (2001) in documentary film and video, we can identify six
modes of representation that function something like sub-genres of the documentary
film genre itself: poetic, expository, participatory, observational, reflexive, and
performative. Table I shows the main characteristics and deficiencies of these
documentary modes.
Table I. Documentary modes
Poetic documentary (1920s). Main characteristics: reassembles fragments of the world poetically. Deficiencies: lack of specificity; too abstract.
Expository documentary (1920s). Main characteristics: directly addresses issues in the historical world. Deficiencies: overly didactic.
Observational documentary (1960s). Main characteristics: eschews commentary and reenactment; observes things as they happen. Deficiencies: lack of history and context.
Participatory documentary (1960s). Main characteristics: interviews or interacts with subjects; uses archival film to retrieve history. Deficiencies: excessive faith in witnesses; naive history; too intrusive.
Reflexive documentary (1980s). Main characteristics: questions documentary form; defamiliarizes the other modes. Deficiencies: too abstract; loses sight of actual issues.
Performative documentary (1980s). Main characteristics: stresses subjective aspects of a classically objective discourse. Deficiencies: loss of emphasis on objectivity may relegate such films to the avant-garde; “excessive” use of style.
Modern lightweight digital video cameras and computer-based editing have greatly
aided documentary makers. The first film to take full advantage of this change was
Martin Kunert and Eric Manes’ Voices of Iraq, where 150 digital video cameras were sent
to Iraq during the war and passed out to Iraqis to record themselves. Multimedia
technology allows text, graphics, photos, and audio to be transmitted effectively and
rapidly across media platforms. Media organizations must cope with multimedia
changes that move exponentially to the next competing delivery device. Nowadays, there
is a potentially wide range of applications in the media domain such as search, filtering
of information, media understanding (surveillance, intelligent vision, smart cameras etc.)
or media conversions (speech to text, picture to speech, visual transcoding etc).
Understanding the semantics and meaning of documentaries is urgently needed (Choi, 2010).
Finding the bits of interest (the important parts of a documentary) becomes an increasingly
difficult, frustrating and time-consuming task. Internet users need an intelligent search
engine that performs complex media searches and helps them find media chunks based
on the semantics of the media itself (Dorai et al., 2002). However, media is so rich in its
content variety that it will never be sufficiently described by text or words (Dorai and
Venkatesh, 2001). Besides, humans must take the time to annotate the media chunks.
Media information systems for documentaries should incorporate mechanisms that
interpret, manipulate and generate visual media as well as audible information. A
media infrastructure for documentaries should manipulate self-sufficient components
of documentaries, which can be used in any given production. In order to use such an
independent media item, it is necessary to extract the relationship between the signs of
the audio-visual information unit and the semantics they represent (Eco, 1997). As a
result, media information systems for documentaries such as Terminal_Time (Mateas,
2000) should manage independent media objects and their representations for use in
many different productions. Therefore, we need tools that utilize human actions to
extract the important syntactic, semantic and semiotic aspects of media content
(Brachman and Levesque, 1983) so that descriptions (based on a formal language)
can be constructed. The increasing number of documentaries and their
combinatorial use requires the annotation of media during their production.
Media annotation and querying for documentaries is still a major challenge, as the
gap between the documentary features and the existing media tools is wide. In the last
two decades, many authoring tools have been proposed for multimedia data (Tien and
Cecile, 2003; Ryn et al., 1989). These authoring tools are either application dependent or
provide insufficient authoring features. High-level annotation facilities like annotation
of objects, time, location, events etc can be provided by existing video annotation tools
such as Vannotator (Costa et al., 2002), IBM VideoAnnEx (IBM, n.d.), ELAN (The
Language Archive, n.d.), CAVIAR (The University of Edinburgh, n.d.), and ViPER-GT
(Sourceforge.net, n.d.). Rincon and Martinez-Cantos (2007) describe a video annotation
tool (called AVISA) for video understanding. They analyze the features that must be
present in a video annotation tool for video understanding. However, these features
need to be complemented with finer level annotation methods that are required for the
video documentaries. Automatic video generation systems use descriptions
(annotations) of the media items in order to make decisions about how to create a
video sequence. The structure of annotations is composed of two parts:
(1) The structure of the description (e.g. a documentary film can be described by
fields, such as title, director).
(2) The structure of the values used to fill the description (e.g. “The Civil War” can
be the value of the field title).
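To make these two parts concrete, the following sketch shows a hypothetical XML annotation instance; the element names (documentary, title, director) are illustrative only and are not taken from the schemata proposed later in this paper.

<!-- Hypothetical annotation instance: the element structure is the
     "description structure" (part 1); the text content supplies the
     "values" that fill it (part 2). -->
<documentary>
  <title>The Civil War</title>
  <director>Ken Burns</director>
</documentary>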
According to Bocconi et al. (2008) there are three different types of description
structures:
(1) Keywords-based description structures (or K-annotations), in which each item is
associated with a list of words that represent the item’s content. Representative
video generation systems that use K-annotations are Lev Manovich’s Soft Cinema
(n.d.) and the Korsakow System (Korsakow, n.d.), systems that edit in real-time
by selecting media items from a database. ConTour (Murtaugh, 1996) is another
indicative system that supports evolving documentaries, i.e. documentaries that
could incorporate new media items as soon as they were made.
(2) Properties-based description schemes (or P-annotations) in which items are
annotated with property-value pairs. Representative system of this category is
SemInfo (Little et al., 2002).
(3) Structure-based on relations (or R-annotations). Here, items are annotated with
property-value pairs as in P-annotations only that some of these values are
references to other annotations. A representative system is DISC (Geurts et al.,
2003), which is a multimedia presentation generation system for the domain of
cultural heritage. DISC uses the annotated multimedia repository of the
Rijksmuseum (n.d.) to create multimedia presentations.
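The difference between the three styles can be sketched as follows; the XML fragments are hypothetical and only illustrate the shape each annotation type takes.

<!-- K-annotation: a flat list of keywords attached to a media item. -->
<item keywords="Athens, airport, arrival"/>

<!-- P-annotation: explicit property-value pairs. -->
<item>
  <property name="location" value="Athens"/>
</item>

<!-- R-annotation: a property whose value is a reference to another
     annotation node, turning the metadata into a graph. -->
<item>
  <property name="location" ref="#AthensAnnotation"/>
</item>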
Benitez et al. (2000) presented description schemes (DSs) for image, video, multimedia,
home media, and archive content proposed to the MPEG-7 standard. They used
XML to illustrate and exemplify their description schemes by presenting applications
that already use the proposed structures. These applications are the visual apprentice,
the AMOS-search system, a multimedia broadcast news browser, a storytelling
system, and an image meta-search engine, MetaSEEk.
The AUTEUR system (Nack and Parkes, 1997) synchronizes automatic story
generation for visual media with the stylistic requirements of narrative and medium
related presentation. The AUTEUR system consists of an ontological representation of
narrative elements such as actions, events, and emotional and visual codes, based on a
semantic net of conceptual structures related via six types of semantic links
(e.g. synonym, sub-action, opposition, ambiguity, association, conceptual). A coherent
action-reaction dynamic is provided by the introduction of three event phases,
i.e. motivation, realization and resolution. The essential categories for the structures
are action, character, object, relative position, screen position, geographical space,
functional space and time. The textual representation of this ontology describes
semantic, temporal and relational features of video in hierarchically organized
structures, which overcomes the limitations of keyword-based approaches.
We believe that formal semantics can support the annotation, analysis, retrieval or
reasoning about multimedia assets in the documentary industry. The proliferation of
documentaries and their applications require media annotation that bridges the gap
between documentary technology and media semantics. In line with this, Dorai and
Venkatesh (2001, p. 10) state:
A serious need exists to develop algorithms and technologies that can annotate content with
deep semantics and establish semantic connections between media’s form and function, for
the first time letting users access indexed media and navigate content in unforeseeable and
surprising ways.
The aim of this paper is to propose an agent-oriented programming approach using a
framework for describing the inherent semantics of documentary pieces. In
agent-oriented programming, agent-oriented objects typically have just one method,
with a single parameter. This parameter is a sort of message that is interpreted by the
receiving object, or “agent”, in a way specific to that object or class of objects.
Documentary pieces are unique to video documentaries. For this reason, we have
created a domain-specific representation for the documentary pieces to improve the
retrieval accuracy of documentary video queries.
The remainder of the paper is structured as follows. In Section 2, we discuss issues
concerning documentary authoring, while in Section 3 we present the semantics of
documentary media. In Section 4 we describe the system architecture. In Section 5, we
present our approach for implementing the repository for documentaries and our semantic
network-based approach to data storage and management, and we illustrate the
proposed XML schema-based representational structures. In Section 6, we explain the use
of the proposed system through the tools for annotation, semi-automatic authoring and
semantic retrieval that we have implemented for the documentary video environments.
Finally, in Section 7 we conclude the paper and give directions for further work.
2. Documentary authoring
The conventional understanding of documentary production involves a three-phase
workflow:
(1) pre-production;
(2) production; and
(3) post-production.
Figure 1 illustrates a traditional documentary production model.
Figure 1. Traditional documentary production model
The production model formalizes a cyclic process as opposed to a linear workflow.
Pre-production is a phase of research and ideation where visions are selectively audited
through sketches mostly in text and graphical forms. Production and Post-production
are the phases of iterative processes for gathering and assessing media resources.
Screening is a main method for assessment throughout production and plays an
important role in assessing daily results and edited sequences, determining what
further materials are needed and the methods for acquiring them. In particular, a
documentary screening is the displaying of a documentary referring to a special
showing as part of a documentary’s production and release cycle. The different types
of screenings follow here in their order within a documentary’s development:
(1) Test screening. For early edits of a documentary, informal test screenings are
shown to small target audiences to judge if a documentary will require editing,
reshooting or rewriting.
(2) Focus group screenings are formal test screenings of a documentary with very
detailed documentation of audience responses.
(3) Critic screenings are held for national and major market critics well in advance of
print and television production-cycle deadlines, and are usually by invitation only.
(4) Public preview screenings may serve as final test screenings used to adjust
marketing strategy (radio and TV promotion, etc) or the documentary itself.
(5) A sneak preview is an unannounced documentary screening before formal
release, generally with the usual charge for admission.
Actually, media production for documentaries is a complex, resource-demanding
process that provides a multidimensional network of relationships among the
multimedia information.
Documentary authoring is based on the fundamental processes of media or
hypervideo production. Aubert et al. (2008) identified these fundamental (or canonical)
processes that can be supported in semantically aware media production tools.
According to Aubert et al. (2008) these processes are:
• Premeditate (1): Inscription of marks/organization/browsing. The premeditate process takes place in every step of the authoring activity. Input: thoughts of the author. Output: necessary schemas, annotations, queries or views.
• Create (2): This process exploits existing audiovisual documents.
• Package (3): Inscription of marks/organization/browsing. The metadata structure and accompanying queries and views are present, and can be materialized as a package.
• Annotate (4): Inscription of marks. Creation of the annotations, with spatio-temporal links to the media assets. Input: media sources. Output: annotation structure.
• Query (5): Organization. Queries allow selecting appropriate annotations. Input: basic elements. Output: basic elements matching a specified query.
• Construct message (6): Organization. Structuring of the presentation of data. Input: the ideas from the premeditate process, the annotation structure, queries. Output: draft of views.
• Organize (7): Organization. Definition of views to render the selected annotations. Input: basic elements. Output: view definitions.
• Publish (8): Browsing, Publishing. Content packaging/publishing, i.e. the generation of documents from the templates, occurs in the browsing phase and also in the publishing phase. Input: basic elements. Output: a package and/or rendered views.
• Distribute (9): Browsing, Publishing. The rendition of views is currently done through a standard web browser, or the instrumented video player integrated into the prototype.
Hardman et al. (2008) identified a small set of canonical processes and specified their
inputs and outputs, but deliberately did not specify their inner workings, concentrating
rather on the information flow between them. Indicative examples of invoking
canonical processes are given in Aubert et al. (2008). Currently, many standards
facilitate the exchange between the different media process stages (Pereira et al., 2008),
such as MXF (Material Exchange Format), AAF (Advanced Authoring Format), MOS
(Media Object Server Protocol), and Dublin Core.
The process of documentary authoring can be arranged in three phases: modeling,
annotation and authoring of documentary media.
(1) The modeling phase identifies the various semantics that exist in the
documentary media.
(2) The annotation phase provides the human annotator with various utilities for the
free-text representation of their perception of the documentary.
(3) The authoring phase is meant for the semiautomatic translation of the
annotated media information into XML, validated by the XML Schema
validation tools. Using XML technologies, the semantic multimedia content of
the documentary can be represented in an interoperable way. It is a good idea to
propose substantial customizations based on XML technologies for the
documentaries. Thus, the produced item will be an XML document that
represents the annotation of the real-time video documentary.
Documentary information systems must accommodate these three phases, providing a
common framework for the storage of the authored documentary and for its presentation
interface. Documentary analysis tools should perform the interpretation of
documentaries in the context of culture, mode of documentary, mode of speech, action,
gestures and emotions. Existing tools and systems provide annotation features for the
documentary videos often based on a particular type of documentary (Mateas, 2000). In
addition, they offer a limited number of annotation facilities, thus it becomes difficult to
derive generic facilities. These tools do not provide semiautomatic authoring, which is an
important requirement. It is worth mentioning that Bocconi et al. (2008) describe a model
for automatically generating video documentaries. This allows viewers to specify the
subject and the point of view of the documentary to be generated. However, the domain
of Bocconi et al. is matter-of-opinion documentaries based on interviews.
Agius and Angelides (2005) proposed the COSMOS-7 system that models the objects
along with a set of events in which the objects participate, as well as events along with a
set of objects and temporal relationships between the objects. This system/model
represents the events at a higher level only, like speak, play and listen, and not at the level of
actions, gestures and movements. Harry and Angelides (2001) proposed a semantic
content-based model for semantic-level querying that makes full use of the explicit media
structure, objects, spatial relationships between objects, events and actions involving
objects, temporal relationships between events and actions, and integration between
syntactic and semantic information. Ramadoss and Rajkumar (2007) considered a system
for the semiautomatic annotation of an audio-visual media of dance domain, while Nack
and Putz (2004) presented a framework for the creation, manipulation, and
archiving/retrieval of media documents, applied for the domain of News. In the digital
games and entertainment industry, Burger (2008) stressed the importance of the use of
formal semantics (ontologies) by providing a potential solution based on semantic
technologies. AKTive Media (Chakravarthy et al., 2006) is an ontology-based cross-media
annotation (images and text) system. It includes an automatic process of annotation by
suggesting knowledge to the user in an interactive way while the user is annotating. This
system actively works in the background, interacting with web services and querying the
central annotation store to look for context-specific knowledge. Chakravarthy et al.
(2009) present OntoFilm, a core ontology for film production. OntoFilm provides a
standardized model, which conceptualizes the domain and workflows used at various
stages of the film production process starting from pre-production and planning, shooting
on set, right through to editing and post-production.
In this paper, we propose a documentary video framework in order to incorporate
media semantics for documentaries. This framework provides the XML authored
content of the documentary from the supplied semantic and semiotic annotations by
the human annotators. The proposed requirements are:
(1) A layer-oriented model depicting the documentary pieces as events, which
incorporates the gestures, actions and spatial-temporal relationships of the
subjects (e.g. documentarists) and objects in a documentary. Besides
documentary pieces, other examples of events are setup, background scene
change, and role change by a documentarist.
(2) A semantic network representing the documentary and the individual documentary
pieces, as well as the cognitive aspects, setting, cultural features and story.
(3) An annotation tool for the documentary experts to manually perform the
semantic and semiotic annotations of the documentary media objects like
documentary, documentarists etc.
(4) A semantic querying tool for the documentary experts and users/spectators to
browse and query the documentary media features for designing new
documentary sequences. Some examples of documentary media or video
queries are:
• show me all the pieces of natural history documentaries from Africa;
• tell me all documentary pieces where the documentarist is in danger; and
• find all historical documentary pieces representing the invasion of Normandy, etc.
The query engine should be assisted by proper representations so that the retrieved
result achieves high precision and high recall.
3. The semantics of documentary media
The spatial-temporal delivery of a sequence of the documentary pieces is recorded in a
documentary video, in which each documentary piece consists of a set of subject’s
actions. Each subject action denotes the action of the characters, such as commentarist,
speaker, interviewee etc. The action is represented as <subject-verb-object-adverb>
using the verb-argument structure (Sarkar and Tripasai, 2002) that exists in Linguistics.
This section explains some of the characteristics of documentary media briefly.
Definition 3.1 (Documentary)
The documentary numbered i, DCi,n, consists of a set of documentary video clips
Ci,j performed at a particular setting. That is, DCi,n = {Ci,1, Ci,2, ..., Ci,n}, where n
is the total number of documentary clips. In this sense, the documentary
DC2,3 = {C2,1, C2,2, C2,3} denotes the second documentary, which consists of three
documentary clips C2,1, C2,2, C2,3. For example, if the second documentary DC2,3 is a
travel documentary presenting Holidays in Greece, then the three video clips could be
C2,1 = Arriving at the airport of Athens, C2,2 = Touring Athens and C2,3 = Cruise in
the Rodos island.
Definition 3.2 (Documentary Clip)
A documentary clip Ci,j of the documentary DCi,n consists of a set of documentary
pieces (DP) that are performed by the documentarists. That is,
Ci,j,m = {DPi,j,1, DPi,j,2, ..., DPi,j,m}, where m is the total number of documentary
pieces. For example, the documentary clip
C2,3,7 = {DP2,3,1, DP2,3,2, DP2,3,3, DP2,3,4, DP2,3,5, DP2,3,6, DP2,3,7} denotes the third
video clip (in our example, Cruise in the Rodos island) of the second documentary.
This clip includes seven documentary pieces: DP2,3,1, DP2,3,2, DP2,3,3, DP2,3,4,
DP2,3,5, DP2,3,6, DP2,3,7.
Definition 3.3 (Documentary Piece)
A documentary piece is the basic semantic unit of a documentary, which has a set of
subject’s actions that are performed either sequentially or concurrently by the subjects
(documentarists). It encapsulates the mood, genre, culture, and characters, apart from
the actions. A documentary piece DPi,j,k of the video clip Ci,j represents a meaningful
sequence of subject’s (documentarist) actions (A): DPi,j,k = {A1, A2, ..., Ak}, where k is
the total number of subject’s actions in this documentary piece. For example, the
documentary piece DP2,3,4 = {A1, A2, A3, A4} denotes that piece of the third video clip
(belonging to the second documentary) that includes the first four sequential actions
A1, A2, A3, A4 performed by the subject (documentarist). In our example, these
actions could be:
A1: “The documentarist is visiting the main attractions of the Rodos island in
Greece”.
A2: “The documentarist is taking a swim”.
A3: “The documentarist is participating in the local festival”.
A4: “The documentarist is taking a taste of Rodos nightlife”.
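Taken together, Definitions 3.1-3.3 form a three-level containment hierarchy, which can be restated compactly as:

DCi,n = {Ci,1, ..., Ci,n};  Ci,j,m = {DPi,j,1, ..., DPi,j,m};  DPi,j,k = {A1, ..., Ak}

where each action A is the atomic unit defined next.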
Definition 3.4 (Subject’s (documentarist) action)
The subject/documentarist’s action (A) is represented by an action of a character and is
defined as a tuple <Agent-Action-Target-Speed>, where agent and target are the
body parts of the subject/object, action represents the static poses and gestures in the
universe of actions, and speed denotes the speed of the delivery of the action, that is,
speed = (low, medium, fast, gradual ascending, gradual descending). If only one agent
is involved in an action, then it is called a primitive action; that is, the target agent is empty
or Nil. For example, <documentarist_i.larm – move – nil – fast> shows that documentarist i
moves his left arm fast. If multiple agents are involved in an action or gesture, then the action
is known as a composite action. For instance, <documentarist_i.rhand – touch –
gorilla_j.head – low> denotes that documentarist i touches the head of gorilla j slowly
with his right hand. The content representational structures for these documentary
media semantics are discussed in the following sections.
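As an illustration of how such a tuple might be serialized in the XML content descriptions discussed below, consider the following sketch; the element names are our assumptions, not the published schema.

<!-- Hypothetical serialization of the composite action
     <documentarist_i.rhand - touch - gorilla_j.head - low>. -->
<Action>
  <Agent>documentarist_1.rhand</Agent>
  <ActionVerb>touch</ActionVerb>
  <Target>gorilla_2.head</Target>
  <Speed>low</Speed>
</Action>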
4. The architecture for authoring and querying documentaries
The proposed system (shown in Figure 2) provides an environment supporting the
annotation, authoring, archiving and querying of the documentary media objects. The
aim is to apply the framework to all sorts of documentary types such as natural history
documentary, travel documentary etc.
Figure 2. The architecture of the proposed system
The environment is based on various modules: annotation, archival, querying,
representation structures and the underlying database. The documentary experts
access each of these modules to carry out their specific tasks. It is essential for our
development that these modules be easy and simple to use, thereby minimizing the
complexity of becoming acquainted with the system. The annotation module
takes the raw digital video as input and allows the human annotator to annotate the
different documentary media objects. The generated annotations are described in the
representational structures such as linked lists and hash tables. The authoring module
takes the annotations representing the documentary sequence and translates them into
XML instances automatically. The XML Schema instances that are instantiated by the
authoring module are stored in the back-end database. The query-processing module
allows the documentary experts to pose the different free-text documentary video
queries to the XML annotation, performs search using XQuery (after stemming,
removing the stop words and converting the tokens into XQuery form) and returns the
results of these queries back to the users. Based on these observations, we have identified
a set of required data structures and the associated relations, and have developed tools
for accomplishing the documentary video tasks. Figures 3-5 depict the annotation,
query and semantic annotation processes, respectively.
Figure 3. The annotation process in a UML class diagram
Figure 4. The query process in a UML class diagram
Figure 5. The semantic annotation process in a UML class diagram
5. The model of semantics for documentary media
According to Nack and Putz (2004), annotation is a dynamic and iterative process, and
thus annotations may be incomplete and change over time. Consequently, it is
imperative to provide semantic representation schemes with the capability to change
and grow. In addition, the relation between the different types of structures should be
flexible and dynamic. To achieve this, media annotation should not result in a
monolithic document; rather, it should be organized as a semantic network of content
description documents (Ramadoss and Rajkumar, 2007).
5.1 Layer oriented event description
In the design of the proposed system, we adopted the strata-oriented approach
(Aguierre Smith and Davenport, 1992) and setting (Parkes, 1989) for describing
events such as documentary pieces. Strata-oriented content modeling is an
important knowledge representation method and is well suited to modeling the events
of a documentary presentation. In our framework, each video documentary is
technically described using the size, duration and technical format of the material,
such as mpg, avi etc. Therefore, each documentary can be represented partially
using technical details that belong to the layer of technical details. In addition, each
video documentary is conceptually annotated using high-level semantic descriptors
and thus it can be complementarily represented using such semantic descriptors
that belong to the layer of semantic annotations. The connection between the
different layers is accomplished by a triple <media identifier, start time, end time>.
The proposed representation structure includes many layers (one layer for
each description). The triple identifier is applied in order to achieve the
connection between the different layers and the data to be described (e.g. the actual
audio, video, or audio-visual stream). For instance, a documentarist may perform a
number of actions in the same time span. Start and end time can be used to identify
the temporal relation between the actions. Documentary pieces can be represented in
this way, thereby enabling semantic retrieval. Figure 6 depicts the layered
representation of a shot of 100 frames, representing three actions. Suppose a query
“find a documentary piece of a natural history documentary from Africa, where
documentarist is speaking and touching a gorilla, while the gorilla is eating a banana”.
The matching segment can be easily retrieved by isolating the common parts of the shot,
as depicted in the shaded portion of Figure 6. The temporal relationship between the
actions can be identified using the start and end points with which they are associated.
In this way, complex structured behavior concepts can be represented and hence the
audio-visual material retrieved on this basis.
Figure 6. Layered annotation of actions and isolated segment of a shot for a query
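A minimal sketch of this layered description follows, under invented element names: three concurrent actions are linked to the same stream through <media identifier, start time, end time> triples, so the gorilla query above is answered by intersecting the three time intervals (frames 40-70 in this sketch).

<!-- Three annotation layers over one 100-frame shot; all names and
     frame numbers are illustrative. -->
<AnnotationLayers>
  <Action id="A1" verb="speak" agent="documentarist_1">
    <Link media="doc_africa.avi" start="0" end="100"/>
  </Action>
  <Action id="A2" verb="touch" agent="documentarist_1" target="gorilla_1.head">
    <Link media="doc_africa.avi" start="40" end="70"/>
  </Action>
  <Action id="A3" verb="eat" agent="gorilla_1" target="banana">
    <Link media="doc_africa.avi" start="30" end="90"/>
  </Action>
</AnnotationLayers>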
5.2 Nodes of the proposed framework
Nodes are used to build linked data structures concerning documentaries. Each node
contains some data and possibly links to other nodes. A node can be thought of as a
logical placeholder for some data. It is a memory block, which contains some data unit
and perhaps references to other nodes, which in turn contain data and perhaps references
to yet more nodes. Links between nodes are implemented by pointers or references. By
forming chains of interlinked nodes, very large and complex data structures concerning
documentaries can be formed. As a consequence, semantic structures of documentary’s
pieces can be implemented easily. In our framework, we distinguish two types of nodes,
i.e. data nodes (D-nodes) and conceptual annotation nodes (CA-nodes):
(1) A D-node represents physical audio-visual material of any media type, such as
text, audio, video, 3D animation, 2D image, 3D image, and graphic. The size,
duration, and technical format of the material is not restricted, nor are any
limitations present with respect to the content, i.e. number of persons, actions
and objects. A data node might contain a complete documentary film or merely
a scene. The identification of the node is realised via a URI.
(2) A CA-node provides high-level descriptions of a video documentary. A
high-level description is one that describes “top-level” goals and overall features
of a documentary, is more abstract, and is typically concerned with the
video documentary as a whole. For example, the events that occur in a
documentary (as well as the location, date and time of an event) can be
described by high-level descriptors. The mood of a documentary (e.g. subjective
content: happy, sorrow, romantic etc.) and many other features can also be
described by high-level descriptors. Such descriptors are usually difficult to
obtain using automatic extraction methods, so this type of node is usually
created manually.
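The following sketch contrasts the two node types under assumed names and structure; the published schemata (Table II) may differ.

<!-- A D-node points at physical material, carries technical data and
     is identified by a URI. -->
<DNode uri="http://archive.example.org/media/doc2_clip3.avi"
       mediaType="video" format="avi" duration="PT12M30S"/>

<!-- A CA-node carries high-level, typically hand-made descriptions. -->
<CANode id="CA-17">
  <Event>local festival</Event>
  <Setting location="Rodos, Greece" date="2010-08-15"/>
  <Emotion>happy</Emotion>
</CANode>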
Each node is best understood as an instantiated schema. The available number of node
schemata is restricted, thus indexing and classification can be performed in a
controlled way, whereas the descriptional information space itself might consist of
just one node or up to n nodes. The obvious choice for
representing CA-nodes, each of them describing audiovisual content, would have been
using the DDL of MPEG-7 or suggested schemata by MPEG-7. The MPEG-7 standard
(Martinez et al., 2002; Salembier and Smith, 2002) concentrates on multimedia content
description and constitutes the greatest effort for multimedia description. It is based on
a set of XML Schemas that define 1,182 elements, 417 attributes and 377 complex
types. It is divided into four main components:
(1) the Description Definition Language (DDL, the basic building blocks for the
MPEG-7 metadata language);
(2) audio (the descriptive elements for audio);
(3) visual (those for video); and
(4) the Multimedia Description Schemes (MDS, the descriptors for capturing the
semantic aspects of multimedia contents, e.g. places, documentarists, objects,
events, etc).
We did not choose MPEG-7 because the main weakness of the MPEG-7 standard
is that formal semantics are not included in the definition of the descriptors in a way
that can be easily implemented in a system (Nack et al., 2005). Therefore, we chose to
use XML Schema as the representational scheme for the documentary media due to its
simplicity and maturity. The use of XML technologies implies that a great part of the
semantics remains implicit. Therefore, each time an application is developed,
semantics must be extracted from the standard and re-implemented.
For our documentary media environment, we have developed a set of 14 schemata
that describe the denotative and technical content of the documentary video. The
schemata are designed in such a way that they are semi-automatically instantiated or
authored. They are shown in Table II.
Table II. Schemata for documentaries
Documentary: high-level organizational scheme of a documentary presentation containing all documentary clips.
Documentary Clip: high-level scheme of a documentary consisting of all annotations and relations to other clips.
Documentary Piece: an event representing a meaningful collection of the actions of documentarists.
Subject/Documentarist’s Action: the basic pose, gesture or action done by the documentarist.
Event: the event that occurs in a documentary clip.
Person: a person participating in a documentary, e.g. documentarists, interviewees, narrators, speakers.
Emotion: subjective content like mood or feeling etc.
Setting: the location, date and time of an event.
LifeSpan: duration with start and end times.
Relation: relation between documentary media elements.
STRelation: spatial-temporal relationships of the documentarist.
Link: connections between the media source and the document schemes.
Resource: relation to any URI address.
Basic Info: basic information about the documentary, such as language, video type, recording information, archive information, access rights etc.
The XML schema representation of the 14 schemes can be found in Subsection 5.4.
With these schemes, one can browse (e.g. by documentary, actions, documentarists,
documentary piece, culture, objects etc.) and perform semantic searches (e.g. show
me all natural history documentary pieces).
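By way of illustration, a hypothetical instance skeleton of the three organizational schemata might look as follows; the element and attribute names are assumptions based on Table II, not the actual XSDs of Subsection 5.4.

<!-- Documentary -> Documentary Clip -> Documentary Piece -> Action. -->
<Documentary id="DC2" title="Holidays in Greece" origin="Greece">
  <DocumentaryClip id="C2.3" context="travel" language="el">
    <DocumentaryPiece id="DP2.3.4" mood="happy" genre="observational">
      <Action agent="documentarist_1" verb="swim" speed="medium"/>
      <LifeSpan start="PT4M10S" end="PT5M30S"/>
    </DocumentaryPiece>
  </DocumentaryClip>
</Documentary>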
5.3 Relationships
In our framework, all metadata about the actual audio and video streams of the
documentary are organized in the form of a semantic network. A semantic network is a
network that represents semantic relations among concepts. It is often used as a
form of knowledge representation and is a directed or undirected graph consisting of
vertices, which represent concepts, and edges, which represent the semantic relations
between them. Figure 7 depicts a possible semantic net of a documentary annotation.
From this figure, we can also understand the two ways of annotating documentary
data, based on the requirements of the documentary expert:
(1) either as part of a documentary; or
(2) as a single documentary clip representing one documentary.
Annotation networks of a documentary, clip, documentary piece and media source can be
interconnected through links and relations. There are two types of
connections among the nodes:
(1) Link type: to connect media source and description nodes (represented using
arrow).
(2) Relation type: to connect different annotation nodes (represented using line).
A link connects the media source (audio and video files) to the data node along with its
life spans (i.e. on a temporal level). The XML schema representation of the Link type is
shown below.
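As the corresponding figure is not reproduced here, the following is a plausible reconstruction of such a Link type in XML Schema; the element names and datatypes are assumptions.

<!-- Hypothetical LinkType: a media URI plus a temporal life span. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="LinkType">
    <xs:sequence>
      <xs:element name="MediaLocator" type="xs:anyURI"/>
      <xs:element name="LifeSpan">
        <xs:complexType>
          <xs:attribute name="start" type="xs:duration"/>
          <xs:attribute name="end" type="xs:duration"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:schema>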
5.4 Description schemes for documentaries in XML Schema
The XML schema representation of the relation types is presented hereafter (Figures 8-10).
In our environment, DocumentaryDS and DocumentaryClipDS hold link types,
enabling connections to the documentary video and audio sources. Note that these two
description schemes serve as an entry point to the semantic network. Our front-end
annotation tool performs the semiautomatic instantiation of links. Relation types
perform the connection among the description schemes that are represented as
CA-nodes. Between two nodes, there may exist up to m relationships, and we define the
following relations for our documentary media environment:
• For events: follows, precedes.
• For character, setting, object: part of, association, before, equal, meets, overlaps, during, starts, finishes.
• For documentary pieces: two proposed temporal semantic relationships, follows and precedes.
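As a hedged sketch, instances of such relations connecting two CA-nodes might be serialized as follows; the attribute names are illustrative.

<!-- "Documentary piece DP2.3.4 precedes DP2.3.5"; a spatial-temporal
     relation between two actions is expressed the same way. -->
<Relation type="precedes" source="#DP2.3.4" target="#DP2.3.5"/>
<STRelation type="overlaps" source="#A2" target="#A3"/>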
These temporal semantic relationships help to infer the type of documentary during
query processing. In our environment, relationships are instantiated
semi-automatically by the tool.
Figure 7. A semantic net of a documentary annotation
We now introduce our documentary annotation and
querying tool to instantiate the description schemes that have been designed based on
the concepts of semantic nets. We then introduce our search engine, which allows
users to browse and query the documentary features for composing new
documentaries and for learning purposes.
Figure 8.
Figure 9.
Figure 10.
6. Tools for documentaries
6.1 Annotation and authoring tool
Documentary experts can annotate the documentary or clip by looking at the running
video and using the annotation tool. The video player provides all the standard
facilities like play, start, stop, pause and replay. We used the Cinepak codec for the
conversion of the running video (WinAmp media file) to AVI format. The annotation
tool provides the documentary experts with the facility to annotate the documentary
pieces using free text and a controlled vocabulary, independently of the storage
organization of the generated annotations. We developed the annotation tool using
J2SE1.5 and Java Media Framework 2.0. Figure 11 depicts the GUI of the initial screen
for determining the documentary information.
It is noteworthy that a documentary or a documentary clip constitutes an entry point
to the annotation. The annotation process begins with the documentary expert
describing the metadata about the documentary. The basic metadata (descriptions)
that are common to all documentaries are shown in Table III.
Once the annotation of the documentary has been completed, the documentary
expert can describe individual documentary presentations that are part of that
documentary. We have identified a set of features that correspond to a documentary
clip as depicted in Table IV. The metadata describing a documentary piece that can
be annotated through the annotation tool are as follows (Table V). The metadata
about the person, object and basic media info are shown in Tables VI-VIII,
respectively.
Figure 11. A snapshot of the annotation tool for determining the documentary information
The semi-automated editing suite (Figure 12) provides the documentary expert with an
instant overview of the available material and its essential relations represented
through the spatial order of its presentation. The documentary expert can mark the
relevant video clips or pieces by pointing at the preferred clips or pieces. The order of
pointing indicates the sequential appearance of the clips or pieces. The editing suite
based on a simple planner performs an automated composition of the documentary
clip. At the present stage of development our editing suite uses the meta-information
obtained from the annotation tool to support the video editing process.
Figure 12. The semi-automated editing suite for documentary clips
Table III. Metadata of a documentary
Date and time: date and time of video recording of the documentary.
Media locator: links to video and audio streams.
Media format: format of the video, such as mpg, avi etc.
Media type: type of the media, like video, audio, text etc.
Title: name of the documentary.
Origin: originating country of the documentary.
Duration: life span, i.e. length of the documentary in minutes.
Table IV. Metadata of a documentary clip
Character name, role, gender, life span: the role played by the documentarist, such as commentarist, presenter etc. The life span of the character is necessary because several roles by the same documentarist are possible in a documentary clip.
Context: identifies whether it is a historical documentary, a travel documentary, a documentary without words etc.
Documentary genre: such as poetic, expository, observational, participatory, reflexive, performative.
Language: language used by the documentarists in the audio. Several languages may be used in the same documentary.
Life span: duration of the documentary clip.
Table V. Metadata of a documentary piece
MoodID: subjective content (happy, sorrow, romantic, etc).
Culture: Indian, western, etc documentary pieces.
Genre: such as poetic, expository, observational, participatory, reflexive, performative.
Mode of documentary speech: commentary speech, presenter speech, interview speech in shot, overhead interchange, dramatic dialogue.
Object: background and foreground objects used in a documentary piece.
Action: spatial-temporal actions, gestures, poses of the characters.
Agent: body parts involved.
Related action: associated action.
Target: target body part of the opponent, if any.
Speed: slow, medium, fast, gradual ascending, gradual descending.
Life span: duration of the documentary piece.
Table VI. Metadata of persons
Name: name of the person.
Function: commentarist, speaker, interviewee.
E-mail: contact details.
Table VII. Metadata of objects
Name: name of the background or foreground object.
Type: background or foreground object.
Number of: number of objects.
Shape: shape of the object (in text).
Color: color of the object (in text).
Texture: pattern.
Table VIII. Metadata of media
Recording speed: speed of recording.
Camera details: description of the camera used while recording the documentary.
Access rights: access information.
6.2 Search engine
The search engine enables the documentary experts to design a new documentary
and users to view the documentary pieces themselves. In particular, the user can search
along many dimensions for specific documentary pieces belonging to a video clip. For
example, the user can search for all documentary pieces denoting specific objects, such as
the sun, the moon etc. In addition, the user can search for certain subject’s actions incorporated
into documentary pieces. Furthermore, the user can search for documentary pieces where
the subject (e.g. the documentarist) has a certain mood (happy, angered etc.). In another case,
the user can search for documentary pieces in which the speed of the delivery of the subject’s
actions is low, medium, fast, gradual ascending or gradual descending. The user
can also search for documentary pieces in which a specific song is played. Finally, the
user can use this search engine as a browsing tool, with several built-in categories of
documentary information, and as a query tool, to pose free-text documentary queries.
The retrieval tool facilitates several browsing features for the users. These are:
Documentary: to browse all documentary clips along with the video of their
documentary pieces. Output is rendered in the output window.
Documentary clip: to view all documentary pieces of a clip.
Documentary piece: to view all subject/documentarist actions of a particular clip.
Objects: displays all documentary pieces denoting the sun, the moon, etc.
Tempo: users can browse the documentary pieces according to the speed categories.
Mood: to browse according to feeling, like happy, romantic, etc.
Culture: Indian, western, etc.
Documentarist: all documentary pieces featuring a particular documentarist.
Genre: poetic, expository, observational, participatory, reflexive, performative, etc.
Speech mode: commentary speech, presenter speech, interview speech in shot,
overhead interchange, dramatic dialogue.
Actions: view by specific actions.
Song: view documentary pieces of a song.
Documentary users/spectators can submit their documentary queries in the query
window using keywords as free text. For example, consider the query Q: Show me all
pieces of natural history documentaries. Our framework uses a semantic information
retrieval mechanism, which is similar to that presented in Chen et al. (2010). The use of
semantic information, especially that derived from spatio-temporal analysis, is of great
value in multimedia annotation, archiving and retrieval. Ren et al. (2009) survey the use
of spatiotemporal semantic knowledge for information-based video retrieval and draw
important conclusions on where future research is headed. Liu and Chen (2009) present a
novel framework for content-based video retrieval. They use an unsupervised learning
method to automatically discover and locate the object of interest in a video clip. This
unsupervised learning algorithm alleviates the need for training a large number of object
recognizers. Regional image characteristics are extracted from the object of interest to
form a set of descriptors for each video. A novel ensemble-based matching algorithm
compares the similarity between two videos based on the set of descriptors each video
contains. Videos containing large pose, size, and lighting variations are used to validate
their approach. Finally, Chen et al. (2010) developed a semantic-enable information
retrieval mechanism that handles the processing, recognition, extraction, extension and
matching of content semantics in order to:
• analyze and determine the semantic features of content, develop a semantic pattern that represents the semantic features of the content, and structuralize and materialize the semantic features;
• analyze the user’s query and extend its implied semantics through semantic extension, so as to identify more semantic features for matching; and
• generate contents with approximate semantics by matching against the extended query, to provide correct contents to the querist.
This mechanism is capable of alleviating the traditional problems of keyword search
and enables the user to perform semantic-based queries and search for the required
information, thereby improving the reuse and sharing of information.
7. Future work: an ontology for video documentaries
Multimedia ontologies (especially MPEG-7-based ontologies) have the potential to
increase the interoperability of applications producing and consuming multimedia
annotations. Hunter (2003) provided the first attempt to model parts of MPEG-7 in
RDFS, later integrated with the ABC model. Tsinaraki et al. (2004) start from the core
of this ontology and extend it to cover the full Multimedia Description Scheme (MDS)
part of MPEG-7, in an OWL DL ontology. Isaac and Troncy (2004) proposed a core
audio-visual ontology inspired by several terminologies such as MPEG-7, TV Anytime
or ProgramGuideML, while Garcia and Celma (2005) produced the first complete
MPEG-7 ontology, automatically generated using a generic mapping from XSD to
OWL. All these methods perform a one-to-one translation of MPEG-7 types into OWL
concepts and properties. This translation, however, does not guarantee that the intended
semantics of MPEG-7 is fully captured and formalized. On the contrary, the syntactic
interoperability and conceptual ambiguity problems remain.
A video documentary ontology can increase the interoperability of documentary
authoring tools. It can represent documentary concepts and their relationships that will
help to retrieve the required results. From another perspective, the application of
multimedia reasoning techniques on top of semantic multimedia annotations can make a
multimedia authoring application more intelligent (Van Ossenbruggen et al., 2004).
Currently, we are engaged in representing the complete media semantics of a documentary
using the Web Ontology Language (OWL) (Smith et al., 2004), with the aim of describing a
complete video documentary ontology. In the near future, we will examine how we can
raise the quality of documentary annotation and improve the usability of content-based
video search and retrieval systems. Figure 13 depicts a portion of our ontology for documentaries.
Figure 13. A part of the domain ontology for documentaries
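As a speculative sketch of what a fragment of this ontology could look like in OWL’s RDF/XML serialization (the class and property names are our assumptions, loosely based on the schemata of Table II):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://example.org/documentary-ontology">
  <owl:Class rdf:ID="Documentary"/>
  <owl:Class rdf:ID="DocumentaryClip"/>
  <owl:Class rdf:ID="DocumentaryPiece"/>
  <!-- A documentary is composed of clips; a clip is composed of pieces. -->
  <owl:ObjectProperty rdf:ID="hasClip">
    <rdfs:domain rdf:resource="#Documentary"/>
    <rdfs:range rdf:resource="#DocumentaryClip"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasPiece">
    <rdfs:domain rdf:resource="#DocumentaryClip"/>
    <rdfs:range rdf:resource="#DocumentaryPiece"/>
  </owl:ObjectProperty>
  <!-- Temporal relation between pieces (cf. Subsection 5.3). -->
  <owl:ObjectProperty rdf:ID="precedes">
    <rdfs:domain rdf:resource="#DocumentaryPiece"/>
    <rdfs:range rdf:resource="#DocumentaryPiece"/>
  </owl:ObjectProperty>
</rdf:RDF>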
8. Conclusions
Tools for automatically understanding video are required in the documentary domain.
Semantics-based annotations will break the traditional linear manner of accessing and
browsing documentaries and will support vignette-oriented access of audio and video.
In this paper, we have presented a framework for the modeling, annotation, and
retrieval of media documents, applied to the domain of documentary. Using a basic set
of 14 semantic description schemes, we demonstrated how a documentary video can be
annotated and how this information can be used for retrieval to support
documentary design. We emphasized tools and technologies for the manual annotation
of the documentary media objects. Flexible annotation facilities are required to
facilitate documentary creativity by way of semantic networks because the annotation
process is dynamic and annotations can grow over time. We have proposed a flexible
organization of media content description and the related media data. This
organization requires the adaptable construction in the form of a semantic network.
The proposed concept features three significant functions, which make it suitable as a
platform for supporting the needs of documentary production:
(1) It provides semantic and technical memory structures (i.e. information nodes)
with the capability to change and grow, allowing an ongoing task specific
process of inspection and interpretation of source material.
(2) Our approach facilitates the dynamic use of audio-visual material using links,
enabling the connection from multi-layered information nodes to data on a
temporal, spatial and spatial-temporal level. Moreover, since the description of
media content holds constant for the associated time interval, we are now in the
position to handle multiple content descriptions for the same media unit and
also to handle gaps.
(3) It enables the semantic connection between information nodes using typed
relations, thus structuring the information space on a semantic as well as syntactic
level.
We believe that our approach (audio-visual strategy) can be used for improving training
and education in documentary communication and to this end we have also indicated
future efforts to create an ontology for video documentaries with enhanced annotation.
References
Agius, H. and Angelides, M. (2005), “COSMOS-7: video-oriented MPEG-7 scheme for modeling
and filtering of semantic content”, The Computer Journal, Vol. 48 No. 5, pp. 545-62.
Aguierre Smith, T.G. and Davenport, G. (1992), “The stratification system: a design environment
for random access video”, Proceedings of the ACM Workshop on Networking and
Operating System Support for Digital Audio and Video, San Diego, CA, Lecture Notes in
Computer Science, Vol. 712, Springer, Berlin, pp. 250-61.
Aubert, O., Champin, P.-A., Prié, Y. and Richard, B. (2008), “Canonical processes in active reading
and hypervideo production”, Multimedia Systems Journal, Vol. 14 No. 6, pp. 427-33.
Barnouw, E. (1993), Documentary: A History of the Non-fiction Film, Oxford University Press,
Oxford.
Benitez, A., Paek, S., Chang, S.-F., Puri, A., Huang, Q., Smith, J., Li, C.-S., Bergman, L. and Judice,
C. (2000), “Object-based multimedia content description schemes and applications for
MPEG-7”, Signal Processing: Image Communication, Vol. 16 Nos 1/2, pp. 235-69.
Bocconi, S., Nack, F. and Hardman, L. (2008), “Automatic generation of matter-of-opinion video
documentaries”, Journal of Web Semantics, Vol. 6, pp. 139-50.
Brachman, R.J. and Levesque, H.J. (1983), Readings in Knowledge Representation, Morgan
Kaufmann, San Mateo, CA.
Burger, T. (2008), “The need for formalizing media semantics in the games and entertainment
industry”, Journal of Universal Computer Science, Vol. 14 No. 10, pp. 1775-91.
Chakravarthy, A., Ciravegna, F. and Lanfranchi, V. (2006), “Cross-media document annotation
and enrichment”, Proceedings of the 1st Semantic Authoring and Annotation Workshop
(SAAW 2006), Athens, GA, November 6.
Chakravarthy, A., Beales, R., Matskanis, N. and Yang, X. (2009), “OntoFilm: a core ontology for
film production”, in Chua, T.-S., Kompatsiaris, Y., Mérialdo, B., Haas, W., Thallinger, G.
and Bailer, W. (Eds), Proceedings of the 4th International Conference on Semantic and
Digital Media Technologies (SAMT 2009), Lecture Notes in Computer Science, Vol. 5887,
Springer, Berlin, pp. 177-81.
Chen, M.-Y., Chu, H.-C. and Chen, Y.M. (2010), “Developing a semantic-enable information
retrieval mechanism”, Expert Systems with Applications, Vol. 37 No. 1, pp. 322-40.
Choi, I. (2010), “From tradition to emerging practice: a hybrid computational production model
for interactive documentary”, Entertainment Computing, Vol. 1 Nos 3/4, pp. 105-17.
Costa, M., Correia, N. and Guimaraes, N. (2002), “Annotations as multiple perspectives of video
content”, Proceedings of the ACM Conference on Multimedia, San Francisco, CA,
2-7 November, pp. 283-6.
Dorai, C. and Venkatesh, S. (2001), “Computational media aesthetics: finding meaning beautiful”,
IEEE Multimedia, Vol. 8 No. 4, pp. 10-12.
Dorai, C., Mauthe, A., Nack, F., Rutledge, L., Sikora, T. and Zettl, H. (2002), “Media semantics: who
needs it and why?”, Proceedings of Multimedia ’02, December 1-6, Juan-les-Pins, pp. 580-3.
Eco, U. (1997), A Theory of Semiotics, Macmillan, London.
Garcia, R. and Celma, O. (2005), “Semantic integration and retrieval of multimedia metadata”,
Proceedings of the Fifth International Workshop on Knowledge Markup and Semantic
Annotation, 7 November, Galway.
Geurts, J., Bocconi, S., van Ossenbruggen, J. and Hardman, L. (2003), “Towards ontology-driven
discourse: from semantic graphs to multimedia presentations”, in Fensel, D., Sycara, K.
and Mylopoulos, J. (Eds), Proceedings of the Second International Semantic Web
Conference (ISWC 2003), Sanibel Island, FL, 20-23 October, Springer, Berlin.
Hardman, L., Obrenović, Ž., Nack, F., Kerhervé, B. and Piersol, K. (2008), “Canonical processes of
semantically annotated media production”, Multimedia Systems, Vol. 14, pp. 327-40.
Agius, H.W. and Angelides, M.C. (2001), “Modeling content for semantic level querying of
multimedia”, Multimedia Tools and Applications, Vol. 15 No. 1, pp. 5-37.
Hunter, J. (2003), “Enhancing the semantic interoperability of multimedia through a core
ontology”, IEEE Transactions: Circuits and Systems for Video Technology, Vol. 13 No. 1,
pp. 49-58.
IBM (n.d.), “alphaWorks community, VideoAnnEx annotation tool”, available at: www.
alphaworks.ibm.com/tech/videoannex
IMDb (n.d.), “Word Travels”, available at: www.imdb.com/title/tt1392723/
Isaac, A. and Troncy, R. (2004), “Designing and using an audio-visual description core ontology”,
paper presented at the Workshop on Core Ontologies in Ontology Engineering, 5-8
October, Whittlebury.
Korsakow (n.d.), “Korsakow system”, available at: www.korsakow.com/ksy/index.html
Little, S., Geurts, J. and Hunter, J. (2002), “Dynamic generation of intelligent multimedia
presentations through semantic inferencing”, Proceedings of the 6th European Conference
on Research and Advanced Technology for Digital Libraries, Pontifical Gregorian
University, Rome, Springer, Berlin.
Liu, D. and Chen, T. (2009), “Video retrieval based on object discovery”, Computer Vision and
Image Understanding, Vol. 113 No. 3, pp. 397-404.
Martinez, J., Koenen, R. and Pereira, F. (2002), “MPEG-7 – The generic multimedia content
description standard Part 1”, IEEE MultiMedia Magazine, Vol. 9 No. 2, pp. 78-87.
Mateas, M. (2000), “Generation of ideologically-biased historical documentaries”, Proceedings of
the 17th National Conference on Artificial Intelligence and Innovative Applications of
Artificial Intelligence Conference (AAAI-00), Austin, TX, pp. 36-42.
Murtaugh, M. (1996), “The automatist storytelling system”, PhD thesis, Massachusetts Institute
of Technology, available at: http://alumni.media.mit.edu/~murtaugh/thesis/
Nack, F. and Parkes, A. (1997), “Towards the automated editing of theme-oriented video
sequences”, Applied Artificial Intelligence, Vol. 11 No. 4, pp. 331-66.
Nack, F. and Putz, W. (2004), “Saying what it means: semi-automated (news) media annotation”,
Multimedia Tools and Applications, Vol. 22 No. 3, pp. 263-302.
Nack, F., van Ossenbruggen, J. and Hardman, L. (2005), “That obscure object of desire: multimedia
metadata on the web (Part II)”, IEEE Multimedia, Vol. 12 No. 1, pp. 54-63.
Nichols, B. (2001), “What types of documentary are there?”, Introduction to Documentary,
Indiana University Press, Bloomington, IN, pp. 99-138.
Parkes, A.P. (1989), “Settings and the settings structure: the description and automated
propagation of networks for perusing videodisk image states”, in Belkin, N.J. and
Rijsbergen, C.J. (Eds), Proceedings of SIG Information Retrieval ’89, Cambridge, MA, ACM
Press, New York, NY, pp. 229-38.
Pereira, F., Vetro, A. and Sikora, T. (2008), “Multimedia retrieval and delivery: essential metadata
challenges and standards”, Proceedings of the IEEE, Vol. 96 No. 4, pp. 721-44.
Ramadoss, B. and Rajkumar, K. (2007), “Semi-automated annotation and retrieval of dance media
objects”, Cybernetics and Systems, Vol. 38 No. 4, pp. 349-79.
Ren, W., Singh, S., Singh, M. and Zhu, Y.S. (2009), “State-of-the-art on spatio-temporal
information-based video retrieval”, Pattern Recognition, Vol. 42 No. 2, pp. 267-82.
Rijksmuseum (n.d.), available at: www.rijksmuseum.nl
Rincon, M. and Martinez-Cantos, J. (2007), “An annotation tool for video understanding”,
in Moreno-Díaz, R., Pichler, F. and Quesada Arencibia, A. (Eds), Proceedings of the
11th International Conference on Computer Aided Systems Theory and Technology
(EUROCAST 2007), Las Palmas, 12-16 February, Lecture Notes in Computer Science,
Vol. 4739, Springer, Berlin, pp. 701-8.
Rosenthal, A. and Corner, J. (2005), New Challenges for Documentary, 2nd ed., Manchester
University Press, Manchester.
Ryu, J., Sohn, Y. and Kim, M. (2002), “MPEG-7 metadata authoring tool”, Proceedings of the ACM
Conference on Multimedia, pp. 267-70.
Salembier, P. and Smith, J. (2002), “Overview of MPEG-7 multimedia description schemes and
schema tools”, in Manjunath, B.S., Salembier, P. and Sikora, T. (Eds), Introduction to
MPEG-7: Multimedia Content Description Interface, Wiley, Chichester.
Sarkar, A. and Tripasai, W. (2002), “Learning verb argument structure from minimally
annotated corpora”, Proceedings of the 19th International Conference on Computational
Linguistics, Taipei, August 24-September 1, Vol. 1, pp. 1-7.
Smith, M.K., Welty, C. and McGuinness, D.L. (2004), “OWL web ontology language,
W3C recommendation”, available at: www.w3.org/TR/owl-guide/
Soft Cinema (n.d.), available at: www.softcinema.net
Sourceforge.net (n.d.), “ViPER-GT annotation tool”, available at: http://viper-toolkit.sourceforge.net
The Language Archive (n.d.), “ELAN annotation tool”, available at: www.lat-mpi.eu/tools/elan
Tien, T.T. and Cecile, R. (2003), “Multimedia modeling using MPEG-7 for authoring multimedia
integration”, Proceedings of the ACM Conference on Multimedia Information Retrieval,
pp. 171-8.
Tsinaraki, C., Polydoros, P. and Christodoulakis, S. (2004), “Integration of OWL ontologies in
MPEG-7 and TVAnytime compliant semantic indexing”, Proceedings of the 16th
International Conference on Advanced Information Systems Engineering (CAiSE 2004),
Riga, June 7-11, pp. 143-61.
(The) University of Edinburgh (n.d.), “CAVIAR: Context Aware Vision using Image-based
Active Recognition”, available at: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
Van Ossenbruggen, J., Nack, F. and Hardman, L. (2004), “That obscure object of desire:
multimedia metadata on the Web (Part I)”, IEEE Multimedia, Vol. 11 No. 4, pp. 38-48.
About the author
Dimitris Kanellopoulos holds a PhD in multimedia communications from the Department of
Electrical and Computer Engineering of the University of Patras, Greece. He is a member of the
Educational Software Development Laboratory in the Department of Mathematics at the
University of Patras. His research interests include multimedia communications, knowledge
representation, intelligent systems, and Web engineering. He has authored many papers in
international journals and conferences in these areas. He serves as a member of the editorial
boards of ten academic journals. Dimitris Kanellopoulos can be contacted at:
d_kan2006@yahoo.gr
More Related Content

Similar to Reference 5

AUTHENTICITY AND OAIS. THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTS
AUTHENTICITY AND OAIS.THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTSAUTHENTICITY AND OAIS.THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTS
AUTHENTICITY AND OAIS. THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTSDigitalPreservationEurope
 
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE ijmpict
 
IRJET- Segmenting, Multimedia Summarizing and Query based Retrieval of New...
IRJET- 	  Segmenting, Multimedia Summarizing and Query based Retrieval of New...IRJET- 	  Segmenting, Multimedia Summarizing and Query based Retrieval of New...
IRJET- Segmenting, Multimedia Summarizing and Query based Retrieval of New...IRJET Journal
 
Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1STIinnsbruck
 
Multimedia information retrieval using artificial neural network
Multimedia information retrieval using artificial neural networkMultimedia information retrieval using artificial neural network
Multimedia information retrieval using artificial neural networkIAESIJAI
 
Personalized Real-Time Virtual Tours in Places with Cultural Interest
Personalized Real-Time Virtual Tours in Places with Cultural InterestPersonalized Real-Time Virtual Tours in Places with Cultural Interest
Personalized Real-Time Virtual Tours in Places with Cultural InterestUniversity of Piraeus
 
Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...
Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...
Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...Irina Pata
 
Video-based social network analysis data collection in sport -Mariusz Karbowski
Video-based social network analysis data collection in sport -Mariusz Karbowski Video-based social network analysis data collection in sport -Mariusz Karbowski
Video-based social network analysis data collection in sport -Mariusz Karbowski BIZNES SOCIAL NETWORK ANALYSIS
 
Towards the digital_archiving_sysytem_for_field_ar (1)
Towards the digital_archiving_sysytem_for_field_ar (1)Towards the digital_archiving_sysytem_for_field_ar (1)
Towards the digital_archiving_sysytem_for_field_ar (1)Nadeeka Rathnabahu
 
Dp Geosc Info Presentation Final Version 2
Dp Geosc Info Presentation Final Version 2Dp Geosc Info Presentation Final Version 2
Dp Geosc Info Presentation Final Version 2Smita Chandra
 
IRJET- Multimedia Summarization and Retrieval of News Broadcast
IRJET- Multimedia Summarization and Retrieval of News BroadcastIRJET- Multimedia Summarization and Retrieval of News Broadcast
IRJET- Multimedia Summarization and Retrieval of News BroadcastIRJET Journal
 
Exploring the Effect of Web Based Communications on Organizations Service Qua...
Exploring the Effect of Web Based Communications on Organizations Service Qua...Exploring the Effect of Web Based Communications on Organizations Service Qua...
Exploring the Effect of Web Based Communications on Organizations Service Qua...IOSR Journals
 
October 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdfOctober 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdfijujournal
 
Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...
Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...
Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...Diego Armando
 
A Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidA Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidUniversity of Piraeus
 

Similar to Reference 5 (20)

AUTHENTICITY AND OAIS. THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTS
AUTHENTICITY AND OAIS.THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTSAUTHENTICITY AND OAIS.THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTS
AUTHENTICITY AND OAIS. THE CASPAR MODEL AND THE INTERPARES PRINCIPLES & OUTPUTS
 
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
 
IRJET- Segmenting, Multimedia Summarizing and Query based Retrieval of New...
IRJET- 	  Segmenting, Multimedia Summarizing and Query based Retrieval of New...IRJET- 	  Segmenting, Multimedia Summarizing and Query based Retrieval of New...
IRJET- Segmenting, Multimedia Summarizing and Query based Retrieval of New...
 
Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1
 
Multimedia information retrieval using artificial neural network
Multimedia information retrieval using artificial neural networkMultimedia information retrieval using artificial neural network
Multimedia information retrieval using artificial neural network
 
Personalized Real-Time Virtual Tours in Places with Cultural Interest
Personalized Real-Time Virtual Tours in Places with Cultural InterestPersonalized Real-Time Virtual Tours in Places with Cultural Interest
Personalized Real-Time Virtual Tours in Places with Cultural Interest
 
IVACS 2010
IVACS 2010IVACS 2010
IVACS 2010
 
Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...
Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...
Irina Pata_VIDEO LANDSCAPE. FILM AS A TOOL FOR LANDSCAPE ANALYSIS AND [RE]PRE...
 
Getaneh Alemu
Getaneh AlemuGetaneh Alemu
Getaneh Alemu
 
Video-based social network analysis data collection in sport -Mariusz Karbowski
Video-based social network analysis data collection in sport -Mariusz Karbowski Video-based social network analysis data collection in sport -Mariusz Karbowski
Video-based social network analysis data collection in sport -Mariusz Karbowski
 
On Annotation of Video Content for Multimedia Retrieval and Sharing
On Annotation of Video Content for Multimedia  Retrieval and SharingOn Annotation of Video Content for Multimedia  Retrieval and Sharing
On Annotation of Video Content for Multimedia Retrieval and Sharing
 
ICAME 2010
ICAME 2010ICAME 2010
ICAME 2010
 
Towards the digital_archiving_sysytem_for_field_ar (1)
Towards the digital_archiving_sysytem_for_field_ar (1)Towards the digital_archiving_sysytem_for_field_ar (1)
Towards the digital_archiving_sysytem_for_field_ar (1)
 
Dp Geosc Info Presentation Final Version 2
Dp Geosc Info Presentation Final Version 2Dp Geosc Info Presentation Final Version 2
Dp Geosc Info Presentation Final Version 2
 
IRJET- Multimedia Summarization and Retrieval of News Broadcast
IRJET- Multimedia Summarization and Retrieval of News BroadcastIRJET- Multimedia Summarization and Retrieval of News Broadcast
IRJET- Multimedia Summarization and Retrieval of News Broadcast
 
Exploring the Effect of Web Based Communications on Organizations Service Qua...
Exploring the Effect of Web Based Communications on Organizations Service Qua...Exploring the Effect of Web Based Communications on Organizations Service Qua...
Exploring the Effect of Web Based Communications on Organizations Service Qua...
 
October 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdfOctober 2023-Top Cited Articles in IJU.pdf
October 2023-Top Cited Articles in IJU.pdf
 
ITIICCCS_2016_paper
ITIICCCS_2016_paperITIICCCS_2016_paper
ITIICCCS_2016_paper
 
Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...
Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...
Artigo - Aplicações Interativas para TV Digital: Uma Proposta de Ontologia de...
 
A Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidA Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over Android
 

More from amirahjuned

Qualitative Journal 2
Qualitative Journal 2Qualitative Journal 2
Qualitative Journal 2amirahjuned
 
Qualitative Journal 1
Qualitative Journal 1Qualitative Journal 1
Qualitative Journal 1amirahjuned
 
Quantitative Journal 3
Quantitative Journal 3Quantitative Journal 3
Quantitative Journal 3amirahjuned
 
Quantitative Journal 2
Quantitative Journal 2Quantitative Journal 2
Quantitative Journal 2amirahjuned
 
Quantitative Journal 1
Quantitative Journal 1Quantitative Journal 1
Quantitative Journal 1amirahjuned
 
Research proposal (presentation 1)
Research proposal (presentation 1)Research proposal (presentation 1)
Research proposal (presentation 1)amirahjuned
 
Research proposal Presentation 1
Research proposal Presentation 1Research proposal Presentation 1
Research proposal Presentation 1amirahjuned
 
Research Proposal Presentation 1
Research Proposal Presentation 1Research Proposal Presentation 1
Research Proposal Presentation 1amirahjuned
 
Draft proposal (chapter 2)
Draft proposal (chapter 2)Draft proposal (chapter 2)
Draft proposal (chapter 2)amirahjuned
 
W alter j , lansu
W alter j , lansuW alter j , lansu
W alter j , lansuamirahjuned
 
Example of journal
Example of journalExample of journal
Example of journalamirahjuned
 
A study of teaching listening to
A study of teaching listening toA study of teaching listening to
A study of teaching listening toamirahjuned
 
The effect of films with and without subtitles on listening
The effect of films with and without subtitles on listeningThe effect of films with and without subtitles on listening
The effect of films with and without subtitles on listeningamirahjuned
 

More from amirahjuned (20)

Reference 6
Reference 6Reference 6
Reference 6
 
Reference 4
Reference 4Reference 4
Reference 4
 
Reference 3
Reference 3Reference 3
Reference 3
 
Reference 2
Reference 2Reference 2
Reference 2
 
Reference 1
Reference 1Reference 1
Reference 1
 
Qualitative Journal 2
Qualitative Journal 2Qualitative Journal 2
Qualitative Journal 2
 
Qualitative Journal 1
Qualitative Journal 1Qualitative Journal 1
Qualitative Journal 1
 
Quantitative Journal 3
Quantitative Journal 3Quantitative Journal 3
Quantitative Journal 3
 
Quantitative Journal 2
Quantitative Journal 2Quantitative Journal 2
Quantitative Journal 2
 
Quantitative Journal 1
Quantitative Journal 1Quantitative Journal 1
Quantitative Journal 1
 
Research proposal (presentation 1)
Research proposal (presentation 1)Research proposal (presentation 1)
Research proposal (presentation 1)
 
Research proposal Presentation 1
Research proposal Presentation 1Research proposal Presentation 1
Research proposal Presentation 1
 
Research Proposal Presentation 1
Research Proposal Presentation 1Research Proposal Presentation 1
Research Proposal Presentation 1
 
Draft proposal (chapter 2)
Draft proposal (chapter 2)Draft proposal (chapter 2)
Draft proposal (chapter 2)
 
Ching kun hsu1
Ching kun hsu1Ching kun hsu1
Ching kun hsu1
 
W alter j , lansu
W alter j , lansuW alter j , lansu
W alter j , lansu
 
Draft proposal
Draft proposalDraft proposal
Draft proposal
 
Example of journal
Example of journalExample of journal
Example of journal
 
A study of teaching listening to
A study of teaching listening toA study of teaching listening to
A study of teaching listening to
 
The effect of films with and without subtitles on listening
The effect of films with and without subtitles on listeningThe effect of films with and without subtitles on listening
The effect of films with and without subtitles on listening
 

Reference 5

  • 1. Semantic annotation and retrieval of documentary media objects Dimitris Kanellopoulos Educational Software Development Laboratory, Department of Mathematics, University of Patras, Rio Patras, Greece Abstract Purpose – This paper aims to propose a system for the semantic annotation of audio-visual media objects, which are provided in the documentary domain. It presents the system’s architecture, a manual annotation tool, an authoring tool and a search engine for the documentary experts. The paper discusses the merits of a proposed approach of evolving semantic network as the basis for the audio-visual content description. Design/methodology/approach – The author demonstrates how documentary media can be semantically annotated, and how this information can be used for the retrieval of the documentary media objects. Furthermore, the paper outlines the underlying XML schema-based content description structures of the proposed system. Findings – Currently, a flexible organization of documentary media content description and the related media data is required. Such an organization requires the adaptable construction in the form of a semantic network. The proposed approach provides semantic structures with the capability to change and grow, allowing an ongoing task-specific process of inspection and interpretation of source material. The approach also provides technical memory structures (i.e. information nodes), which represent the size, duration, and technical format of the physical audio-visual material of any media type, such as audio, video and 3D animation. Originality/value – The proposed approach (architecture) is generic and facilitates the dynamic use of audio-visual material using links, enabling the connection from multi-layered information nodes to data on a temporal, spatial and spatial-temporal level. It enables the semantic connection between information nodes using typed relations, thus structuring the information space on a semantic as well as syntactic level. Since the description of media content holds constant for the associated time interval, the proposed system can handle multiple content descriptions for the same media unit and also handle gaps. The results of this research will be valuable not only for documentary experts but for anyone with a need to manage dynamically audiovisual content in an intelligent way. Keywords Documentary, Semantic annotation, Video, Temporal and spatial levels of audiovisual data, Content management, Audiovisual media, Multimedia Paper type Research paper 1. Introduction In the last few years, the general public’s interest in documentaries has grown enormously. A documentary is the presentation of factual events, often consisting of footage recorded at the time and place of their occurrence and generally accompanied by a narrator (Rosenthal and Corner, 2005). Documentary is a media work category, applied to photography, film and television. It has been developed internationally across a wide range of formats, including the use of dramatization, observational sequences and various combinations of interview material with images that portray the real with deferent degrees of referentiality and aesthetic crafting. Documentaries often depict various important topics (e.g. 
animal life, historical events, tourist attractions etc) by The current issue and full text archive of this journal is available at www.emeraldinsight.com/0264-0473.htm Documentary media objects 721 Received October 2011 Revised February 2012 Accepted March 2012 The Electronic Library Vol. 30 No. 5, 2012 pp. 721-747 q Emerald Group Publishing Limited 0264-0473 DOI 10.1108/02640471211275756
  • 2. mixing photos and videos with commentaries and opinions from experts. All these elements are organized in narrative form. The definition of documentary often undertakes a discursive path. Two factors play consistently in various definitions: (1) reality is captured in some forms of documents; and (2) the documents are subjected to assemblage to serve a larger context. For the definition of documentary, we adopt the simplest task definition, that of Vertov: “to capture fragments of reality and combine them meaningfully” (Barnouw, 1993, p. 55). It can be said that making documentaries is not a piece of science. Documentaries can relate data from science, but they are not scientific reports. They mix science, narrative, images, while the filmmakers’ point of view affects the way these are mixed. For example, a travel documentary is a documentary film (or television program) that describes travel or tourist attractions in a non-commercial way. It is not a scientific report but it is based on knowledge about tourist attractions. A representative travel documentary is Word Travels (IMDb, n.d.) that follows the lives of two young professional travel writers (Robin Esrock and Julia Dimon), as they journey around the world in search of stories to experience, write about, and file for their editors. According to Nichols (2001) in documentary film and video, we can identify six modes of representation that function something like sub-genres of the documentary film genre itself: poetic, expository, participatory, observational, reflexive, and performative. Table I shows the main characteristics and deficiencies of these documentary modes. Modern lightweight digital video cameras and computer based-editing have really aided documentary makers. The first film to take full advantage of this change was Martin Kunert and Eric Manes’ Voices of Iraq, where 150 digital video cameras were sent to Iraq during the war and passed out to Iraqis to record themselves. Multimedia technology allows text, graphics, photos, and audio to be transmitted effectively and Documentary mode Main characteristics Deficiencies Poetic documentary (1920s) Reassemble fragments of the world poetically Lack of specificity, too abstract Expository documentary (1920s) Directly address issues in the historical world Overly didactic Observational documentary (1960s) Eschew commentary and reenactment; observe things as they happen Lack of history, context Participatory documentary (1960s) Interview or interact with subjects; use archival film to retrieve history Excessive faith in witnesses, naive history, too intrusive Reflexive documentary (1980s) Question documentary form, defamiliarize the other modes Too abstract, lose sight of actual issues Performative documentary (1980s) Stress subjective aspects of a classically objective discourse Loss of emphasis on objectivity may relegate such films to the avant-garde; “excessive” use of style Table I. Documentary modes EL 30,5 722
  • 3. rapidly across media platforms. Media organizations must cope with multimedia changes that move exponentially to the next competing delivery device. Nowadays, there is a potentially wide range of applications in the media domain such as search, filtering of information, media understanding (surveillance, intelligent vision, smart cameras etc.) or media conversions (speech to text, picture to speech, visual transcoding etc). Understanding semantics and meaning of documentaries is directly needed (Choi, 2010). Finding the bits of interest (the important part of a documentary) becomes increasingly difficult, frustrating, and a time consuming task. Internet users need an intelligent search engine for performing complex media search and help users finding media chunks based on semantics in media itself (Dorai et al., 2002). However, media is so rich in its content variety that it will never sufficiently be described by text or words (Dorai and Venkatesh, 2001). Besides, humans must take the time to annotate the media chunks. Media information systems for documentaries should incorporate mechanisms that interpret, manipulate and generate visual media as well as audible information. A media infrastructure for documentaries should manipulate self-sufficient components of documentaries, which can be used in any given production. In order to use such an independent media item, it is required to extract the relationship between the signs of the audio-visual information unit and the semantics they represent (Eco, 1997). As a result, media information systems for documentaries such as Terminal_Time (Mateas, 2000) should manage independent media objects and their representations for use in many different productions. Therefore, we need tools that utilize human actions to extract the important syntactic, semantic and semiotics aspects of its content (Brachman and Levesque, 1983) in order descriptions (based on a formal language) can be constructed. The increasing amount of various documentaries and their combinatorial use requires the annotation of media during their production. Media annotation and querying for documentaries is still a major challenge, as the gap between the documentary features and the existing media tools is wide. In the last two decades, many authoring tools have been proposed for multimedia data (Tien and Cecile, 2003; Ryn et al., 1989). These authoring tools are either application dependent or provide insufficient authoring features. High-level annotation facilities like annotation of objects, time, location, events etc can be provided by existing video annotation tools such as Vannotator (Costa et al., 2002), IBM VideoAnnEx (IBM, n.d.), ELAN (The Language Archive, n.d.), CAVIAR (The University of Edinburgh, n.d.), and ViPER-GT (Sourcegorge.net, n.d.). Rincon and Martinez-Cantos (2007) describe a video annotation tool (called AVISA) for video understanding. They analyze the features that must be present in a video annotation tool for video understanding. However, these features need to be complemented with finer level annotation methods that are required for the video documentaries. Automatic video generation systems use descriptions (annotations) of the media items in order to make decisions about how to create a video sequence. The structure of annotations is composed of two parts: (1) The structure of the description (e.g. a documentary film can be described by fields, such as title, director). (2) The structure of the values used to fill the description (e.g. 
“The Civil War” can be the value of the field title). According to Bocconi et al. (2008) there are three different types of description structures: Documentary media objects 723
  • 4. (1) Keywords-based description structures (or K-annotations), in which each item is associated with a list of words that represent the item’s content. Representative video generation systems that use K-annotations are Lev Manovich’s Soft Cinema (n.d.) and the Korsakow System (Korsakow, n.d.) , systems that edit in real-time by selecting media items from a database. ConTour (Murtaugh, 1996) is another indicative system that supports evolving documentaries, i.e. documentaries that could incorporate new media items as soon as they were made. (2) Properties-based description schemes (or P-annotations) in which items are annotated with property-value pairs. Representative system of this category is SemInfo (Little et al., 2002). (3) Structure-based on relations (or R-annotations). Here, items are annotated with property-value pairs as in P-annotations only that some of these values are references to other annotations. A representative system is DISC (Geurts et al., 2003), which is a multimedia presentation generation system for the domain of cultural heritage. DISC uses the annotated multimedia repository of the Rijksmuseum (n.d.) to create multimedia presentations. Benitez et al. (2000) presented description schemes (DSs) for image, video, multimedia, home media, and archive content proposed to the MPEG-7 standard. They used the XML to illustrate and exemplify their description schemes by presenting applications that already use the proposed structures. These applications are the visual apprentice, the AMOS-search system, a multimedia broadcast news browser, a storytelling system, and an image meta-search engine, MetaSEEk. The AUTEUR system (Nack and Parkes, 1997) synchronizes automatic story generation for visual media with the stylistic requirements of narrative and medium related presentation. The AUTEUR system consists of an ontological representation of narrative elements such as actions, events, and emotional and visual codes, based on a semantic net of conceptual structures related via six types of semantic links (e.g. synonym, sub-action, opposition, ambiguity, association, conceptual). A coherent action-reaction dynamic is provided by the introduction of three event phases, i.e. motivation, realization and resolution. The essential categories for the structures are action, character, object, relative position, screen position, geographical space, functional space and time. The textual representation of this ontology describes semantic, temporal and relational features of video in hierarchically organized structures, which overcomes the limitations of keyword-based approaches. We believe that formal semantics can support the annotation, analysis, retrieval or reasoning about multimedia assets in the documentary industry. The proliferation of documentaries and their applications require media annotation that bridges the gap between documentary technology and media semantics. In line with this, Dorai and Venkatesh (2001, p. 10) state: A serious need exists to develop algorithms and technologies that can annotate content with deep semantics and establish semantic connections between media’s form and function, for the first time letting users access indexed media and navigate content in unforeseeable and surprising ways. The aim of this paper is to propose an agent-oriented programming approach using a framework for describing the inherent semantics of the documentaries pieces. In EL 30,5 724
  • 5. agent-oriented programming, agent-oriented objects typically have just one method, with a single parameter. This parameter is a sort of message that is interpreted by the receiving object, or “agent”, in a way specific to that object or class of objects. Documentaries pieces are unique to video documentaries. For this reason, we have created a domain specific representation for the documentary pieces to improve the retrieval accuracy of the documentary video queries. The remainder of the paper is structured as follows. In Section 2, we discuss issues concerning documentary authoring, while in Section 3 we present the semantics of documentary media. In Section 4 we describe the system architecture. In Section 5, we present our approach for implementing the repository for documentaries; our semantic network based approach for the data storage and management and we illustrate the proposed XML schema-based representational structures. In Section 6, we explain the use of the proposed system through the tools for annotation, semi-automatic authoring and semantic retrieval that we have implemented for the documentary video environments. Finally, in Section 7 we conclude the paper and give directions for further work. 2. Documentary authoring The conventional understanding of documentary production involves a three-phase workflow: (1) pre-production; (2) production; and (3) post-production. Figure 1 illustrates a traditional documentary production model. The production model formalizes a cyclic process as opposed to a linear workflow. Pre-production is a phase of research and ideation where visions are selectively audited through sketches mostly in text and graphical forms. Production and Post-production are the phases of iterative processes for gathering and assessing media resources. Screening is a main method for assessment through daily production and plays an important role in assessments of daily results and edited sequences, determining further materials needed and methods for acquiring the materials. In particular, a documentary screening is the displaying of a documentary referring to a special showing as part of a documentary’s production and release cycle. The different types of screenings follow here in their order within a documentary’s development: (1) Test screening. For early edits of a documentary, informal test screenings are shown to small target audiences to judge if a documentary will require editing, reshooting or rewriting. (2) Focus group screenings are formal test screenings of a documentary with very detailed documentation of audience responses. (3) Critic screenings are held for national and major market critics well in advance of print and television production-cycle deadlines, and are usually by invitation only. (4) Public preview screenings may serve as final test screenings used to adjust marketing strategy (radio and TV promotion, etc) or the documentary itself. (5) A sneak preview is an unannounced documentary screening before formal release, generally with the usual charge for admission. Documentary media objects 725
  • 6. Actually, media production for documentaries is a complex, resource demanding process that provides a multidimensional network of relationships among the multimedia information. Documentary authoring is based on the fundamental processes of media or hypervideo production. Aubert et al. (2008) identified these fundamental (or canonical) processes that can be supported in semantically aware media production tools. According to Aubert et al. (2008) these processes are: . Premeditate (1) Inscription of marks/organization/browsing. The premeditate process takes place in every step of the authoring activity. Input: thoughts of the author. Output: necessary schemas, annotations, queries or views. . Create (2) This process exploits existing audiovisual documents. . Package (3) Inscription of marks/organization/browsing. The metadata structure and accompanying queries and views are present, and can be materialized package. . Annotate (4) Inscription of marks. Creation of the annotations, with spatio-temporal links to the media assets. Input: Media sources. Output: annotation structure. . Query (5) Organization. Queries allow selecting appropriate annotations. Input: basic elements. Output: basic elements matching a specify query. . Construct message (6) Organization. Structuration of the presentation of data. Input: the ideas from the premeditate process, the annotation structure, queries. Output: draft of views. Figure 1. Traditional documentary production model EL 30,5 726
  • 7. . Organize (7) Organization. Definition of views to render the selected annotations. Input: basic elements. Output: view definitions. . Publish (8) Browsing, Publishing. Content packaging-publishing, means generation of documents from the templates, occurs in the browsing phase and also in the publishing phase. Input: basic elements. Output: a package and/or rendered views. . Distribute (9) Browsing, Publishing. The rendition of view is currently done through a standard web browser, or the instrumented video player integrated into the prototype. Hardman et al. (2008) identified a small set of canonical processes and specified their inputs and outputs, but deliberately do not specify their inner workings, concentrating rather on the information flow between them. Indicative examples of invoking canonical processes are given in (Aubert et al., 2008). Currently, many standards facilitate the exchange between the different media process stages (Pereira et al., 2008), such as MXF (Media Exchange Format), AAF (Advance Authoring Format), MOS (Media Object Server Protocol), and Dublin Core. The process of documentary authoring can be arranged in three phases: modeling, annotation and authoring of documentary media. (1) The modeling phase identifies the various semantics that exist in the documentary media. (2) The annotation phase provides the human annotator the various utilities for the free text representation of their perception of the documentary. (3) The authoring phase is meant for the semiautomatic translation of the annotated media information into XML, validated by the XML Schema validation tools. Using XML technologies, the semantic multimedia content of the documentary can be represented in an interoperable way. It is a good idea to propose substantial customizations based on XML technologies for the documentaries. Thus, the produced item will be an XML document that represents the annotation of the real-time video documentary. Documentary information systems must accommodate these three phases, providing a common framework for the storage of the authored documentary and for its presentation interface. Documentary analysis tools should perform the interpretation of documentaries in the context of culture, mode of documentary, mode of speech, action, gestures and emotions. Existing tools and systems provide annotation features for the documentary videos often based on a particular type of documentary (Mateas, 2000). In addition, they offer a limited number of annotation facilities, thus it becomes difficult to derive generic facilities. These tools do not provide semiautomatic authoring, which is an important requirement. It is worth mentioning that Bocconi et al. (2008) describe a model for automatically generating video documentaries. This allows viewers to specify the subject and the point of view of the documentary to be generated. However, the domain of Bocconi et al. is matter-of opinion documentaries based on interview. Agius and Angelides (2005) proposed the COSMOS-7 system that models the objects along with a set of events in which the objects participate, as well as events along with a set of objects and temporal relationships between the objects. This system/model Documentary media objects 727
  • 8. represents the events at a higher level only like speak, play, listen and not at the level of actions, gestures and movements. Harry and Angelides (2001) proposed a semantic content-based model for semantic-level querying that makes full use of the explicit media structure, objects, spatial relationships between objects, events and actions involving objects, temporal relationships between events and actions, and integration between syntactic and semantic information. Ramadoss and Rajkumar (2007) considered a system for the semiautomatic annotation of an audio-visual media of dance domain, while Nack and Putz (2004) presented a framework for the creation, manipulation, and archiving/retrieval of media documents, applied for the domain of News. In the digital games and entertainment industry, Burger (2008) stressed the importance of the use of formal semantics (ontologies) by providing a potential solution based on semantic technologies. AKTive Media (Chakravarthy et al., 2006) is an ontology-based cross-media annotation (images and text) system. It includes an automatic process of annotation by suggesting knowledge to the user in an interactive way while the user is annotating. This system actively works in the background, interacting with web services and queries the central annotational store to look for context specific knowledge. Chakravarthy et al. (2009) present OntoFilm, a core ontology for film production. OntoFilm provides a standardized model, which conceptualizes the domain and workflows used at various stages of the film production process starting from pre-production and planning, shooting on set, right through to editing and post-production. In this paper, we propose a documentary video framework in order to incorporate media semantics for documentaries. This framework provides the XML authored content of the documentary from the supplied semantic and semiotic annotations by the human annotators. The proposed requirements are: (1) A layer oriented model depicting the documentary pieces as events, which incorporates the gesture, actions and spatial-temporal relationships of the subjects (e.g. documentarists) and objects in a documentary. Besides documentary pieces, other examples for events are setup, background scene change, role change by a documentarist. (2) A semantic network representing the documentary, the individual documentary pieces, besides the cognitive aspects, setting, cultural features and story. (3) An annotation tool for the documentary experts to manually perform the semantic and semiotic annotations of the documentary media objects like documentary, documentarists etc. (4) A semantic querying tool for the documentary experts and users/spectators to browse and query the documentary media features for designing new documentary sequences. Some examples of documentary media or video queries are: . show me all the pieces of natural history documentaries from Africa; . tell me all documentary pieces where documentarist is in danger; and . find all historical documentary pieces representing the invasion of Normandy etc. The query engine should be assisted by proper representations so that the retrieved result achieves high precision and high recall. EL 30,5 728
  • 9. 3. The semantics of documentary media The spatial-temporal delivery of a sequence of the documentary pieces is recorded in a documentary video, in which each documentary piece consists of a set of subject’s actions. Each subject action denotes the action of the characters, such as commentarist, speaker, interviewee etc. The action is represented as , subject-verb-object-adverb . using verb-argument structure (Sarkar and Tripasai, 2002) that exists in Linguistics. This section explains some of the characteristics of documentary media briefly. Definition 3.1 (Documentary) The documentary numbered i DCi;n À Á consists of a set of documentary video clips Ci;j À Á performed at a particular setting. That is, DCi;n ¼ Ci;1; Ci;2; . . . ; Ci;n È É where n is the total number of documentary clips. In this sense, the documentary DC2;3 ¼ C2;1; C2;2 C2;3 È É denotes the second documentary that consists of three documentary clips C2;1; C2;2; C2;3 À Á . For example, if the second documentary DC2;3 is a travel documentary and is presenting Holidays in Greece, then the three video clips could be C2,1 ¼ Arriving at the airport of Athens, C2,2 ¼ Touring Athens and C2,3 ¼ Cruise in the Rodos island. Definition 3.2 (Documentary Clip) A documentary clip Ci;j of the documentary DCi;n consists of a set of documentary pieces (DP) that are performed by the documentarists. That is, Ci;j;m ¼ DPi;j1; DPi;j2; . . . ; DPi;jm È É where m is the total number of documentary pieces. For example, the documentary clip C2;3;7 ¼ DP2;3;1; DP2;3;2; DP2;3;3; DP2;3;4; DP2;3;5; DP2;3;6; DP2;3;7 È É denotes the third video clip (in our example Cruise in the Greek islands) of the second documentary. This clip includes seven documentary pieces: DP2,3,1, DP2,3,2, DP2,3,3, DP2,3,4, DP2,3,5, DP2,3,6, DP2,3,7. Definition 3.3 (Documentary Piece) A documentary piece is the basic semantic unit of a documentary, which has a set of subject’s actions that are performed either sequentially or concurrently by the subjects (documentarists). It encapsulates the mood, genre, culture, and characters, apart from the actions. A documentary piece DPi;j;k À Á of the video clip Ci, j represents a meaningful sequence of subject’s (documentarist) actions (A). DPi;j;k ¼ A1; A2; . . .Akf g where k is the total number of subject’s actions in this documentary piece. For example, the documentary piece DP2;3;4 ¼ A1; A2; A3; A4f g denotes that piece of the third video clip that (belongs to the second documentary) includes the first four sequential actions A1; A2; A3; A4 À Á performed by the subject (documentarist). In our example, these actions could be: A1: “The documentarist is visiting the main attractions of the Rodos island in Greece”. A2: “The documentarist is taking a swim”. A3: “The documentarist is participating in the local festival”. A4: “The documentarist is taking a taste of Rodos nightlife”. Documentary media objects 729
  • 10. Definition 3.4 (Subject’s (documentarist) action) The subject/documentarist’s action (A) is represented by an action of a character and is defined as a tuple, , Agent-Action-Target-Speed . where agent and target are the body-parts of the subject/object, action represents the static poses and gestures in the universe of actions and speed denotes the speed of the delivery of the actions, that is speed ¼ (low, medium, fast, gradual ascending, gradual descending). If only one agent involves in an action, then it is called primitive action. That is, the target agent is empty or Nil. For example, , documentaristi.larm move- nil-fast . shows that documentarist i moves his left arm fast. If multiple agents involve in an action or gesture, then the action is known as composite action. For instance, , Documentaristi.rhand – touch – gorillaj.head – low . denotes that documentarist i touches the head of gorilla j slowly with his right hand. The content representational structures for these documentary media semantics are discussed in following sections. 4. The architecture for authoring and querying documentaries The proposed system (shown in Figure 2) provides an environment supporting the annotation, authoring, archiving and querying of the documentary media objects. The aim is to apply the framework to all sorts of documentary types such as natural history documentary, travel documentary etc. The environment is based on various modules: annotation, archival, querying, representation structures and the underlying database. The documentary experts access each of these modules to carryout their specific tasks. It is essential for our developments that these modules need to be easy and simple for use, thereby minimizing the complexity of acquaintance with the system. The annotation module takes the raw digital video as input and allows the human annotator to annotate the different documentary media objects. The generated annotations are described in the representational structures such as linked lists and hash tables. The authoring module takes the annotations representing the documentary sequence and translates them into XML instances automatically. The XML Schema instances that are instantiated by the authoring module are stored in the back-end database. The query-processing module allows the documentary experts to pose the different free-text documentary video queries to the XML annotation, performs search using XQuery (after stemming, Figure 2. The architecture of the proposed system EL 30,5 730
  • 11. removing the stop words and converting the tokens into XQuery form) and returns the results of these queries back to the users. Based on the observation, we have identified a set of required data structures and the associated relations and have developed tools for accomplishing the documentary video tasks. Figures 3-5 depict the annotation, query and semantic annotation processes correspondingly. Figure 5. The semantic annotation process in a UML class diagram Figure 4. The query process in a UML class diagram Figure 3. The annotation process in a UML class diagram Documentary media objects 731
  • 12. 5. The model of semantics for documentary media According to Nack and Putz (2004) annotation is a dynamic and iterative process, and thus annotations should be incomplete and change over time. Consequently, it is imperative to provide semantic representation schemes with the capability to change and grow. In addition, the relation between the different types of structures should be flexible and dynamic. To achieve this, media annotation should not result to a monolithic document, rather it should be organized as a semantic network of content description documents (Ramadoss and Rajkumar, 2007). 5.1 Layer oriented event description In the design of the proposed system, we adopted the strata-oriented approach (Aguierre Smith and Davenport, 1992) and setting (Parkes, 1989) for describing the events such as documentary pieces. Strata oriented content modeling is an important knowledge representation method and more suitable to model the events of the documentary presentation. In our framework, each video documentary is technically described using the size, duration, technical format of the material such as such as mpg, avi etc. Therefore, each documentary can be represented partially using technical details that belong to the layer of technical details. In addition, each video documentary is conceptually annotated using high-level semantic descriptors and thus it can be complementarily represented using such semantic descriptors that belong to the layer of semantic annotations. The connection between the different layers is accomplished by a triple , media identifier, start time, end time . . The proposed representation structure includes many layers (one layer for each description). The triple identifier is applied in order to be achieved the connection between the different layers and the data to be described (e.g. the actual audio, video, or audio visual stream). For instance, a documentarist may perform a number of actions in the same time span. Start and end time can be used to identify the temporal relation between the actions. Documentary pieces can be represented in this way, thereby enabling semantic retrieval. Figure 6 depicts the layered representation of a shot of 100 frames, representing three actions. Suppose a query “find a documentary piece of a natural history documentary from Africa, where documentarist is speaking and touching a gorilla, while gorilla is eating a banana”. This question can be easily retrieved by isolating the common parts of the shot as depicted in shaded portion of Figure 6. The temporal relationship between them can be identified using the start and end point with which those actions are associated. In this way, complex structured behavior concepts can be represented and hence the audio-visual material retrieved on this basis. Figure 6. Layered annotation of actions and isolated segment of a shot a query EL 30,5 732
  • 13. 5.2 Nodes of the proposed framework Nodes are used to build linked data structures concerning documentaries. Each node contains some data and possibly links to other nodes. A node can be thought of as a logical placeholder for some data. It is a memory block, which contains some data unit and perhaps references to other nodes, which in turn contain data and perhaps references to yet more nodes. Links between nodes are implemented by pointers or references. By forming chains of interlinked nodes, very large and complex data structures concerning documentaries can be formed. As a consequence, semantic structures of documentary’s pieces can be implemented easily. In our framework, we distinguish two types of nodes, i.e. data nodes (D-nodes) and conceptual annotation nodes (CA-nodes): (1) A D-node represents physical audio-visual material of any media type, such as text, audio, video, 3D animation, 2D image, 3D image, and graphic. The size, duration, and technical format of the material is not restricted, nor are any limitations present with respect to the content, i.e. number of persons, actions and objects. A data node might contain a complete documentary film or merely a scene. The identification of the node is realised via a URI. (2) A CA-node provides high-level descriptions of a video documentary. A high-level description is one that describes “top-level” goals, overall features of a documentary, is more abstracted, and is typically more concerned with the video documentary as a whole, and its goals. For example, the events occur in a documentary (as well as the location, date and time of an event) can be described by high-level descriptors. The mood (e.g. subjective content-happy, sorrow, romantic etc) of a documentary and so many other features can also be described by high-level descriptors. Such descriptors are usually difficult to retrieve using automatic extraction methods. This type of nodes is usually created manually. Each node is best understood as an instantiated schema. The available number of node schemata is restricted, thus indexing and classification can be performed in a controlled way, whereas the number of provided nodes in the descriptional information space might consist of just one node or up to n nodes. The obvious choice for representing CA-nodes, each of them describing audiovisual content, would have been using the DDL of MPEG-7 or suggested schemata by MPEG-7. The MPEG-7 standard (Martinez et al., 2002; Salembier and Smith, 2002) concentrates on multimedia content description and constitutes the greatest effort for multimedia description. It is based on a set of XML Schemas that define 1,182 elements, 417 attributes and 377 complex types. It is divided into four main components: (1) the Description Definition Language (DDL, the basic building blocks for the MPEG-7 metadata language); (2) audio (the descriptive elements for audio); (3) visual (those for video); and (4) the Multimedia Description Schemes (MDS, the descriptors for capturing the semantic aspects of multimedia contents, e.g. places, documentarists, objects, events, etc). Documentary media objects 733
We did not choose MPEG-7 because its main weakness is that formal semantics are not included in the definition of the descriptors in a way that can easily be implemented in a system (Nack et al., 2005). Moreover, the use of XML technologies implies that a great part of the semantics remains implicit, so each time an application is developed the semantics must be extracted from the standard and re-implemented. We therefore chose XML Schema as the representational scheme for documentary media, owing to its simplicity and maturity. For our documentary media environment, we have developed a set of 14 schemata that describe the denotative and technical content of documentary video. The schemata are designed in such a way that they can be semi-automatically instantiated or authored. They are shown in Table II, and their XML Schema representations can be found in Subsection 5.4. With these schemata, one can browse (e.g. by documentary, action, documentarist, documentary piece, culture or object) and perform semantic search (e.g. "show me all natural history documentary pieces").

Table II. Schemata for documentaries
Documentary: high-level organizational scheme of a documentary presentation containing all documentary clips
Documentary Clip: high-level scheme of a documentary consisting of all annotations and relations to other clips
Documentary Piece: an event representing a meaningful collection of the actions of documentarists
Subject/Documentarist's Action: the basic pose, gesture or action done by the documentarist
Event: the event that occurs in a documentary clip
Person: person participating in a documentary, e.g. documentarists, interviewees, narrators, speakers
Emotion: subjective content, such as mood or feeling
Setting: the location, date and time of an event
LifeSpan: duration with start and end times
Relation: relation between documentary media elements
STRelation: spatial-temporal relationships of the documentarist
Link: connections between the media source and the document schemes
Resource: relation to any URI address
Basic Info: basic information about the documentary, such as language, video type, recording information, archive information and access rights
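To convey the character of these schemata before turning to the relation machinery, the following minimal sketch shows how two of the simpler schemes of Table II, LifeSpan and Setting, might look in XML Schema; the type names and structure are simplified for illustration, the authoritative listings being those of Subsection 5.4.

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <!-- LifeSpan: a duration with start and end times, expressed as offsets into the media -->
    <xsd:complexType name="LifeSpanType">
      <xsd:attribute name="start" type="xsd:duration" use="required"/>
      <xsd:attribute name="end" type="xsd:duration" use="required"/>
    </xsd:complexType>
    <!-- Setting: the location, date and time of an event -->
    <xsd:complexType name="SettingType">
      <xsd:sequence>
        <xsd:element name="Location" type="xsd:string"/>
        <xsd:element name="Date" type="xsd:date"/>
        <xsd:element name="Time" type="xsd:time" minOccurs="0"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:schema>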
5.3 Relationships
In our framework, all metadata about the actual audio and video streams of a documentary are organized in the form of a semantic network. A semantic network represents semantic relations among concepts; it is often used as a form of knowledge representation and is a directed or undirected graph whose vertices represent concepts and whose edges represent the relations between them. Figure 7 depicts a possible semantic net of a documentary annotation. The figure also illustrates the two ways of annotating documentary data, depending on the requirements of the documentary expert:
(1) either as part of a documentary; or
(2) as a single documentary clip representing one documentary.

Figure 7. A semantic net of a documentary annotation

The annotation networks of a documentary, a clip, a documentary piece and a media source can be interconnected through links and relations. There are two types of connections among the nodes:
(1) Link type: connects a media source to a description node (represented by an arrow).
(2) Relation type: connects different annotation nodes (represented by a line).
A link connects the media source (audio and video files) to the data node along with its life spans (i.e. on a temporal level). The XML Schema representation of the Link type is given in Subsection 5.4.

5.4 Description schemes for documentaries in XML Schema
The XML Schema representations of the link and relation types are presented hereafter (Figures 8-10). In our environment, DocumentaryDS and DocumentaryClipDS hold link types, enabling connections to the documentary video and audio sources. Note that these two description schemes serve as entry points to the semantic network. Our front-end annotation tool performs the semi-automatic instantiation of links. Relation types perform the connection among the description schemes that are represented as CA-nodes. Between two nodes there may exist up to m relationships, and we define the following relations for our documentary media environment:
. For events: follows, precedes.
. For character, setting and object: part of, association, before, equal, meets, overlaps, during, starts, finishes.
. For documentary pieces: two temporal semantic relationships, follows and precedes, which help to infer the type of documentary during query processing.
In our environment, relationships are instantiated semi-automatically by the tool.
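Since the actual listings appear as Figures 8-10, the following minimal sketch merely suggests how the Link and Relation types might be declared; the element, attribute and type names are ours for illustration (LifeSpanType is the type sketched after Table II), not the definitive schema.

  <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <!-- a Link anchors a description node to its media source over one or more life spans -->
    <xsd:complexType name="LinkType">
      <xsd:sequence>
        <xsd:element name="LifeSpan" type="LifeSpanType" maxOccurs="unbounded"/>
      </xsd:sequence>
      <xsd:attribute name="mediaSource" type="xsd:anyURI" use="required"/>
    </xsd:complexType>
    <!-- a Relation is a typed edge between two CA-nodes of the semantic network -->
    <xsd:complexType name="RelationType">
      <xsd:attribute name="source" type="xsd:IDREF" use="required"/>
      <xsd:attribute name="target" type="xsd:IDREF" use="required"/>
      <xsd:attribute name="name" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="follows"/>
            <xsd:enumeration value="precedes"/>
            <xsd:enumeration value="partOf"/>
            <xsd:enumeration value="association"/>
            <xsd:enumeration value="overlaps"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:schema>

A Link instance thus anchors a description to the audio-visual data on a temporal level, while a Relation instance carries the typed edges (follows, precedes, part of, etc.) that structure the semantic network.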
We now introduce our documentary annotation and querying tool, which instantiates the description schemes designed on the basis of the semantic net concepts above, and our search engine, which allows users to browse and query documentary features for composing new documentaries and for learning purposes.
6. Tools for documentaries
6.1 Annotation and authoring tool
Documentary experts can annotate a documentary or clip by watching the running video and using the annotation tool. The video player provides all the standard facilities, such as play, start, stop, pause and replay. We used the Cinepak codec for the conversion of the running video (a WinAmp media file) to AVI format. The annotation tool provides documentary experts with the facility to annotate documentary pieces using free text and a controlled vocabulary, independently of the storage organization of the generated annotations. We developed the annotation tool using J2SE 1.5 and the Java Media Framework 2.0. Figure 11 depicts the GUI of the initial screen for determining the documentary information. It is noteworthy that a documentary or a documentary clip constitutes an entry point to the annotation. The annotation process begins with the documentary expert describing the metadata about the documentary. The basic metadata (descriptions) that are common to all documentaries are shown in Table III. Once the annotation of the documentary has been completed, the documentary expert can describe the individual documentary presentations that are part of that documentary. We have identified a set of features that correspond to a documentary clip, as depicted in Table IV. The metadata describing a documentary piece that can be annotated through the annotation tool are given in Table V, and the metadata about persons, objects and basic media info are shown in Tables VI-VIII, respectively.

Figure 11. A snapshot of the annotation tool for determining the documentary information
The semi-automated editing suite (Figure 12) provides the documentary expert with an instant overview of the available material and its essential relations, represented through the spatial order of its presentation. The documentary expert can mark the relevant video clips or pieces by pointing at the preferred clips or pieces; the order of pointing indicates the sequential appearance of the clips or pieces. The editing suite, based on a simple planner, performs an automated composition of the documentary clip. At the present stage of development, the editing suite uses the meta-information obtained from the annotation tool to support the video editing process.

Figure 12. The semi-automated editing suite for documentary clips

Table III. Metadata of a documentary
Date and time: date and time of video recording of the documentary
Media locator: links to the video and audio streams
Media format: format of the video, such as mpg, avi etc.
Media type: type of the media, such as video, audio, text etc.
Title: name of the documentary
Origin: originating country of the documentary
Duration: life span, i.e. the length of the documentary in minutes

Table IV. Metadata of a documentary clip
Character (name, role, gender, life span): role played by the documentarist, such as commentator, presenter etc. The life span of the character is necessary because the same documentarist may play several roles in a documentary clip
Context: identifies whether it is a historical documentary, a travel documentary, a documentary without words etc.
Documentary genre: such as poetic, expository, observational, participatory, reflexive, performative
Language: language used by the documentarists in the audio; several languages may be used in the same documentary
Life span: duration of the documentary clip

Table V. Metadata of a documentary piece
MoodID: subjective content, e.g. happy, sorrow, romantic
Culture: Indian, western, etc. documentary pieces
Genre: such as poetic, expository, observational, participatory, reflexive, performative
Mode of documentary speech: commentary speech, presenter speech, interview speech in shot, overheard interchange, dramatic dialogue
Object: background and foreground objects used in a documentary piece
Action: spatial-temporal actions, gestures and poses of the characters
Agent: body parts involved
Related action: associated action
Target: target body part of the opponent, if any
Speed: slow, medium, fast, gradual ascending, gradual descending
Life span: duration of the documentary piece
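To illustrate, an annotation produced by the tool for a single documentary piece might be serialized along the following lines; the element names and values are invented for the example and follow the feature set of Table V rather than the literal schema listings.

  <DocumentaryPiece id="piece-042">
    <MoodID>happy</MoodID>
    <Culture>western</Culture>
    <Genre>observational</Genre>
    <SpeechMode>interview speech in shot</SpeechMode>
    <Object type="foreground">gorilla</Object>
    <Action agent="hand" relatedAction="speaking" speed="slow">touching</Action>
    <!-- offsets into the parent documentary clip -->
    <LifeSpan start="PT0M20S" end="PT1M0S"/>
  </DocumentaryPiece>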
6.2 Search engine
The search engine enables documentary experts to design a new documentary and users to view the documentary pieces themselves. In particular, users can search along many dimensions for specific documentary pieces belonging to a video clip. For example, users can search for all documentary pieces denoting specific objects, such as the sun or the moon. In addition, users can search for certain subject actions incorporated into documentary pieces, or for documentary pieces in which the subject (e.g. the documentarist) has a certain mood (happy, angered etc.). In another case, users can search for documentary pieces in which the delivery speed of the subject's actions is slow, medium, fast, gradually ascending or gradually descending. Users can also search for documentary pieces in which a specific song is played. Finally, the search engine can be used both as a browsing tool, with several built-in categories of documentary information, and as a query tool for posing free-text documentary queries.

Table VI. Metadata of persons
Name: name of the person
Function: commentator, speaker, interviewee
E-mail: contact details

Table VII. Metadata of objects
Name: name of the background or foreground object
Type: background or foreground object
Number: number of objects
Shape: shape of the object (in text)
Color: color of the object (in text)
Texture: pattern

Table VIII. Metadata of media
Recording speed: speed of recording
Camera details: description of the camera used while recording the documentary
Access rights: access information

The retrieval tool offers several browsing features for the users. These are:
Documentary: to browse all documentary clips along with the video of their documentary pieces; output is rendered in the output window.
Documentary clip: to view all documentary pieces of a clip.
Documentary piece: to view all subject/documentarist actions of a particular clip.
Objects: to display all documentary pieces denoting the sun, the moon, etc.
Tempo: to browse the documentary pieces according to the speed categories.
Mood: to browse according to feeling, such as happy, romantic, etc.
Culture: Indian, western, etc.
Documentarist: all documentary pieces in which a particular documentarist takes part.
Genre: poetic, expository, observational, participatory, reflexive, performative, etc.
Speech mode: commentary speech, presenter speech, interview speech in shot, overheard interchange, dramatic dialogue.
Actions: to view by specific actions.
Song: to view the documentary pieces of a song.
Documentary users/spectators can submit their documentary queries in the query window as free-text keywords. For example, consider the query Q: "Show me all pieces of natural history documentaries". Our framework uses a semantic information retrieval mechanism similar to that presented in Chen et al. (2010). The use of semantic information, especially that derived from spatio-temporal analysis, is of great value in multimedia annotation, archiving and retrieval. Ren et al. (2009) survey the use of spatio-temporal semantic knowledge for information-based video retrieval and draw important conclusions on where future research is headed. Liu and Chen (2009) present a novel framework for content-based video retrieval. They use an unsupervised learning
method to automatically discover and locate the object of interest in a video clip. This unsupervised learning algorithm alleviates the need to train a large number of object recognizers. Regional image characteristics are extracted from the object of interest to form a set of descriptors for each video, and a novel ensemble-based matching algorithm compares the similarity between two videos on the basis of the sets of descriptors they contain. Videos containing large pose, size and lighting variations are used to validate their approach. Finally, Chen et al. (2010) developed a semantic-enabled information retrieval mechanism that handles the processing, recognition, extraction, extension and matching of content semantics, with the following objectives:
. to analyze and determine the semantic features of content, to develop a semantic pattern that represents the semantic features of the content, and to structuralize and materialize semantic features;
. to analyze the user's query and extend its implied semantics through semantic extension, so as to identify more semantic features for matching; and
. to generate contents with approximate semantics by matching against the extended query, so as to provide correct contents to the querist.
This mechanism alleviates the traditional problems of keyword search and enables the user to perform semantic-based queries and searches for the required information, thereby improving the reuse and sharing of information.

7. Future work: an ontology for video documentaries
Multimedia ontologies (especially MPEG-7-based ontologies) have the potential to increase the interoperability of applications producing and consuming multimedia annotations. Hunter (2003) provided the first attempt to model parts of MPEG-7 in RDFS, later integrated with the ABC model. Tsinaraki et al. (2004) start from the core of this ontology and extend it to cover the full Multimedia Description Scheme (MDS) part of MPEG-7 in an OWL DL ontology. Isaac and Troncy (2004) proposed a core audio-visual ontology inspired by several terminologies, such as MPEG-7, TV-Anytime and ProgramGuideML, while Garcia and Celma (2005) produced the first complete MPEG-7 ontology, automatically generated using a generic mapping from XSD to OWL. All these methods perform a one-to-one translation of MPEG-7 types into OWL concepts and properties. This translation, however, does not guarantee that the intended semantics of MPEG-7 is fully captured and formalized; on the contrary, the syntactic interoperability and conceptual ambiguity problems remain. A video documentary ontology can increase the interoperability of documentary authoring tools; it can represent documentary concepts and the relationships among them that help to retrieve the required results. From another perspective, the application of multimedia reasoning techniques on top of semantic multimedia annotations can make a multimedia authoring application more intelligent (Van Ossenbruggen et al., 2004). Currently, we are engaged in representing the complete media semantics of a documentary using the Web Ontology Language (OWL) (Smith et al., 2004), with the aim of defining a video documentary ontology. In the near future, we will examine how we can raise the quality of documentary annotation and improve the usability of content-based video search and retrieval systems. Figure 13 depicts a portion of our ontology for documentaries.

Figure 13. A part of the domain ontology for documentaries
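Although the ontology is still work in progress, a small fragment of what it might look like in OWL (RDF/XML) is sketched below; the class and property names are our illustrative reading of the schemata in Table II, not the final vocabulary.

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xml:base="http://example.org/documentary">
    <owl:Class rdf:ID="Documentary"/>
    <owl:Class rdf:ID="DocumentaryClip"/>
    <owl:Class rdf:ID="DocumentaryPiece"/>
    <!-- a documentary is composed of clips; a clip is composed of pieces -->
    <owl:ObjectProperty rdf:ID="hasClip">
      <rdfs:domain rdf:resource="#Documentary"/>
      <rdfs:range rdf:resource="#DocumentaryClip"/>
    </owl:ObjectProperty>
    <owl:ObjectProperty rdf:ID="hasPiece">
      <rdfs:domain rdf:resource="#DocumentaryClip"/>
      <rdfs:range rdf:resource="#DocumentaryPiece"/>
    </owl:ObjectProperty>
    <!-- the temporal semantic relationships of Subsection 5.4 -->
    <owl:ObjectProperty rdf:ID="follows">
      <rdfs:domain rdf:resource="#DocumentaryPiece"/>
      <rdfs:range rdf:resource="#DocumentaryPiece"/>
      <owl:inverseOf rdf:resource="#precedes"/>
    </owl:ObjectProperty>
    <owl:ObjectProperty rdf:ID="precedes"/>
  </rdf:RDF>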
8. Conclusions
Tools for automatically understanding video are required in the documentary domain. Semantics-based annotations will break the traditional linear manner of accessing and browsing documentaries and will support vignette-oriented access to audio and video. In this paper, we have presented a framework for the modeling, annotation and retrieval of media documents, applied to the documentary domain. Using a basic set of 14 semantic description schemes, we demonstrated how a documentary video can be annotated and how this information can be used for retrieval in support of documentary design. We emphasized tools and technologies for the manual annotation of documentary media objects. Flexible annotation facilities, organized as semantic networks, are needed to support documentary creativity, because the annotation process is dynamic and annotations can grow over time. We have proposed a flexible organization of media content description and the related media data; this organization requires an adaptable construction in the form of a semantic network. The proposed concept features three significant functions, which make it suitable as a platform for supporting the needs of documentary production:
(1) It provides semantic and technical memory structures (i.e. information nodes) with the capability to change and grow, allowing an ongoing task-specific process of inspection and interpretation of source material.
(2) It facilitates the dynamic use of audio-visual material using links, enabling the connection from multi-layered information nodes to data on a temporal, spatial and spatial-temporal level. Moreover, since the description of media content holds constant for the associated time interval, we are in a position to handle multiple content descriptions for the same media unit and also to handle gaps.
(3) It enables the semantic connection between information nodes using typed relations, thus structuring the information space on a semantic as well as a syntactic level.
We believe that our approach (audio-visual strategy) can be used to improve training and education in documentary communication, and to this end we have also indicated future efforts to create an ontology for video documentaries with enhanced annotation.
References
Agius, H. and Angelides, M. (2005), "COSMOS-7: video-oriented MPEG-7 scheme for modeling and filtering of semantic content", The Computer Journal, Vol. 48 No. 5, pp. 545-62.
Aguierre Smith, T.G. and Davenport, G. (1992), "The stratification system: a design environment for random access video", Proceedings of the ACM Workshop on Networking and Operating System Support for Digital Audio and Video, San Diego, CA, Lecture Notes in Computer Science, Vol. 712, Springer, Berlin, pp. 250-61.
Aubert, O., Champin, P.-A., Prié, Y. and Richard, B. (2008), "Canonical processes in active reading and hypervideo production", Multimedia Systems Journal, Vol. 14 No. 6, pp. 427-33.
Barnouw, E. (1993), Documentary: A History of the Non-fiction Film, Oxford University Press, Oxford.
Benitez, A., Paek, S., Chang, S.-F., Puri, A., Huang, Q., Smith, J., Li, C.-S., Bergman, L. and Judice, C. (2000), "Object-based multimedia content description schemes and applications for MPEG-7", Signal Processing: Image Communication, Vol. 16 Nos 1/2, pp. 235-69.
Bocconi, S., Nack, F. and Hardman, L. (2008), "Automatic generation of matter-of-opinion video documentaries", Journal of Web Semantics, Vol. 6, pp. 139-50.
Brachman, R.J. and Levesque, H.J. (1983), Readings in Knowledge Representation, Morgan Kaufmann, San Mateo, CA.
Burger, T. (2008), "The need for formalizing media semantics in the games and entertainment industry", Journal of Universal Computer Science, Vol. 14 No. 10, pp. 1775-91.
Chakravarthy, A., Ciravegna, F. and Lanfranchi, V. (2006), "Cross-media document annotation and enrichment", Proceedings of the 1st Semantic Authoring and Annotation Workshop (SAAW 2006), Athens, GA, 6 November.
Chakravarthy, A., Beales, R., Matskanis, N. and Yang, X. (2009), "OntoFilm: a core ontology for film production", in Chua, T.-S., Kompatsiaris, Y., Mérialdo, B., Haas, W., Thallinger, G. and Bailer, W. (Eds), Proceedings of the 4th International Conference on Semantic and Digital Media Technologies (SAMT 2009), Lecture Notes in Computer Science, Vol. 5887, Springer, Berlin, pp. 177-81.
Chen, M.-Y., Chu, H.-C. and Chen, Y.M. (2010), "Developing a semantic-enable information retrieval mechanism", Expert Systems with Applications, Vol. 37 No. 1, pp. 322-40.
Choi, I. (2010), "From tradition to emerging practice: a hybrid computational production model for interactive documentary", Entertainment Computing, Vol. 1 Nos 3/4, pp. 105-17.
Costa, M., Correia, N. and Guimaraes, N. (2002), "Annotations as multiple perspectives of video content", Proceedings of the ACM Conference on Multimedia, San Francisco, CA, 2-7 November, pp. 283-6.
Dorai, C. and Venkatesh, S. (2001), "Computational media aesthetics: finding meaning beautiful", IEEE Multimedia, Vol. 8 No. 4, pp. 10-12.
Dorai, C., Mauthe, A., Nack, F., Rutledge, L., Sikora, T. and Zettl, H. (2002), "Media semantics: who needs it and why?", Proceedings of Multimedia '02, Juan-les-Pins, 1-6 December, pp. 580-3.
Eco, U. (1997), A Theory of Semiotics, Macmillan, London.
Garcia, R. and Celma, O. (2005), "Semantic integration and retrieval of multimedia metadata", Proceedings of the Fifth International Workshop on Knowledge Markup and Semantic Annotation, Galway, 7 November.
Geurts, J., Bocconi, S., van Ossenbruggen, J. and Hardman, L. (2003), "Towards ontology-driven discourse: from semantic graphs to multimedia presentations", in Fensel, D., Sycara, K. and Mylopoulos, J. (Eds), Proceedings of the Second International Semantic Web Conference (ISWC 2003), Sanibel Island, FL, 20-23 October, Springer, Berlin.
Hardman, L., Obrenović, Ž., Nack, F., Kerhervé, B. and Piersol, K. (2008), "Canonical processes of semantically annotated media production", Multimedia Systems, Vol. 14, pp. 327-40.
Harry, W.A. and Angelides, M.C. (2001), "Modeling content for semantic level querying of multimedia", Multimedia Tools and Applications, Vol. 15 No. 1, pp. 5-37.
Hunter, J. (2003), "Enhancing the semantic interoperability of multimedia through a core ontology", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13 No. 1, pp. 49-58.
IBM (n.d.), "alphaWorks community, VideoAnnEx annotation tool", available at: www.alphaworks.ibm.com/tech/videoannex
IMDb (n.d.), "World Travels", available at: www.imdb.com/title/tt1392723/
Isaac, A. and Troncy, R. (2004), "Designing and using an audio-visual description core ontology", paper presented at the Workshop on Core Ontologies in Ontology Engineering, Whittlebury, 5-8 October.
Korsakow (n.d.), "Korsakow system", available at: www.korsakow.com/ksy/index.html
Little, S., Geurts, J. and Hunter, J. (2002), "Dynamic generation of intelligent multimedia presentations through semantic inferencing", Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, Pontifical Gregorian University, Rome, Springer, Berlin.
Liu, D. and Chen, T. (2009), "Video retrieval based on object discovery", Computer Vision and Image Understanding, Vol. 113 No. 3, pp. 397-404.
Martinez, J., Koenen, R. and Pereira, F. (2002), "MPEG-7: the generic multimedia content description standard, Part 1", IEEE MultiMedia, Vol. 9 No. 2, pp. 78-87.
Mateas, M. (2000), "Generation of ideologically-biased historical documentaries", Proceedings of the 17th National Conference on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conference (AAAI-00), Austin, TX, pp. 36-42.
Murtaugh, M. (1996), "The automatist storytelling system", PhD thesis, Massachusetts Institute of Technology, available at: http://alumni.media.mit.edu/~murtaugh/thesis/
Nack, F. and Parkes, A. (1997), "Towards the automated editing of theme-oriented video sequences", Applied Artificial Intelligence, Vol. 11 No. 4, pp. 331-66.
Nack, F. and Putz, W. (2004), "Saying what it means: semi-automated (news) media annotation", Multimedia Tools and Applications, Vol. 22 No. 3, pp. 263-302.
Nack, F., van Ossenbruggen, J. and Hardman, L. (2005), "That obscure object of desire: multimedia metadata on the web (Part II)", IEEE Multimedia, Vol. 12 No. 1, pp. 54-63.
Nichols, B. (2001), "What types of documentary are there?", Introduction to Documentary, Indiana University Press, Bloomington, IN, pp. 99-138.
Parkes, A.P. (1989), "Settings and the settings structure: the description and automated propagation of networks for perusing videodisk image states", in Belkin, N.J. and van Rijsbergen, C.J. (Eds), Proceedings of SIG Information Retrieval '89, Cambridge, MA, ACM Press, New York, NY, pp. 229-38.
Pereira, F., Vetro, A. and Sikora, T. (2008), "Multimedia retrieval and delivery: essential metadata challenges and standards", Proceedings of the IEEE, Vol. 96 No. 4, pp. 721-44.
Ramadoss, B. and Rajkumar, K. (2007), "Semi-automated annotation and retrieval of dance media objects", Cybernetics and Systems, Vol. 38 No. 4, pp. 349-79.
Ren, W., Singh, S., Singh, M. and Zhu, Y.S. (2009), "State-of-the-art on spatio-temporal information-based video retrieval", Pattern Recognition, Vol. 42 No. 2, pp. 267-82.
Rijksmuseum (n.d.), available at: www.rijksmuseum.nl
Rincon, M. and Martinez-Cantos, J. (2007), "An annotation tool for video understanding", in Moreno-Díaz, R., Pichler, F. and Quesada Arencibia, A. (Eds), Proceedings of the 11th International Conference on Computer Aided Systems Theory and Technology (EUROCAST 2007), Las Palmas, 12-16 February, Lecture Notes in Computer Science, Vol. 4739, Springer, Berlin, pp. 701-8.
Rosenthal, A. and Corner, J. (2005), New Challenges for Documentary, 2nd ed., Manchester University Press, Manchester.
Ryu, J., Sohn, Y. and Kim, M. (2002), "MPEG-7 metadata authoring tool", Proceedings of the ACM Conference on Multimedia, pp. 267-70.
Salembier, P. and Smith, J. (2002), "Overview of MPEG-7 multimedia description schemes and schema tools", in Manjunath, B.S., Salembier, P. and Sikora, T. (Eds), Introduction to MPEG-7: Multimedia Content Description Interface, Wiley, Chichester.
Sarkar, A. and Tripasai, W. (2002), "Learning verb argument structure from minimally annotated corpora", Proceedings of the 19th International Conference on Computational Linguistics, Taipei, 24 August-1 September, Vol. 1, pp. 1-7.
Smith, M.K., Welty, C. and McGuinness, D.L. (2004), "OWL Web Ontology Language Guide, W3C Recommendation", available at: www.w3.org/TR/owl-guide/
Soft Cinema (n.d.), available at: www.softcinema.net
SourceForge.net (n.d.), "ViPER-GT annotation tool", available at: http://viper-toolkit.sourceforge.net
The Language Archive (n.d.), "ELAN annotation tool", available at: www.lat-mpi.eu/tools/elan
Tien, T.T. and Cecile, R. (2003), "Multimedia modeling using MPEG-7 for authoring multimedia integration", Proceedings of the ACM Conference on Multimedia Information Retrieval, pp. 171-8.
Tsinaraki, C., Polydoros, P. and Christodoulakis, S. (2004), "Integration of OWL ontologies in MPEG-7 and TVAnytime compliant semantic indexing", Proceedings of the 16th International Conference on Advanced Information Systems Engineering (CAiSE 2004), Riga, 7-11 June, pp. 143-61.
(The) University of Edinburgh (n.d.), "CAVIAR: Context Aware Vision using Image-based Active Recognition", available at: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
Van Ossenbruggen, J., Nack, F. and Hardman, L. (2004), "That obscure object of desire: multimedia metadata on the Web (Part I)", IEEE Multimedia, Vol. 11 No. 4, pp. 38-48.

About the author
Dimitris Kanellopoulos holds a PhD in multimedia communications from the Department of Electrical and Computer Engineering of the University of Patras, Greece. He is a member of the Educational Software Development Laboratory in the Department of Mathematics at the University of Patras. His research interests include multimedia communications, knowledge representation, intelligent systems and Web engineering. He has authored many papers in international journals and conferences in these areas, and he serves as a member of the editorial boards of ten academic journals. Dimitris Kanellopoulos can be contacted at: d_kan2006@yahoo.gr